Thank you for your help, @atripp!
Just to be sure... When we update to 2022.12, my changes in the procedure will be overwritten by yours, right? Or should we do something special during the update?
Well, it works :)
Before the fix, the error appeared in the logs almost every day. I applied the fix a week ago, and the error is gone.
Here are my changes in DockerImages_SetData:
-- UPDLOCK added
UPDATE [DockerImages] WITH (UPDLOCK)
SET [ContainerConfigBlob_Id] = @ContainerConfigBlob_Id,
    [ImageType_Code] = @ImageType_Code
WHERE [DockerImage_Id] = @DockerImage_Id
...
-- TABLOCKX removed
DELETE [DockerImageLayers]
WHERE [DockerImage_Id] = @DockerImage_Id
Hi, @atripp.
About the approach you suggested:
Moving DELETE [DockerImageLayers] ... to the top of the procedure seems wrong. Yes, it changes the order in which locks are taken and hopefully fixes the deadlock above. However, there is no guarantee that the conflicting procedure won't change its own order of taking locks and deadlock again, or that there are no other procedures that will start to conflict with the new order of operations.
I investigated the issue further, and it seems that I have found the right solution. The problem is with the TABLOCKX hint, and it seems that it can be safely removed. Why?
If the procedure contained only that DELETE [DockerImageLayers] and INSERT INTO [DockerImageLayers], the TABLOCKX would prevent the scenario where two transactions change layers of the same image concurrently. I'm still not sure that a serialization anomaly is actually possible in that case, but it looks suspicious, so TABLOCKX seems reasonable in that hypothetical case.
However, prior to that DELETE [DockerImageLayers] WITH (TABLOCKX) WHERE [DockerImage_Id] = @DockerImage_Id, we have this:
UPDATE [DockerImages]
SET [ContainerConfigBlob_Id] = @ContainerConfigBlob_Id,
[ImageType_Code] = @ImageType_Code
WHERE [DockerImage_Id] = @DockerImage_Id
That UPDATE statement, if executed, exclusively locks the image it updates. Concurrent transactions that change the same image will be blocked at that statement.
So there is no way for multiple transactions changing the same image to pass that barrier and reach DELETE [DockerImageLayers] ..., which means TABLOCKX is meaningless there.
The only thing I care about is this: will UPDATE [DockerImages] hold an exclusive lock on the image if the image actually hasn't changed (if ContainerConfigBlob_Id and ImageType_Code stay the same after the update)?
My experiments show that the row is still locked even if it's not actually changed. However, I think it's reasonable to add UPDLOCK (UPDATE [DockerImages] WITH (UPDLOCK) ...) to guarantee that the row stays locked until the end of the transaction.
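For reference, a minimal sketch of such an experiment (it assumes a [DockerImages] row with [DockerImage_Id] = 1 exists and should be run against a test copy of the database, not production):

-- A sketch only, not ProGet's code: perform a no-op update and then list the
-- locks this session still holds.
BEGIN TRANSACTION;

-- A no-op update: the value written is the value already stored.
UPDATE [DockerImages] WITH (UPDLOCK)
SET [ImageType_Code] = [ImageType_Code]
WHERE [DockerImage_Id] = 1;

-- The KEY lock on the updated row should still be listed here,
-- even though no column value actually changed.
SELECT resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID;

ROLLBACK TRANSACTION;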
I'll apply that solution (remove TABLOCKX and add UPDATE [DockerImages] WITH (UPDLOCK)) and let you know if it actually fixes the issue.
Hi @atripp,
We still haven't disabled the trigger, so I can't say if it leads to the issue.
But we caught some info about the deadlock. It occurs when we're uploading several images with mostly shared layers in parallel.
As you can see, DockerImages_SetData conflicts with DockerImages_GetImage:
<deadlock-list>
<deadlock victim="process183bc43d848">
<process-list>
<process id="process183bc43d848" taskpriority="0" logused="0" waitresource="KEY: 9:72057594064601088 (43a62080b7c0)" waittime="2614" ownerId="11261239428" transactionname="SELECT" lasttranstarted="2022-08-23T15:15:11.593" XDES="0x189a543bbe8" lockMode="S" schedulerid="3" kpid="3328" status="suspended" spid="398" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="2022-08-23T15:15:11.593" lastbatchcompleted="2022-08-23T15:15:11.593" lastattention="2022-08-23T15:15:11.593" clientapp="Core Microsoft SqlClient Data Provider" hostname="TM" hostpid="5720" loginname="ProGetUser" isolationlevel="read committed (2)" xactid="11261239428" currentdb="9" currentdbname="ProGet" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
<executionStack>
<frame procname="ProGet.dbo.DockerImages_GetImage" line="13" stmtstart="304" stmtend="1188" sqlhandle="0x030009000ec450743d0ba900dfae000001000000000000000000000000000000000000000000000000000000">
SELECT *
FROM [DockerImages_Extended]
WHERE [Feed_Id] = @Feed_Id
AND [Repository_Name] = @Repository_Name
AND [Image_Digest] = @TagOrDigest_Name
UNION
SELECT DI.*
FROM [DockerRepositoryTags] DRT
INNER JOIN [DockerImages_Extended] DI
ON DI.[DockerImage_Id] = DRT.[DockerImage_Id]
WHERE DI.[Feed_Id] = @Feed_Id
AND DI.[Repository_Name] = @Repository_Name
AND DRT.[Tag_Name] = @TagOrDigest_Nam </frame>
</executionStack>
<inputbuf>
Proc [Database Id = 9 Object Id = 1951450126] </inputbuf>
</process>
<process id="process183c2f8cca8" taskpriority="0" logused="736" waitresource="OBJECT: 9:1257107569:0 " waittime="2616" ownerId="11261239217" transactionname="user_transaction" lasttranstarted="2022-08-23T15:15:11.580" XDES="0x18865dd4428" lockMode="X" schedulerid="1" kpid="10636" status="suspended" spid="85" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2022-08-23T15:15:11.580" lastbatchcompleted="2022-08-23T15:15:11.580" lastattention="2022-08-23T15:15:11.573" clientapp="Core Microsoft SqlClient Data Provider" hostname="TM" hostpid="5720" loginname="ProGetUser" isolationlevel="read committed (2)" xactid="11261239217" currentdb="9" currentdbname="ProGet" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
<executionStack>
<frame procname="ProGet.dbo.DockerImages_SetData" line="74" stmtstart="4832" stmtend="5000" sqlhandle="0x03000900d72b8704a10ba900dfae000001000000000000000000000000000000000000000000000000000000">
DELETE [DockerImageLayers] WITH (TABLOCKX)
WHERE [DockerImage_Id] = @DockerImage_I </frame>
</executionStack>
<inputbuf>
Proc [Database Id = 9 Object Id = 75967447] </inputbuf>
</process>
</process-list>
<resource-list>
<keylock hobtid="72057594064601088" dbid="9" objectname="ProGet.dbo.DockerImages" indexname="PK__DockerImages" id="lock1830340c380" mode="X" associatedObjectId="72057594064601088">
<owner-list>
<owner id="process183c2f8cca8" mode="X"/>
</owner-list>
<waiter-list>
<waiter id="process183bc43d848" mode="S" requestType="wait"/>
</waiter-list>
</keylock>
<objectlock lockPartition="0" objid="1257107569" subresource="FULL" dbid="9" objectname="ProGet.dbo.DockerImageLayers" id="lock18a7385bb00" mode="IS" associatedObjectId="1257107569">
<owner-list>
<owner id="process183bc43d848" mode="IS"/>
</owner-list>
<waiter-list>
<waiter id="process183c2f8cca8" mode="X" requestType="wait"/>
</waiter-list>
</objectlock>
</resource-list>
</deadlock>
</deadlock-list>
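In case it's useful to anyone hitting the same thing: a deadlock graph like the one above can be pulled from SQL Server's built-in system_health Extended Events session with a query along these lines (a sketch, assuming the default ring_buffer target of system_health is still enabled, which it is out of the box):

-- Read xml_deadlock_report events from the system_health session's ring buffer.
SELECT
    xed.value('@timestamp', 'datetime2(3)') AS EventTime,
    xed.query('.')                          AS DeadlockGraph
FROM (
    SELECT CAST(st.target_data AS xml) AS TargetData
    FROM sys.dm_xe_session_targets AS st
    INNER JOIN sys.dm_xe_sessions AS s
        ON s.address = st.event_session_address
    WHERE s.name = N'system_health'
      AND st.target_name = N'ring_buffer'
) AS src
CROSS APPLY src.TargetData.nodes('RingBufferTarget/event[@name="xml_deadlock_report"]') AS q(xed);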
Hi @atripp,
I'm a little worried about disabling that trigger on a production instance of ProGet. Is it safe to do so?
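If we do try it, I assume the test would look something like this (just a sketch; I'm assuming the trigger is defined on dbo.DockerImages):

-- Temporarily disable the trigger, reproduce the parallel pushes, then re-enable it.
-- Assumes TR__DockerImages__ValidateUniqueDigest is defined on dbo.DockerImages.
DISABLE TRIGGER [TR__DockerImages__ValidateUniqueDigest] ON [dbo].[DockerImages];

-- ...reproduce the parallel docker pushes and check whether the deadlock is gone...

ENABLE TRIGGER [TR__DockerImages__ValidateUniqueDigest] ON [dbo].[DockerImages];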
Meanwhile, I also asked our DBA to monitor the database server and extract as much information as possible about the deadlocks. I'll provide all the details as soon as we get them.
BTW, we upgraded to ProGet 2022.2 several days ago, and the issue is still here.
Thanks!
Hi @Dan_Woolf,
Thank you for the advice to reorganize the indices; we had missed that article and didn't actually have regular reorganization scheduled. However, I believe the "deadlock" issue is unrelated to index fragmentation, which only affects performance.
So I reorganized all the indices, and now the fragmentation is low. But it didn't help: the issue is still here. We get these errors on different docker pushes:
One of the errors points at the trigger TR__DockerImages__ValidateUniqueDigest, and the second one is:
Transaction (Process ID 408) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
The second error is more specific. It seems that ProGet doesn't handle possible deadlocks properly and doesn't retry deadlocked database operations.
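For what it's worth, the usual pattern for this on the SQL side is to catch error 1205 and retry. A minimal T-SQL sketch of that pattern (illustration only, not ProGet's actual code; the retried statement is just a placeholder):

-- A sketch of a deadlock-retry loop; the real work goes where the placeholder is.
DECLARE @Attempt int = 1;
DECLARE @MaxAttempts int = 3;

WHILE @Attempt <= @MaxAttempts
BEGIN
    BEGIN TRY
        -- Placeholder for the deadlock-prone call, e.g. the stored procedure
        -- invocation that failed above.
        SELECT 1 AS Placeholder;
        BREAK;  -- success: leave the retry loop
    END TRY
    BEGIN CATCH
        -- 1205 = "chosen as the deadlock victim"
        IF ERROR_NUMBER() = 1205 AND @Attempt < @MaxAttempts
        BEGIN
            SET @Attempt += 1;
            WAITFOR DELAY '00:00:01';  -- small back-off before retrying
        END
        ELSE
        BEGIN
            THROW;  -- any other error, or the last attempt: re-raise
        END
    END CATCH
END;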
Hello!
It seems that ProGet experiences a deadlock when we upload a bunch of images to the feed in parallel.
We run around 10 parallel docker push commands with different images (but they share some layers) on our CI builds. Sometimes, this leads to an error.
On the client side, the error looks like this:
[1/1] ProcessException: Process 'docker' exited with code 1.
> /usr/bin/docker push REDACTED/web:2022.45.0
@ /buildAgent/work/20f1907942f0b034
Error output:
received unexpected HTTP status: 500 Internal Server Error
On the ProGet side, I found the following error in the logs (I set Diagnostics.FeedErrorLogging=* and Diagnostics.MinimumLogLevel=10):
Logged: 11.07.2022 17:20:35
Level: Warning
Category: Feed
Message: A 500 error occurred in REDACTED_FEED_NAME: Transaction (Process ID 359) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Details:
URL: https://REDACTED/v2/REDACTED/web/manifests/2022.45.0
Referrer: (not set)
User: (unknown)
User Agent: docker/20.10.17 go/go1.17.11 git-commit/a89b842 kernel/4.19.0-20-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.17 \(linux\))
Stack trace:
We are getting that error almost every day in our CI builds.
What can we do about this issue? Of course, we could upload the images sequentially, but there is still a chance that different CI builds will upload similar images at the same time.
ProGet 6.0.16 with MS SQL Server
Actually, in our case, the broken files are stored in ProGet's storage on disk; I checked them during the investigation.
So it's not an issue that happens on the fly during the download process. And it's also not an issue that happens during package uploading, since right after uploading, the packages are valid and identical to the originals, both when downloaded and in the feed's storage.
Maybe those packages were corrupted in the feed's storage a relatively long time ago for some other reason (a bug in earlier versions of ProGet? a bug in our infrastructure?), and the issue only became observable now because of caches or something like that...
At this point, thanks to your answers, it seems that ProGet is not the one to blame.
So, I ran a script yesterday that found all the broken packages in the feed, downloaded them from nuget.org, and republished them to ProGet. It was about 100 packages overall.
Now I'm sure that every package in our feed is valid. I'll keep monitoring the feed, and if new broken packages arise in the feed, I'll continue the investigation and let you know if ProGet is somehow related to the issue.
Thanks a lot, @atripp!
Hi @atripp ,
Thank you for the fast response!
No, it's definitely not the "Repackage" feature; we don't use it. Also, we don't re-sign packages: the promotion from nuget.org to ProGet is just a pair of native nuget restore (to also collect all dependencies of the package) and nuget push of the collected .nupkg's.
So after uploading to ProGet, packages downloaded from nuget.org and ProGet are actually identical.
You say that ProGet doesn't modify files after uploading, only during it. But can ProGet strip, and therefore modify, packages after uploading if re-indexing is triggered in the "symbol server" settings? If so, then maybe one of the stripping options was enabled some time ago and re-indexing was triggered... That would explain how already uploaded packages could be broken.
Also, you say:
However, this wouldn't alter how existing items are stored in the archive file.... those would be streamed as-is.
Just in case: do you mean that not only the content of each file, but also the compression method of every file in the .nupkg will be preserved as-is, and that ProGet handles that explicitly?
Because if the new package is just a 'new' zip archive containing all the files that weren't filtered out by the stripping features, and the compression method of ZIP entries is not preserved explicitly on a per-file basis, then in the end we get a package with a signature file that is binary-identical to the original after decompression, but compressed in the archive with the Deflate method, just like all the other files inside the .nupkg file.
Thank you!
Hello!
We have a problem with a NuGet feed in ProGet.
We use it to store public NuGet packages from nuget.org. The only thing we do with those packages is download them from nuget.org and upload them to ProGet.
Initially, they download from ProGet correctly, but after some time, for an unknown reason, some packages fail validation with the following error:
C:\Nuget\nuget.exe verify -All .\Microsoft.AspNetCore.Http.Features.2.1.1.nupkg
Verifying Microsoft.AspNetCore.Http.Features.2.1.1
...\Microsoft.AspNetCore.Http.Features.2.1.1.nupkg
Finished with 1 errors and 0 warnings.
NU3005: The package signature file entry is invalid. The central directory header field 'compression method' has an invalid value (8).
It turns out that in such broken NuGet packages, the signature file is compressed with the Deflate method. In valid NuGet packages, the signature file is stored uncompressed.
I also described the issue with NuGet packages here: https://github.com/NuGet/Home/issues/11861
It seems that nuget.exe itself should support signatures compressed this way, since they are perfectly valid after decompression. However, they are not supported right now.
So the question is: how is it possible that ProGet, after some time, repackages NuGet packages and compresses the signature with the Deflate method?
The only idea I have is that, because we enabled the Symbol Server, ProGet somehow lazily processes NuGet packages to extract metadata or similar, and rewrites the packages, breaking the signature. Maybe it happens when someone tries to access the symbols of a specific package.
Nevertheless, we disabled "Strip symbol files", "Strip source code" and "Remove signature file" for the Symbol server, so it seems that there is no reason for ProGet to overwrite packages.
What can we do?