
500 error on "docker push" due to a deadlock



  • Hello!

    It seems that ProGet hits a deadlock when we upload a bunch of images to the feed in parallel.

    Our CI builds start around 10 parallel docker push commands with different images (they share some layers). Sometimes this leads to an error.

    On the client side, the error looks like this:

        [1/1] ProcessException: Process 'docker' exited with code 1.
           > /usr/bin/docker push REDACTED/web:2022.45.0
           @ /buildAgent/work/20f1907942f0b034
        Error output:
           received unexpected HTTP status: 500 Internal Server Error
    

    On the ProGet side, I found the following error in the logs (I set Diagnostics.FeedErrorLogging=* and Diagnostics.MinimumLogLevel=10):

    Logged: 11.07.2022 17:20:35
    Level: Warning
    Category: Feed
    Message: A 500 error occurred in REDACTED_FEED_NAME: Transaction (Process ID 359) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
    Details:
    URL: https://REDACTED/v2/REDACTED/web/manifests/2022.45.0
    Referrer: (not set)
    User: (unknown)
    User Agent: docker/20.10.17 go/go1.17.11 git-commit/a89b842 kernel/4.19.0-20-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.17 \(linux\))
    Stack trace:
    

    We are getting that error almost every day in our CI builds.

    What can we do about this issue? Of course, we could upload the images sequentially, but there is still a chance that different CI builds will upload similar images at the same time.

    ProGet 6.0.16 with MS SQL Server


  • inedo-engineer

    Hi @inok_spb,

    This error looks to be coming from SQL Server itself. I think the first thing to check would be index fragmentation in SQL Server. We have some information about this in our SQL Server Implementations with Inedo Tools documentation. If you run the following SQL, you should be able to identify any issues with index fragmentation.

    SELECT dbschemas.[name] AS 'Schema',
           dbtables.[name] AS 'Table',
           dbindexes.[name] AS 'Index',
           indexstats.avg_fragmentation_in_percent,
           indexstats.page_count
      FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL) AS indexstats
           INNER JOIN sys.tables dbtables ON dbtables.[object_id] = indexstats.[object_id]
           INNER JOIN sys.schemas dbschemas ON dbtables.[schema_id] = dbschemas.[schema_id]
           INNER JOIN sys.indexes AS dbindexes ON dbindexes.[object_id] = indexstats.[object_id]
                  AND indexstats.index_id = dbindexes.index_id
     WHERE indexstats.database_id = DB_ID()
     ORDER BY indexstats.avg_fragmentation_in_percent DESC
    

    Once you have pinpointed where your potential problems are, you can address them in SQL Server Management Studio under Tables > select table > Indexes > select index > right-click > Reorganize. See "Remove fragmentation using SQL Server Management Studio" in the SQL Server documentation to learn more. The same can also be done in T-SQL, as sketched below.
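
    If you prefer scripting over the SSMS UI, a minimal T-SQL sketch would look like this (the table and index names are placeholders; substitute whatever the fragmentation query above flags):

    -- Reorganize a lightly fragmented index (commonly suggested for roughly 5-30% fragmentation)
    ALTER INDEX [IX_SomeIndex] ON [dbo].[SomeTable] REORGANIZE;

    -- Rebuild a heavily fragmented index (commonly suggested for more than ~30% fragmentation)
    ALTER INDEX [IX_SomeIndex] ON [dbo].[SomeTable] REBUILD;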

    Please let me know if you have any questions.

    Thanks,
    Dan



  • Hi @Dan_Woolf,

    Thank you for the advice to reorganize the indexes; we had missed that article and didn't have regular reorganization scheduled. However, I believe the deadlock issue is unrelated to index fragmentation, which only affects performance.

    So I reorganized all the indexes, and the fragmentation is now low. But it didn't help; the issue is still there. We see these errors on different docker pushes:

    1. A 500 error occurred in FEED_NAME: Transaction (Process ID 428) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
    2. A 500 error occurred in FEED_NAME: 12051351TR__DockerImages__ValidateUniqueDigest8`Transaction (Process ID 408) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

    The second error is more specific. It seems that ProGet doesn't handle possible deadlocks properly and doesn't retry deadlocked database operations.
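
    To illustrate what I mean by retrying: SQL Server reports the deadlock victim as error 1205, and the losing transaction can simply be re-run. A rough T-SQL-style sketch (purely illustrative; ProGet's actual data access is .NET code, not this procedure):

    DECLARE @attempt int = 0;
    WHILE @attempt < 3
    BEGIN
        BEGIN TRY
            BEGIN TRANSACTION;
            -- ... the deadlock-prone work (e.g. the manifest/layer updates) ...
            COMMIT TRANSACTION;
            BREAK;  -- success, stop retrying
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
            IF ERROR_NUMBER() = 1205  -- 1205 = chosen as deadlock victim
                SET @attempt += 1;    -- deadlock: retry the whole transaction
            ELSE
                THROW;                -- any other error: rethrow
        END CATCH
    END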


  • inedo-engineer

    Hi @inok_spb

    I think you're right, something must be deadlocking.

    There was a database change in ProGet 6.0.16 that basically involved creating a large transaction to handle a race condition.

    https://inedo.myjetbrains.com/youtrack/issue/PG-2140

    We haven't heard any other reports of this deadlock (particularly not from our paid users)... and unfortunately deadlock issues are really hard and time-consuming to reproduce and track down (especially with Docker's shared layers).

    So, since you're a free user, we'd really appreciate any other info/help you can provide to track this down and fix it :)

    The issue would most certainly be in database code, which is viewable/editable in SSMS.

    One quick thing to try -- can you just disable TR__DockerRepositoryTags__ValidateImage? That might be the culprit... and if so, we can always add it to the stored procedures as well.
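
    For reference, disabling (and later re-enabling) it would look roughly like this; I'm assuming the trigger lives on the DockerRepositoryTags table, based on its name:

    DISABLE TRIGGER [TR__DockerRepositoryTags__ValidateImage] ON [dbo].[DockerRepositoryTags];

    -- and to restore the validation afterwards:
    ENABLE TRIGGER [TR__DockerRepositoryTags__ValidateImage] ON [dbo].[DockerRepositoryTags];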

    Cheers,
    Alana



  • Hi @atripp,

    I'm a little worried about disabling that trigger on a production instance of ProGet. Is it safe to do so?

    Meanwhile, I also asked our DBA to monitor the database server and extract as much information as possible about the deadlocks. I'll provide all the details as soon as we get them.
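
    In case it helps, one way to pull recent deadlock graphs without any extra tracing is the built-in system_health Extended Events session; this is a standard SQL Server technique, nothing ProGet-specific:

    SELECT CAST(xed.event_data AS xml).query('(event/data/value/deadlock)[1]') AS DeadlockGraph
      FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL) AS xed
     WHERE xed.object_name = 'xml_deadlock_report';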

    BTW, we upgraded to ProGet 2022.2 several days ago, and the issue is still there.

    Thanks!


  • inedo-engineer

    Hi @inok_spb ,

    There won't be any issue in disabling that trigger. It's basically like a foreign key constraint and just performs data validation. However, I suspect it's where the problem is, so please give it a shot and let us know.

    We haven't had any other reports of this, haven't tried to reproduce it ourselves, and haven't attempted a fix... so it's not surprising that the issue is still there 😅

    Cheers,
    Alana



  • Hi @atripp,

    We still haven't disabled the trigger, so I can't say whether it is the cause.

    But we caught some info about the deadlock. It occurs when we're uploading several images with mostly shared layers in parallel.
    As you can see, DockerImages_SetData conflicts with DockerImages_GetImage:

    <deadlock-list>
     <deadlock victim="process183bc43d848">
      <process-list>
       <process id="process183bc43d848" taskpriority="0" logused="0" waitresource="KEY: 9:72057594064601088 (43a62080b7c0)" waittime="2614" ownerId="11261239428" transactionname="SELECT" lasttranstarted="2022-08-23T15:15:11.593" XDES="0x189a543bbe8" lockMode="S" schedulerid="3" kpid="3328" status="suspended" spid="398" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="2022-08-23T15:15:11.593" lastbatchcompleted="2022-08-23T15:15:11.593" lastattention="2022-08-23T15:15:11.593" clientapp="Core Microsoft SqlClient Data Provider" hostname="TM" hostpid="5720" loginname="ProGetUser" isolationlevel="read committed (2)" xactid="11261239428" currentdb="9" currentdbname="ProGet" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
        <executionStack>
         <frame procname="ProGet.dbo.DockerImages_GetImage" line="13" stmtstart="304" stmtend="1188" sqlhandle="0x030009000ec450743d0ba900dfae000001000000000000000000000000000000000000000000000000000000">
    SELECT *
    	  FROM [DockerImages_Extended]
    	 WHERE [Feed_Id] = @Feed_Id
    	   AND [Repository_Name] = @Repository_Name
    	   AND [Image_Digest] = @TagOrDigest_Name
    
    	UNION
    	
    	SELECT DI.*
    	  FROM [DockerRepositoryTags] DRT
    	       INNER JOIN [DockerImages_Extended] DI
    		           ON DI.[DockerImage_Id] = DRT.[DockerImage_Id]
    	 WHERE DI.[Feed_Id] = @Feed_Id
    	   AND DI.[Repository_Name] = @Repository_Name
    	   AND DRT.[Tag_Name] = @TagOrDigest_Nam     </frame>
        </executionStack>
        <inputbuf>
    Proc [Database Id = 9 Object Id = 1951450126]    </inputbuf>
       </process>
       <process id="process183c2f8cca8" taskpriority="0" logused="736" waitresource="OBJECT: 9:1257107569:0 " waittime="2616" ownerId="11261239217" transactionname="user_transaction" lasttranstarted="2022-08-23T15:15:11.580" XDES="0x18865dd4428" lockMode="X" schedulerid="1" kpid="10636" status="suspended" spid="85" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2022-08-23T15:15:11.580" lastbatchcompleted="2022-08-23T15:15:11.580" lastattention="2022-08-23T15:15:11.573" clientapp="Core Microsoft SqlClient Data Provider" hostname="TM" hostpid="5720" loginname="ProGetUser" isolationlevel="read committed (2)" xactid="11261239217" currentdb="9" currentdbname="ProGet" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
        <executionStack>
         <frame procname="ProGet.dbo.DockerImages_SetData" line="74" stmtstart="4832" stmtend="5000" sqlhandle="0x03000900d72b8704a10ba900dfae000001000000000000000000000000000000000000000000000000000000">
    DELETE [DockerImageLayers] WITH (TABLOCKX)
    	 WHERE [DockerImage_Id] = @DockerImage_I     </frame>
        </executionStack>
        <inputbuf>
    Proc [Database Id = 9 Object Id = 75967447]    </inputbuf>
       </process>
      </process-list>
      <resource-list>
       <keylock hobtid="72057594064601088" dbid="9" objectname="ProGet.dbo.DockerImages" indexname="PK__DockerImages" id="lock1830340c380" mode="X" associatedObjectId="72057594064601088">
        <owner-list>
         <owner id="process183c2f8cca8" mode="X"/>
        </owner-list>
        <waiter-list>
         <waiter id="process183bc43d848" mode="S" requestType="wait"/>
        </waiter-list>
       </keylock>
       <objectlock lockPartition="0" objid="1257107569" subresource="FULL" dbid="9" objectname="ProGet.dbo.DockerImageLayers" id="lock18a7385bb00" mode="IS" associatedObjectId="1257107569">
        <owner-list>
         <owner id="process183bc43d848" mode="IS"/>
        </owner-list>
        <waiter-list>
         <waiter id="process183c2f8cca8" mode="X" requestType="wait"/>
        </waiter-list>
       </objectlock>
      </resource-list>
     </deadlock>
    </deadlock-list>
    

  • inedo-engineer

    Thanks for the update @inok_spb, that's quite helpful.

    Looking at the code, I can see how that could happen. The code isn't very pretty, and it's a bit complex. Unfortunately, it's not simple for me to reproduce, and the issue is low priority since no one else has reported it (aside from a single free-edition user).

    However, if you can modify the stored procedures, then I can give you some pointers on what I would do to investigate. If you can fix the stored procedure on your server, then we can modify our source code, and the issue will be resolved!

    The first thing I would do is modify DockerImages_SetData by moving the following statement to the top of the code block, right below the BEGIN TRANSACTION statement:

    DELETE [DockerImageLayers] WITH (TABLOCKX)
     WHERE [DockerImage_Id] = @DockerImage_Id
    

    I don't think a TABLOCKX is appropriate here, but regardless -- moving this to the top should block DockerImages_GetImage until the procedure finishes. I don't see any other side-effects from making this change; a rough sketch of the reordering is below.
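
    Very roughly, the reordered body would look something like this (this only sketches the statement order; the real DockerImages_SetData contains more logic than shown here):

    BEGIN TRANSACTION

        -- moved up: take the [DockerImageLayers] lock before anything else touches [DockerImages]
        DELETE [DockerImageLayers] WITH (TABLOCKX)
         WHERE [DockerImage_Id] = @DockerImage_Id

        -- ...the rest of the procedure (the UPDATE of [DockerImages], the re-insert
        -- of the layer rows, etc.) stays in its current order...

    COMMIT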

    If you have someone who is really skilled in SQL Server, then I'm sure they could do a better job than I would, but this is where I would start.

    But please try this and let me know!



  • Hi, @atripp.

    About the approach you suggested:
    Moving DELETE [DockerImageLayers] ... to the top of the procedure seems wrong. Yes, it changes the order in which locks are taken and will hopefully fix the deadlock above. However, there is no guarantee that the conflicting procedure won't change its own lock order and deadlock again, or that no other procedure will start to conflict with the new order of operations.

    I investigated the issue further, and it seems I've found the right solution. The problem is the TABLOCKX, and it seems it can be safely removed. Why?

    If the procedure contained only that DELETE [DockerImageLayers] and the INSERT INTO [DockerImageLayers], the TABLOCKX would prevent the scenario where two transactions change the layers of the same image concurrently. I'm still not sure it is actually possible to get a serialization anomaly in that case, but it looks suspicious, so TABLOCKX seems reasonable in that hypothetical case.

    However, before that DELETE [DockerImageLayers] WITH (TABLOCKX) WHERE [DockerImage_Id] = @DockerImage_Id, we have this:

    UPDATE [DockerImages]
       SET [ContainerConfigBlob_Id] = @ContainerConfigBlob_Id,
           [ImageType_Code] = @ImageType_Code
     WHERE [DockerImage_Id] = @DockerImage_Id
    

    That UPDATE statement, if executed, takes an exclusive lock on the image it updates; concurrent transactions that change the same image will be blocked at that statement.
    So there is no way for multiple transactions changing the same image to get past that barrier and reach DELETE [DockerImageLayers] ..., which makes the TABLOCKX unnecessary in this case.

    The only thing I'm concerned about is this: will UPDATE [DockerImages] hold an exclusive lock on the image if the image hasn't actually changed (i.e. if ContainerConfigBlob_Id and ImageType_Code stay the same after the update)?
    My experiments show that the row stays locked even if it isn't actually changed (roughly as in the sketch below). However, I think it's reasonable to use UPDLOCK to guarantee that the row remains locked until the end of the transaction: UPDATE [DockerImages] WITH (UPDLOCK) ....
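
    For completeness, this is roughly how such an experiment can be run: execute the no-op UPDATE in an open transaction and then look at the locks the session holds (the id value is a placeholder):

    BEGIN TRANSACTION;

    -- a "no-op" update: the values do not actually change
    UPDATE [DockerImages] WITH (UPDLOCK)
       SET [ContainerConfigBlob_Id] = [ContainerConfigBlob_Id],
           [ImageType_Code] = [ImageType_Code]
     WHERE [DockerImage_Id] = 1;  -- placeholder id

    -- locks held by this session; expect a KEY lock on the row in U/X mode
    SELECT resource_type, request_mode, request_status
      FROM sys.dm_tran_locks
     WHERE request_session_id = @@SPID;

    ROLLBACK;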

    I'll apply that solution (remove TABLOCKX and add UPDATE [DockerImages] WITH (UPDLOCK)) and let you know if it actually fixes the issue.


  • inedo-engineer

    @inok_spb thanks much for investigating this :)

    Your assessment makes a lot of sense; this definitely wasn't easy to figure out.

    Let me know if that works; it'll be easy to update our code once we know what works :)



  • Well, it works :)

    Before the fix, the error appeared in logs almost every day. I applied the fix a week ago, and the error is gone.

    Here are my changes in DockerImages_SetData:

    -- UPDLOCK added
    UPDATE [DockerImages] WITH (UPDLOCK)
       SET [ContainerConfigBlob_Id] = @ContainerConfigBlob_Id,
           [ImageType_Code] = @ImageType_Code
     WHERE [DockerImage_Id] = @DockerImage_Id

    ...

    -- TABLOCKX removed
    DELETE [DockerImageLayers]
     WHERE [DockerImage_Id] = @DockerImage_Id
    
    

  • inedo-engineer

    Hi @inok_spb,

    That's great news! Thanks for the help and testing -- I don't think we could have fixed this otherwise :)

    I just committed the code changes on our end (PG-2222), so it should be in the next maintenance release (2022.12).

    Cheers,
    Alana



  • Thank you for your help, @atripp!

    Just to be sure... When we update to 2022.12, my changes in the procedure will be overwritten by yours, right? Or should we do something special during the update?


  • inedo-engineer

    Hi @inok_spb - that's correct, the proc is dropped/created each time ProGet is upgraded/installed.

