
500 error on "docker push" due to a deadlock



  • Hello!

    It seems that ProGet hits a deadlock when we upload a bunch of images to the feed in parallel.

    Our CI builds start around 10 parallel docker push commands with different images (they share some layers). Sometimes this leads to an error.

    On the client side, the error looks like this:

        [1/1] ProcessException: Process 'docker' exited with code 1.
           > /usr/bin/docker push REDACTED/web:2022.45.0
           @ /buildAgent/work/20f1907942f0b034
        Error output:
           received unexpected HTTP status: 500 Internal Server Error
    

    On the ProGet side, I found the following error in the logs (I set Diagnostics.FeedErrorLogging=* and Diagnostics.MinimumLogLevel=10):

    Logged: 11.07.2022 17:20:35
    Level: Warning
    Category: Feed
    Message: A 500 error occurred in REDACTED_FEED_NAME: Transaction (Process ID 359) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
    Details:
    URL: https://REDACTED/v2/REDACTED/web/manifests/2022.45.0
    Referrer: (not set)
    User: (unknown)
    User Agent: docker/20.10.17 go/go1.17.11 git-commit/a89b842 kernel/4.19.0-20-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.17 \(linux\))
    Stack trace:
    

    We are getting that error almost every day in our CI builds.

    What can we do about this issue? Of course, we could upload the images sequentially, but there is still a chance that different CI builds will upload similar images at the same time.

    ProGet 6.0.16 with MS SQL Server


  • inedo-engineer

    Hi @inok_spb,

    This error looks to be coming from SQL Server itself. I think the first thing to check would be index fragmentation in SQL Server. We have some information about this in our SQL Server Implementations with Inedo Tools documentation. If you run the following SQL, you should be able to identify any issues with index fragmentation.

    SELECT dbschemas.[name] AS 'Schema',
           dbtables.[name] AS 'Table',
           dbindexes.[name] AS 'Index',
           indexstats.avg_fragmentation_in_percent,
           indexstats.page_count
      FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL) AS indexstats
           INNER JOIN sys.tables dbtables ON dbtables.[object_id] = indexstats.[object_id]
           INNER JOIN sys.schemas dbschemas ON dbtables.[schema_id] = dbschemas.[schema_id]
           INNER JOIN sys.indexes AS dbindexes ON dbindexes.[object_id] = indexstats.[object_id]
                  AND indexstats.index_id = dbindexes.index_id
     WHERE indexstats.database_id = DB_ID()
     ORDER BY indexstats.avg_fragmentation_in_percent DESC
    

    Once you have pinpointed where your potential problems are, you can address them in SQL Server Management Studio under Tables > select table > Indexes > select index > right-click > Reorganize. See "Remove fragmentation using SQL Server Management Studio" in the SQL Server documentation to learn more. The same can also be done in T-SQL, as sketched below.
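
    If you prefer scripting over the SSMS UI, a minimal T-SQL sketch would look like this (the table and index names are placeholders; substitute whatever the fragmentation query above flags):

    -- Reorganize a lightly fragmented index (commonly suggested for roughly 5-30% fragmentation)
    ALTER INDEX [IX_SomeIndex] ON [dbo].[SomeTable] REORGANIZE;

    -- Rebuild a heavily fragmented index (commonly suggested for more than ~30% fragmentation)
    ALTER INDEX [IX_SomeIndex] ON [dbo].[SomeTable] REBUILD;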

    Please let me know if you have any questions.

    Thanks,
    Dan



  • Hi @Dan_Woolf,

    Thank you for the advice to reorganize the indexes; we had missed that article and didn't have regular reorganization scheduled. However, I believe the deadlock issue is unrelated to index fragmentation, which only affects performance.

    So I reorganized all the indexes, and the fragmentation is now low. But it didn't help; the issue is still there. We see these errors on different docker pushes:

    1. A 500 error occurred in FEED_NAME: Transaction (Process ID 428) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
    2. A 500 error occurred in FEED_NAME: 12051351TR__DockerImages__ValidateUniqueDigest8`Transaction (Process ID 408) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

    The second error is more specific. It seems that ProGet doesn't handle possible deadlocks properly and doesn't retry deadlocked database operations.
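
    To illustrate what I mean by retrying: SQL Server reports the deadlock victim as error 1205, and the losing transaction can simply be re-run. A rough T-SQL-style sketch (purely illustrative; ProGet's actual data access is .NET code, not this procedure):

    DECLARE @attempt int = 0;
    WHILE @attempt < 3
    BEGIN
        BEGIN TRY
            BEGIN TRANSACTION;
            -- ... the deadlock-prone work (e.g. the manifest/layer updates) ...
            COMMIT TRANSACTION;
            BREAK;  -- success, stop retrying
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
            IF ERROR_NUMBER() = 1205  -- 1205 = chosen as deadlock victim
                SET @attempt += 1;    -- deadlock: retry the whole transaction
            ELSE
                THROW;                -- any other error: rethrow
        END CATCH
    END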


  • inedo-engineer

    Hi @inok_spb

    I think you're right, something must be deadlocking.

    There was a database change in ProGet 6.0.16 that basically involved creating a large transaction to handle a race condition.

    https://inedo.myjetbrains.com/youtrack/issue/PG-2140

    We haven't heard any other reports of this deadlock (particularly not from our paid users)... and unfortunately deadlock issues are really hard and time-consuming to reproduce and track down (especially with Docker's shared layers).

    So, since you're a free user, we'd really appreciate any other info/help you can provide to track this down and fix it :)

    The issue would most certainly be in database code, which is viewable/editable in SSMS.

    One quick thing to try -- can you just disable TR__DockerRepositoryTags__ValidateImage? That might be the culprit... and if so, we can always add it to the stored procedures as well.
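
    For reference, disabling (and later re-enabling) it would look roughly like this; I'm assuming the trigger lives on the DockerRepositoryTags table, based on its name:

    DISABLE TRIGGER [TR__DockerRepositoryTags__ValidateImage] ON [dbo].[DockerRepositoryTags];

    -- and to restore the validation afterwards:
    ENABLE TRIGGER [TR__DockerRepositoryTags__ValidateImage] ON [dbo].[DockerRepositoryTags];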

    Cheers,
    Alana



  • Hi @atripp,

    I'm a little worried about disabling that trigger on a production instance of ProGet. Is it safe to do so?

    Meanwhile, I also asked our DBA to monitor the database server and extract as much information as possible about the deadlocks. I'll provide all the details as soon as we get them.
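
    In case it helps, one way to pull recent deadlock graphs without any extra tracing is the built-in system_health Extended Events session; this is a standard SQL Server technique, nothing ProGet-specific:

    SELECT CAST(xed.event_data AS xml).query('(event/data/value/deadlock)[1]') AS DeadlockGraph
      FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL) AS xed
     WHERE xed.object_name = 'xml_deadlock_report';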

    BTW, we upgraded to ProGet 2022.2 several days ago, and the issue is still there.

    Thanks!


  • inedo-engineer

    Hi @inok_spb ,

    There won't be any issue in disabling that trigger. It's basically like a foreign key constraint and just performs data validation. However, I suspect it's where the problem is, so please give it a shot and let us know.

    We haven't had any other reports of this, haven't tried to reproduce it ourselves, and haven't attempted a fix... so it's not surprising that the issue is still there 😅

    Cheers,
    Alana



  • Hi @atripp,

    We still haven't disabled the trigger, so I can't say whether it is the cause.

    But we caught some info about the deadlock. It occurs when we're uploading several images with mostly shared layers in parallel.
    As you can see, DockerImages_SetData conflicts with DockerImages_GetImage:

    <deadlock-list>
     <deadlock victim="process183bc43d848">
      <process-list>
       <process id="process183bc43d848" taskpriority="0" logused="0" waitresource="KEY: 9:72057594064601088 (43a62080b7c0)" waittime="2614" ownerId="11261239428" transactionname="SELECT" lasttranstarted="2022-08-23T15:15:11.593" XDES="0x189a543bbe8" lockMode="S" schedulerid="3" kpid="3328" status="suspended" spid="398" sbid="0" ecid="0" priority="0" trancount="0" lastbatchstarted="2022-08-23T15:15:11.593" lastbatchcompleted="2022-08-23T15:15:11.593" lastattention="2022-08-23T15:15:11.593" clientapp="Core Microsoft SqlClient Data Provider" hostname="TM" hostpid="5720" loginname="ProGetUser" isolationlevel="read committed (2)" xactid="11261239428" currentdb="9" currentdbname="ProGet" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
        <executionStack>
         <frame procname="ProGet.dbo.DockerImages_GetImage" line="13" stmtstart="304" stmtend="1188" sqlhandle="0x030009000ec450743d0ba900dfae000001000000000000000000000000000000000000000000000000000000">
    SELECT *
    	  FROM [DockerImages_Extended]
    	 WHERE [Feed_Id] = @Feed_Id
    	   AND [Repository_Name] = @Repository_Name
    	   AND [Image_Digest] = @TagOrDigest_Name
    
    	UNION
    	
    	SELECT DI.*
    	  FROM [DockerRepositoryTags] DRT
    	       INNER JOIN [DockerImages_Extended] DI
    		           ON DI.[DockerImage_Id] = DRT.[DockerImage_Id]
    	 WHERE DI.[Feed_Id] = @Feed_Id
    	   AND DI.[Repository_Name] = @Repository_Name
    	   AND DRT.[Tag_Name] = @TagOrDigest_Nam     </frame>
        </executionStack>
        <inputbuf>
    Proc [Database Id = 9 Object Id = 1951450126]    </inputbuf>
       </process>
       <process id="process183c2f8cca8" taskpriority="0" logused="736" waitresource="OBJECT: 9:1257107569:0 " waittime="2616" ownerId="11261239217" transactionname="user_transaction" lasttranstarted="2022-08-23T15:15:11.580" XDES="0x18865dd4428" lockMode="X" schedulerid="1" kpid="10636" status="suspended" spid="85" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2022-08-23T15:15:11.580" lastbatchcompleted="2022-08-23T15:15:11.580" lastattention="2022-08-23T15:15:11.573" clientapp="Core Microsoft SqlClient Data Provider" hostname="TM" hostpid="5720" loginname="ProGetUser" isolationlevel="read committed (2)" xactid="11261239217" currentdb="9" currentdbname="ProGet" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
        <executionStack>
         <frame procname="ProGet.dbo.DockerImages_SetData" line="74" stmtstart="4832" stmtend="5000" sqlhandle="0x03000900d72b8704a10ba900dfae000001000000000000000000000000000000000000000000000000000000">
    DELETE [DockerImageLayers] WITH (TABLOCKX)
    	 WHERE [DockerImage_Id] = @DockerImage_I     </frame>
        </executionStack>
        <inputbuf>
    Proc [Database Id = 9 Object Id = 75967447]    </inputbuf>
       </process>
      </process-list>
      <resource-list>
       <keylock hobtid="72057594064601088" dbid="9" objectname="ProGet.dbo.DockerImages" indexname="PK__DockerImages" id="lock1830340c380" mode="X" associatedObjectId="72057594064601088">
        <owner-list>
         <owner id="process183c2f8cca8" mode="X"/>
        </owner-list>
        <waiter-list>
         <waiter id="process183bc43d848" mode="S" requestType="wait"/>
        </waiter-list>
       </keylock>
       <objectlock lockPartition="0" objid="1257107569" subresource="FULL" dbid="9" objectname="ProGet.dbo.DockerImageLayers" id="lock18a7385bb00" mode="IS" associatedObjectId="1257107569">
        <owner-list>
         <owner id="process183bc43d848" mode="IS"/>
        </owner-list>
        <waiter-list>
         <waiter id="process183c2f8cca8" mode="X" requestType="wait"/>
        </waiter-list>
       </objectlock>
      </resource-list>
     </deadlock>
    </deadlock-list>
    

  • inedo-engineer

    Thanks for the update @inok_spb, that's quite helpful.

    Looking at the code, I can see how that could happen. The code isn't very pretty, and it's a bit complex. Unfortunately, it's not simple for me to reproduce, and the issue is low priority since no one else has reported it (aside from a single free-edition user).

    However, if you can modify the stored procedures, then I can give you some pointers on what I would do to investigate. If you can fix the stored procedure on your server, then we can modify our source code, and the issue will be resolved!

    The first thing I would do is modify DockerImages_SetData by moving the following statement to the top of the code block, right below the BEGIN TRANSACTION statement:

    DELETE [DockerImageLayers] WITH (TABLOCKX)
     WHERE [DockerImage_Id] = @DockerImage_Id
    

    I don't think a TABLOCKX is appropriate here, but regardless -- moving this to the top should block DockerImages_GetImage until the procedure finishes. I don't see any other side-effects from making this change; a rough sketch of the reordering is below.
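
    Very roughly, the reordered body would look something like this (this only sketches the statement order; the real DockerImages_SetData contains more logic than shown here):

    BEGIN TRANSACTION

        -- moved up: take the [DockerImageLayers] lock before anything else touches [DockerImages]
        DELETE [DockerImageLayers] WITH (TABLOCKX)
         WHERE [DockerImage_Id] = @DockerImage_Id

        -- ...the rest of the procedure (the UPDATE of [DockerImages], the re-insert
        -- of the layer rows, etc.) stays in its current order...

    COMMIT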

    If you have someone who is really skilled in SQL Server, then I'm sure they could do a better job than I would, but this is where I would start.

    But please try this and let me know!



  • Hi, @atripp.

    About the approach you suggested:
    Moving DELETE [DockerImageLayers] ... to the top of the procedure seems wrong. Yes, it changes the order in which locks are taken and will hopefully fix the deadlock above. However, there is no guarantee that the conflicting procedure won't change its own lock order and deadlock again, or that no other procedure will start to conflict with the new order of operations.

    I investigated the issue further, and it seems I've found the right solution. The problem is the TABLOCKX, and it seems it can be safely removed. Why?

    If the procedure contained only that DELETE [DockerImageLayers] and the INSERT INTO [DockerImageLayers], the TABLOCKX would prevent the scenario where two transactions change the layers of the same image concurrently. I'm still not sure it is actually possible to get a serialization anomaly in that case, but it looks suspicious, so TABLOCKX seems reasonable in that hypothetical case.

    However, before that DELETE [DockerImageLayers] WITH (TABLOCKX) WHERE [DockerImage_Id] = @DockerImage_Id, we have this:

    UPDATE [DockerImages]
       SET [ContainerConfigBlob_Id] = @ContainerConfigBlob_Id,
           [ImageType_Code] = @ImageType_Code
     WHERE [DockerImage_Id] = @DockerImage_Id
    

    That UPDATE statement, if executed, takes an exclusive lock on the image it updates; concurrent transactions that change the same image will be blocked at that statement.
    So there is no way for multiple transactions changing the same image to get past that barrier and reach DELETE [DockerImageLayers] ..., which makes the TABLOCKX unnecessary in this case.

    The only thing I'm concerned about is this: will UPDATE [DockerImages] hold an exclusive lock on the image if the image hasn't actually changed (i.e. if ContainerConfigBlob_Id and ImageType_Code stay the same after the update)?
    My experiments show that the row stays locked even if it isn't actually changed (roughly as in the sketch below). However, I think it's reasonable to use UPDLOCK to guarantee that the row remains locked until the end of the transaction: UPDATE [DockerImages] WITH (UPDLOCK) ....
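
    For completeness, this is roughly how such an experiment can be run: execute the no-op UPDATE in an open transaction and then look at the locks the session holds (the id value is a placeholder):

    BEGIN TRANSACTION;

    -- a "no-op" update: the values do not actually change
    UPDATE [DockerImages] WITH (UPDLOCK)
       SET [ContainerConfigBlob_Id] = [ContainerConfigBlob_Id],
           [ImageType_Code] = [ImageType_Code]
     WHERE [DockerImage_Id] = 1;  -- placeholder id

    -- locks held by this session; expect a KEY lock on the row in U/X mode
    SELECT resource_type, request_mode, request_status
      FROM sys.dm_tran_locks
     WHERE request_session_id = @@SPID;

    ROLLBACK;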

    I'll apply that solution (remove TABLOCKX and add UPDATE [DockerImages] WITH (UPDLOCK)) and let you know if it actually fixes the issue.


  • inedo-engineer

    @inok_spb thanks much for investigating this :)

    Your assessment makes a lot of sense; this definitely wasn't easy to figure out.

    Let me know if that works; it'll be easy to update our code once we know what works :)



  • Well, it works :)

    Before the fix, the error appeared in logs almost every day. I applied the fix a week ago, and the error is gone.

    Here are my changes in DockerImages_SetData:

    -- UPDLOCK added
    UPDATE [DockerImages] WITH (UPDLOCK)
       SET [ContainerConfigBlob_Id] = @ContainerConfigBlob_Id,
           [ImageType_Code] = @ImageType_Code
     WHERE [DockerImage_Id] = @DockerImage_Id

    ...

    -- TABLOCKX removed
    DELETE [DockerImageLayers]
     WHERE [DockerImage_Id] = @DockerImage_Id
    
    

  • inedo-engineer

    Hi @inok_spb,

    That's great news! Thanks for the help and testing -- I don't think we could have fixed this otherwise :)

    I just committed the code changes on our end (PG-2222), so it should be in the next maintenance release (2022.12).

    Cheers,
    Alana



  • Thank you for your help, @atripp!

    Just to be sure... When we update to 2022.12, my changes in the procedure will be overwritten by yours, right? Or should we do something special during the update?


  • inedo-engineer

    Hi @inok_spb - that's correct, the proc is dropped/created each time ProGet is upgraded/installed.

