Welcome to the Inedo Forums! Check out the Forums Guide for help getting started.

If you are experiencing any issues with the forum software, please visit the Contact Form on our website and let us know!

Linux Performance & SQL Server Issues


  • inedo-engineer

    I wasn't quite ready to write a blog post on this, but I figured I'd journal my thoughts on the forums to start - in the event that anyone is experience Linux performance issues with SQL Server.

    Background: Low Linux Servers for ProGet Cloud Trial Edition

    I've been "playing around" with low-spec Linux VMs (1-2 vCPU, 2-4 GB) for the purposes of creating "cloud trials" of ProGet. Eventually, I'd like to offer new users the ability to fill out a form on my.inedo.com that would automatically provision a server for them to evaluate.... but since we're paying for the resources, I want to keep our costs low.

    We would never recommend such a low spec for production usage, but its fine for testing. That is, until a powerful developer machine hits it with an onslaught of requests during a NuGet or npm restore. This means hundreds of requests per second that need to be proxied to public repositories if the packages aren't cached.

    To simulate this, I wrote a basic C# console program that just downloaded 1000's of packages in parallel from NuGet/npm feeds on ProGet with no caching enabled. This means that each request will trigger database queries and outbound requests to nuget.org. It's a DoS program, basically.

    My goal was to find out what combination of settings would allow ProGet to not totally crap out during evaluation usage, and maybe discover a way to warn users that ProGet is under heavy load.

    Unexpected Problem: Broken Microsoft SQL Server Driver

    WAS: Unexpected Problem: Linux SQL Server Warmup Required 🙄

    Turns out the errors I encountered were entirely related to the "new" SQL Server driver that ProGet 2024 was using. Although this was in beta for well over a year, it was "only" in production for a few months by the time we incorporated it into ProGet. This driver issue impacts anyone who uses .NET on Linux.

    As I mention below, I think it's pretty clear that Microsoft has effectively abandoned SQL Server. It's one thing to release something this low-quality and untested... and another to have it linger with issues like this in production for months. We will be moving to Postgres, which do not have these endemic problems.

    In the meantime, I have instituted the "one year rule" for Microsoft products - meaning, unless it's been shipped in their general release channel for a year, we will not even consider using it.

    I noticed that if I ran my DoS program shortly after the container started, I would see a flood of messages like this: Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)

    That was not at all expected.

    After researching this a bit, it looks like this is a well-known, multi-year bug in the Microsoft's SQL Server driver for Linux. It's documented in their GitHub repository, and it plagues even Microsoft's own business-line products. There's no ETA and virtually no response form Microsoft on the matter but it's endemic in their driver.

    Long story short, low-spec Linux instance of ProGet needs to "warmup". There is some kind of bug in the client that is triggered when you try to open a bunch of connections at the same time. But once the connections are open, it seems to be fine.

    In the short term, I don't know to address it. As long as the connections are opened "slowly enough" it's fine.... maybe? But no idea how to control that.

    Expected Problem: Too Many Requests

    A single, underspec'd machine can only handle so much load, and I wanted to find out what those limits were, and what errors I would get in "new user evaluation" scenario.

    On Linux, with my test scenario, this manifests with basically the same, "error 35". I believe it's socket exhaustion, but who knows? In most production cases, we see SQL/network timeout - this is because the database is usually filled with a ton of data, and is taking a bit longer to respond. In my test scenario, it wasn't.

    Non-solution: Slow the REquests

    When I added a client-slide throttle - or even spaced out issuing the requests by 10ms - there were virtually no errors. If only NuGet and npm clients behaved that way...

    Solution : Concurrent Request Limit

    Under Admin > HTTP/S & Certificate Settings > Web Server Configuration, you can set a value called "Concurrent Request Limit". That made all the difference after warming up.

    25 Requests worked like a charm. No matter what I threw at the machine, it could handle it. The web browsing experience was terrible, which makes me think we may want to create a separate throttle for web vs API in a case like this.

    250 Requests caused moderate errors. Performance was better while browsing the web, but I'd get an occasional 500 error.

    I wish I could give better advice, but "playing with this value" seems to be the best way to solve these problems in production. For our evaluation machines, I think 25ish is fine.

    Another Solution : Nginx Request Limiting

    I wanted to compare ProGet's built-in limiter with Nginx's rate limited when used as a reverse proxy. I played with it a bunch, and found that the same "25" setting basically eliminated errors on my micomachine.

    Here's my Nginx config file:

    limit_req_zone global zone=global_limit:10m rate=25r/s;
    
    server {
        listen 80;
        server_name localhost;
        location / {
            limit_req zone=global_limit burst=500 nodelay;
            proxy_pass http://127.0.0.1:8000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
    

    Interestingly enough, it also made the web browsing experience seem much better. I'll be honest, I have no idea how, but it has something to do with that nodelay keyword, and some algorithm they use.

    My conclusion is that Nginx does a better job of appearing to not be as slow, but like I mentioned earlier, having a separate queue for Web vs API requests would probably make our tiny Linux box run a lot better.

    Addressing Production Linux Performance Issues

    Ultimately, I think "playing" with request limiting and being aware of the "warm up" is important. A newly-spun up container may exhibit

    Long-term Solution: Moving away from SQL Server

    This pains me to admit as a lifelong Microsoft devotee, but it's time for us to move to Postgres. This quote from Brent Ozar, one of the world's most prominent Microsoft SQL Server consultants, sums the scenario up nicely:

    “[Microsoft SQL Server] shouldn’t be used for most new applications you build today. If you have an existing app built on it, you’re kinda stuck, but if you’re building a new application, you should use Postgres instead.”

    My opinion.... Microsoft has all-but given up on SQL Server; this issue in particular has not only been open for two years, but it impacts their own products -- and instead of fixing it, their own product teams simply are moving to other databases 🙄

    Many years back, we had a Postgres version of ProGet. We eventually dropped it because it's too much of a pain to maintain two sets of database code, and we never built the tools to migrate from one to the other.

    That's something we'll have to do when going back to Postgres -- build some sort of database migrator. We'll also need to support both versions for a bit, and eventually say goodbye to SQL Server. No timeline on this, but just something I've been thinking about lately.

    Anyway - just wanted to journal my thoughts, and the forums would be a nice place to post them - maybe I'll turn this to a blog or newsletter later :)

    Alex


  • inedo-engineer

    @apxltd heard this from a user on the support forums today...

    We also came across this issue: https://github.com/dotnet/SqlClient/issues/2378 which suggests it might have started with the recent 5.2.x versions of Microsoft.Data.SqlClient (I did confirm that this is the version Proget is using).

    Maybe we can downgrade to 5.1 and see if it helps at all



  • @apxltd said in Linux Performance & SQL Server Issues:

    Expected Problem: Too Many Requests

    A single, underspec'd machine can only handle so much load, and I wanted to find out what those limits were, and what errors I would get in "new user evaluation" scenario.

    On Linux, with my test scenario, this manifests with basically the same, "error 35". I believe it's socket exhaustion, but who knows? In most production cases, we see SQL/network timeout - this is because the database is usually filled with a ton of data, and is taking a bit longer to respond. In my test scenario, it wasn't.

    Non-solution: Slow the REquests

    When I added a client-slide throttle - or even spaced out issuing the requests by 10ms - there were virtually no errors. If only NuGet and npm clients behaved that way...

    Solution : Concurrent Request Limit

    Under Admin > HTTP/S & Certificate Settings > Web Server Configuration, you can set a value called "Concurrent Request Limit". That made all the difference after warming up.

    25 Requests worked like a charm. No matter what I threw at the machine, it could handle it. The web browsing experience was terrible, which makes me think we may want to create a separate throttle for web vs API in a case like this.

    250 Requests caused moderate errors. Performance was better while browsing the web, but I'd get an occasional 500 error.

    I wish I could give better advice, but "playing with this value" seems to be the best way to solve these problems in production. For our evaluation machines, I think 25ish is fine.

    This doesn't seem to be Linux-specific. We've run into performance issue with Angular projects (restoring several hundred npm packages all at once when building from scratch) in the past on our Windows server. The problem seemed to be the pool size of the SQL client.

    Our solution was to set the Concurrent Request Limit to 100, which seems to be the default pool size for the MS SQL client in .NET.

    Before setting the request limit, npm install was 4x slower than connecting to the original npm registry.

    After setting the request limit, connecting to our ProGet server was even faster than connecting to the original npm registry.

    This is on a virtual 2 Core, 8 GB RAM Windows Server 2016 hosting ProGet 2023 and the corresponding MS SQL Server 2016 Express Edition, if anyone is curious.


  • inedo-engineer

    @sebastian thanks for sharing!

    The issue that are describing sounds like more like the "ASP.NET Firehose" problem, and less like this obnoxious driver problem. They're both really obnoxious problems 👿

    The firehose problem hit basically anyone who went from .NET Framework to .NET Core, and had a high-traffic application. Basically, Microsoft gutted all of the "request throttling" that was in ASP.NET, which took into consideration a lot of factors, from system resources to processor type to cores, etc., and then queued requests appropriately --- now it's just left for the "user" to handle.

    It was on the .NET8 roadmap to get fixed, but I guess they gave us a broken sql driver instead.

    Anyway... great to hear the request throttle did the trick. I'm thinking we should just ship with a default of 100 and leave it at that. Maybe "wrap" some of the performance errors and advise the user to adjust the limit down or something.


  • inedo-engineer

    Turns out the errors I encountered were entirely related to the "new" SQL Server driver that ProGet 2024 was using. Although this was in beta for well over a year, it was "only" in production for a few months by the time we incorporated it into ProGet. This driver issue impacts anyone who uses .NET on Linux.

    As I mention, I think it's pretty clear that Microsoft has effectively abandoned SQL Server. It's one thing to release something this low-quality and untested... and another to have it linger with issues like this in production for months. We will be moving to Postgres, which do not have these endemic problems.

    In the meantime, I have instituted the "one year rule" for Microsoft products - meaning, unless it's been shipped in their general release channel for a year, we will not even consider using it.


Log in to reply
 

Inedo Website HomeSupport HomeCode of ConductForums GuideDocumentation