I wasn't quite ready to write a blog post on this, but I figured I'd journal my thoughts on the forums to start - in the event that anyone is experiencing Linux performance issues with SQL Server.
Background: Low-spec Linux Servers for ProGet Cloud Trial Edition
I've been "playing around" with low-spec Linux VMs (1-2 vCPU, 2-4 GB) for the purposes of creating "cloud trials" of ProGet. Eventually, I'd like to offer new users the ability to fill out a form on my.inedo.com that would automatically provision a server for them to evaluate.... but since we're paying for the resources, I want to keep our costs low.
We would never recommend such a low spec for production usage, but it's fine for testing. That is, until a powerful developer machine hits it with an onslaught of requests during a NuGet or npm restore. This means hundreds of requests per second that need to be proxied to public repositories if the packages aren't cached.
To simulate this, I wrote a basic C# console program that just downloaded thousands of packages in parallel from NuGet/npm feeds on ProGet with no caching enabled. This means that each request will trigger database queries and outbound requests to nuget.org. It's a DoS program, basically.
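My actual tool was a throwaway C# console program; here's a rough Python sketch of the same idea, just to show the shape of it. The `hammer` function and the `fetch` callable are my made-up names here - `fetch` stands in for whatever actually does the HTTP GET against the feed:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def hammer(urls, fetch, max_workers=200):
    """Fire off every download at once and tally successes vs. errors."""
    ok, failed = 0, 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for f in as_completed(futures):
            try:
                f.result()   # re-raises any exception from the download
                ok += 1
            except Exception:
                failed += 1
    return ok, failed
```

In the real test, `fetch` was an HTTP GET against un-cached NuGet/npm feed URLs on the ProGet instance, so every request fanned out into database queries plus an outbound call to nuget.org.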
My goal was to find out what combination of settings would allow ProGet to not totally crap out during evaluation usage, and maybe discover a way to warn users that ProGet is under heavy load.
Unexpected Problem: Broken Microsoft SQL Server Driver
(This section was originally titled "Unexpected Problem: Linux SQL Server Warmup Required".)
Turns out the errors I encountered were entirely related to the "new" SQL Server driver that ProGet 2024 was using. Although this was in beta for well over a year, it was "only" in production for a few months by the time we incorporated it into ProGet. This driver issue impacts anyone who uses .NET on Linux.
As I mention below, I think it's pretty clear that Microsoft has effectively abandoned SQL Server. It's one thing to release something this low-quality and untested... and another to have it linger with issues like this in production for months. We will be moving to Postgres, which does not have these endemic problems.
In the meantime, I have instituted the "one year rule" for Microsoft products - meaning, unless it's been shipped in their general release channel for a year, we will not even consider using it.
I noticed that if I ran my DoS program shortly after the container started, I would see a flood of messages like this:

Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
That was not at all expected.
After researching this a bit, it looks like this is a well-known, multi-year bug in Microsoft's SQL Server driver for Linux. It's documented in their GitHub repository, and it plagues even Microsoft's own business-line products. There's no ETA and virtually no response from Microsoft on the matter, but it's endemic in their driver.
Long story short, a low-spec Linux instance of ProGet needs to "warm up". There is some kind of bug in the client driver that is triggered when you try to open a bunch of connections at the same time. But once the connections are open, it seems to be fine.
In the short term, I don't know how to address it. As long as the connections are opened "slowly enough" it's fine.... maybe? But I have no idea how to control that.
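One workaround I've been toying with (not something we ship, just a sketch of "open the connections slowly enough") is to pre-open the pool one connection at a time at startup, with a small delay and a retry, before any real traffic arrives. Everything here is hypothetical naming; `connect` stands in for whatever actually opens a SqlConnection:

```python
import time

def warm_up(connect, pool_size=10, delay=0.1, retries=5):
    """Open connections one at a time so simultaneous opens don't trigger the driver bug."""
    conns = []
    for _ in range(pool_size):
        for attempt in range(retries):
            try:
                conns.append(connect())
                break  # this connection is open; move on to the next one
            except Exception:
                if attempt == retries - 1:
                    raise  # persistent failure; give up and surface the error
                time.sleep(delay * (attempt + 1))  # back off a bit more each retry
        time.sleep(delay)  # space out the opens - "slowly enough", per above
    return conns
```

Whether holding the connections open like this actually keeps the pool warm in the real driver is exactly the part I'm unsure about.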
Expected Problem: Too Many Requests
A single, underspec'd machine can only handle so much load, and I wanted to find out what those limits were and what errors I would get in a "new user evaluation" scenario.
On Linux, with my test scenario, this manifests as basically the same "error 35". I believe it's socket exhaustion, but who knows? In most production cases, we see SQL/network timeouts instead - that's because the database is usually filled with a ton of data and takes a bit longer to respond. In my test scenario, it wasn't.
Non-solution: Slow the Requests
When I added a client-side throttle - or even spaced out issuing the requests by 10ms - there were virtually no errors. If only NuGet and npm clients behaved that way...
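The "space them out by 10ms" version is trivial; a sketch of what the client-side throttle looked like, with `send` standing in for the real request call (names are mine, not from any library):

```python
import time

def send_throttled(items, send, interval=0.010):
    """Issue requests one after another, waiting `interval` seconds between them."""
    results = []
    for item in items:
        results.append(send(item))
        time.sleep(interval)  # a 10 ms gap was enough to virtually eliminate errors
    return results
```

Of course, this only helps when you control the client - which is exactly why it's a non-solution for NuGet/npm restores.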
Solution: Concurrent Request Limit
Under Admin > HTTP/S & Certificate Settings > Web Server Configuration, you can set a value called "Concurrent Request Limit". That made all the difference after warming up.
25 Requests worked like a charm. No matter what I threw at the machine, it could handle it. The web browsing experience was terrible, which makes me think we may want to create a separate throttle for web vs API in a case like this.
250 Requests caused moderate errors. Performance was better while browsing the web, but I'd get an occasional 500 error.
I wish I could give better advice, but "playing with this value" seems to be the best way to solve these problems in production. For our evaluation machines, I think 25ish is fine.
Another Solution: Nginx Request Limiting
I wanted to compare ProGet's built-in limiter with Nginx's rate limiting when used as a reverse proxy. I played with it a bunch, and found that the same "25" setting basically eliminated errors on my micro machine.
Here's my Nginx config file:
limit_req_zone global zone=global_limit:10m rate=25r/s;

server {
    listen 80;
    server_name localhost;

    location / {
        limit_req zone=global_limit burst=500 nodelay;
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Interestingly enough, it also made the web browsing experience seem much better. I'll be honest, I have no idea exactly how, but it has something to do with that nodelay keyword and the queuing algorithm nginx uses.
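My reading of the nginx docs (so take this as my mental model, not inside knowledge): limit_req is essentially a token bucket - rate=25r/s refills one token every 40ms, burst=500 is the bucket depth, and nodelay means requests that fit in the burst are forwarded immediately instead of being queued out at 40ms intervals. A toy model of that, with timestamps passed in explicitly so it's deterministic:

```python
class TokenBucket:
    """Toy model of nginx limit_req with nodelay: admit a request if a token is available."""
    def __init__(self, rate, burst):
        self.rate = rate      # tokens refilled per second (e.g. 25)
        self.burst = burst    # bucket depth (e.g. 500)
        self.tokens = burst   # bucket starts full
        self.last = 0.0       # timestamp of the last check

    def allow(self, now):
        # refill tokens for the time elapsed, capped at the burst depth
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True       # forwarded immediately (nodelay)
        return False          # rejected; without nodelay it would be queued instead
```

That immediate forwarding of the burst is presumably why browsing felt snappier than with a hard concurrency cap.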
My conclusion is that Nginx does a better job of masking the slowdown, but like I mentioned earlier, having a separate queue for Web vs API requests would probably make our tiny Linux box run a lot better.
Addressing Production Linux Performance Issues
Ultimately, I think "playing" with request limiting and being aware of the "warm up" is important. A newly-spun-up container may exhibit the "error 35" connection failures described above until its SQL connections are established.
Long-term Solution: Moving away from SQL Server
This pains me to admit as a lifelong Microsoft devotee, but it's time for us to move to Postgres. This quote from Brent Ozar, one of the world's most prominent Microsoft SQL Server consultants, sums the scenario up nicely:
"[Microsoft SQL Server] shouldn't be used for most new applications you build today. If you have an existing app built on it, you're kinda stuck, but if you're building a new application, you should use Postgres instead."
My opinion.... Microsoft has all but given up on SQL Server; this issue in particular has not only been open for two years, but it impacts their own products -- and instead of fixing it, their own product teams are simply moving to other databases.
Many years back, we had a Postgres version of ProGet. We eventually dropped it because it's too much of a pain to maintain two sets of database code, and we never built the tools to migrate from one to the other.
That's something we'll have to do when going back to Postgres -- build some sort of database migrator. We'll also need to support both versions for a bit, and eventually say goodbye to SQL Server. No timeline on this, but just something I've been thinking about lately.
Anyway - just wanted to journal my thoughts, and the forums seemed like a nice place to post them - maybe I'll turn this into a blog or newsletter later :)
Alex