Welcome to the Inedo Forums! Check out the Forums Guide for help getting started.

If you are experiencing any issues with the forum software, please visit the Contact Form on our website and let us know!

Database connection/healthcheck issues when SQL server isntance is switched in an Availability group.



  • Hello,

    We have posted about this issue before but would like to post about it again as we noticed something different. When doing maintenance on the SQL server and flipping the primary in the Availability group, it seems like an instance of proget gets hung up. Some come back fine, some get stuck on a message about the database being in an availability group. When this happens, the main page of proget shows a 500 error with the error message (which is great), however, on that same instance while the main page returns the 500, the /health endpoint returns a 200 and says the database status is 'ok'. It even shows that as being the case with the dbcache=false option.

    It almost seems like Proget has one connection to the database and the healthcheck has another so the healthcheck isn't actually reporting anything when proget itself has a problem. If this is going to be consistant behavior we can just hit / for our healthcheck in HAProxy instead of /health but that seems counter-intuitive.

    Thanks!


  • inedo-engineer

    Hi @cshipley_6136 ,

    When you specify dbcache=false, then ProGet will immediately invalidate the cache used for database connection-testing. This is otherwise cached for up to 5 minutes. Otherwise, there isn't anything that ProGet (as in the code we write) that would cache or hold any connection.

    So, this is ultimately related to ADO.NET's internal DNS caching / connection pooling /etc., and it would likely resolve within a few minutes on its own. But it's hard to say exactly what's going on. ADO.NET is the "driver" that's used by all .NET-based applications to connect to SQL Server, so I suspect this is solvable with configuration of some kind.

    Unfortunately, I'm not sure on your configuration nor advanced SQL Clustering scenarios. On our end, we just "let the driver" do the work and follow ADO.NET best practices with regards to driver usage.

    I did some initial searching, and I wonder if you just need additional configuration on your server connection string, like MultiSubnetFailover =true. They discuss a lot of information on it: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/sqlclient-support-for-high-availability-disaster-recovery

    Happy to help parse some of it, but please keep in mind this is a "driver" issue and we're learning about the internals for the first time, along with you. But the MultiSubnetFailover may help for a faster cut-over??

    NOTE: if / failed but /health didn't, then it's most certainly related to this internal connection pooling, and it wouldn't be consistant at all. I would most definitely not hit the / page for a health check, as that will introduce a lot of load on your instance and cause reliability issues.



  • Alright, so we have setup a dedicated listener for the proget database and updated the connection string accordingly. We have also added the MultiSubnetFailover=True parameter to the connection string as well. It has been running like this for a couple weeks.

    The interesting thing is Today we had an issue with one of the instances where / was showing a 500 error with the following message "A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 - Could not open a connection to SQL Server: Could not open a connection to SQL Server) For more information, visit the Error Log Page."

    While the / was showing this, the /health endpoint was showing Database: OK and so did the /health?dbcache=fase

    So nothing pulled this instance out of the pool on the load balancer.

    What's odd is nothing really change infrastructure wise to cause a connectivity problem.

    All that being said, the biggest issue we need to resolve is what is the proper way to do health check to ensure if there is an error it does indeed get pulled out of the backend of the load balancer as the /health endpoint doesn't seem to give accurate information.

    I am also seeing the following in logs happening every 30 seconds or so.

    2023-11-28T15:23:04.177166423-05:00 Running Failover Detection...
    2023-11-28T15:23:04.177166423-05:00 Running Failover Detection...
    2023-11-28T15:23:04.179161015-05:00 Failover Detection completed.
    2023-11-28T15:23:04.179161015-05:00 Failover Detection completed.
    

  • inedo-engineer

    Hi @cshipley_6136 ,

    Unfortunately we're really at a loss for how to troubleshoot the SQL Server further, and it's happening at the driver level. I'm not trying to "pass the buck" here, but it's most definitely NOT a ProGet issue (as in, code that we wrote and have control over).

    This must be happening at a lower-level (like the sql server driver, dns resolution, kubernetes, front-end caching, etc).

    Most likely, the / vs /health is related to caching an error, or some "deep internal" behavior of the SQL server driver (or bug on Linux?) that we're not aware of.

    From here, it's going to make sense to bring in Microsoft, who will know how to troubleshoot/diagnose this further. The "TCP Provider, error: 40" is so generic, and means the same thing as Chrome's "name not resolved". Aside from the obvious, we have no idea what could cause such an error, nor do we know how to troubleshoot.

    They may have some "secret flags" in the connection string that you can enable, etc.

    I tried to search for help, but there are like one thousand results that are all over the place, from clearing DNS caching to rebooting, etc.

    On our end, our code doesn't deal with any of these things... we just use the driver to invoke SQL Server commands. Here is the entirety of the /health code:

        private static readonly LazyCached<IList<Feed>> feeds = new(() => DB.Feeds_GetFeeds(false).Select(f => Feed.GetFeed(f)).ToList());
    
        protected override async Task ProcessRequestAsync(AhHttpContext context)
        {
            context.Response.ContentType = "application/json";
    
            using var writer = new JsonTextWriter(new StreamWriter(context.Response.OutputStream, InedoLib.UTF8Encoding)) { Formatting = Formatting.Indented };
    
            // added for specific customer and not documented (EDO-9257)
            if (string.Equals(context.Request.QueryString["dbcache"], "false", StringComparison.InvariantCultureIgnoreCase))
                feeds.Invalidate();
    
            writer.WriteStartObject();
    
            writer.WritePropertyName("applicationName");
            writer.WriteValue("ProGet");
    
            try
            {
                var _ = feeds.Value;
    
                writer.WritePropertyName("databaseStatus");
                writer.WriteValue("OK");
    
                writer.WritePropertyName("databaseStatusDetails");
                writer.WriteNull();
            }
            catch (Exception ex)
            {
                writer.WritePropertyName("databaseStatus");
                writer.WriteValue("Error");
    
                writer.WritePropertyName("databaseStatusDetails");
                writer.WriteValue(ex.Message);
            }
    
            writer.WritePropertyName("extensionsInstalled");
            writer.WriteStartObject();
            foreach (var e in ExtensionsManager.GetExtensions())
            {
                writer.WritePropertyName(e.Name);
                writer.WriteValue(e.Version.ToString());
            }
            writer.WriteEndObject();
    
            var license = LicensingInformation.Current;
    
            writer.WritePropertyName("licenseStatus");
            if (!license.IsValid)
            {
                writer.WriteValue("Error");
    
                writer.WritePropertyName("licenseStatusDetail");
                writer.WriteValue(license.LicenseKeyStatusDescription);
            }
            else
            {
                writer.WriteValue("OK");
    
                writer.WritePropertyName("licenseStatusDetail");
                writer.WriteNull();
            }
    
            writer.WritePropertyName("versionNumber");
            writer.WriteValue(versionNumber.Value);
    
            writer.WritePropertyName("releaseNumber");
            writer.WriteValue(releaseNumber.Value);
    
            var serviceStatus = await ProGetServiceMessenger.GetStatusAsync();
    
            writer.WritePropertyName("serviceStatus");
            writer.WriteValue(serviceStatus.Status == ExtendedServiceStatus.Running ? "OK" : "Error");
    
            writer.WritePropertyName("serviceStatusDetail");
            writer.WriteValue(serviceStatus.Status != ExtendedServiceStatus.Running ? serviceStatus.ErrorText : null);
    
            var r = replicationStatus.Value;
            writer.WritePropertyName("replicationStatus");
            if (r != null)
            {
                writer.WriteRawValue(JsonConvert.SerializeObject(r, Formatting.Indented));
            }
            else
            {
                writer.WriteNull();
            }
    
    
            writer.WriteEndObject();
           
        }
    

    To "translate" what's happening, the stored procedure Feeds_GetFeeds is invoked to test database connectivity. It's invoked following all of Microsoft's guidance for using the SQL Server driver.



  • Here are some other details of the error that may help. There seem to be two errors happening at once, they have the same timestamp in logs. They are probably related/the same.

    Category: WEB
    
    Message: An error occurred processing a GET request to http://subdomain.tld/nuget/approved-nuget/v3/registrations-gz/system.security.cryptography.algorithms/index.json: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 - Could not open a connection to SQL Server: Could not open a connection to SQL Server)
    
    Details: Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 - Could not open a connection to SQL Server: Could not open a connection to SQL Server)
    at 
    Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
    at 
    Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
    at 
    Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
    at 
    Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
    at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
    at 
    Microsoft.Data.SqlClient.SqlConnection.InternalOpenAsync(CancellationToken cancellationToken)
    --- End of stack trace from previous location ---
    at 
    Inedo.Data.SqlServerDatabaseContext.CreateConnectionAsync()
    at 
    Inedo.Data.DatabaseContext.ExecuteInternalAsync(String storedProcName, GenericDbParameter[] parameters)
    at 
    Inedo.Data.SqlServerDatabaseContext.ExecuteInternalAsync(String storedProcName, GenericDbParameter[] parameters)
    at 
    Inedo.Data.DatabaseContext.ExecuteTableAsync[TRow](String storedProcName, GenericDbParameter[] parameters)
    at 
    Inedo.ProGet.Feeds.NuGet.NuGetFeed.GetPackageV3Async(String packageId) in C:\Users\builds\AppData\Local\Temp\InedoAgent\BuildMaster\192.168.44.60\Temp\_E340542\Src\ProGetCoreEx\Feeds\NuGet\NuGetFeed.cs:line 230
    at
    Inedo.ProGet.WebApplication.FeedEndpoints.NuGet.V3.PackageMetadataHandler.ProcessRequestAsync(NuGetFeed feed, AhHttpContext context, WebApiContext apiContext, String relativeUrl) in C:\Users\builds\AppData\Local\Temp\InedoAgent\BuildMaster\192.168.44.60\Temp\_E340542\Src\ProGet.WebApplication\FeedEndpoints\NuGet\V3\PackageMetadataHandler.cs:line 35
    at 
    Inedo.ProGet.WebApplication.FeedEndpoints.NuGet.NuGetFeedHandler.ProcessRequestAsync(AhHttpContext context, WebApiContext apiContext, NuGetFeed feed, String relativeUrl) in C:\Users\builds\AppData\Local\Temp\InedoAgent\BuildMaster\192.168.44.60\Temp\_E340542\Src\ProGet.WebApplication\FeedEndpoints\NuGet\NuGetFeedHandler.cs:line 30
    at 
    Inedo.ProGet.WebApplication.FeedEndpoints.FeedEndpointHandler.FeedRequestHandler.ProcessRequestAsync(AhHttpContext context) in C:\Users\builds\AppData\Local\Temp\InedoAgent\BuildMaster\192.168.44.60\Temp\_E340542\Src\ProGet.WebApplication\FeedEndpoints\FeedEndpointHandler.cs:line 188
    ClientConnectionId:00000000-0000-0000-0000-000000000000
    
    Category: Web
    
    Message: An error occurred in the web application: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 - Could not open a connection to SQL Server: Could not open a connection to SQL Server)
    
    Details:
    URL: http://subdomain.tld/nuget/approved-nuget/v3/registrations-gz/microsoft.identitymodel.logging/index.json
    Referrer: (not set)
    User: (unknown)
    User Agent: NuGet VS VSIX/6.7.0 (Microsoft Windows NT 10.0.19045.0, Professional/17.0)
    Stack trace: at 
    Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
    at 
    Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
    at 
    Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
    at 
    Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
    at 
    Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
    at 
    Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides)
    at 
    Inedo.Data.SqlServerDatabaseContext.CreateConnection()
    at 
    Inedo.Data.DatabaseContext.ExecuteInternal(String storedProcName, GenericDbParameter[] parameters)
    at 
    Inedo.Data.SqlServerDatabaseContext.ExecuteInternal(String storedProcName, GenericDbParameter[] parameters)
    at 
    Inedo.Data.StrongDataReader.Read[TRow](Func`1 getReader, Boolean disposeReader)+MoveNext()
    at 
    System.Linq.Enumerable.TryGetFirst[TSource](IEnumerable`1 source, Boolean& found)
    at 
    Inedo.ProGet.Data.DB.Context.Feeds_GetFeed(Nullable`1 Feed_Id, String Feed_Name) in C:\Users\builds\AppData\Local\Temp\InedoAgent\BuildMaster\192.168.44.60\Temp\_E340542\Src\ProGetCoreEx\InedoLib.Analyzers\InedoLib.Analyzers.DatabaseContextGenerator\DB.g.cs:line 5726
    at 
    Inedo.ProGet.WebApplication.FeedEndpoints.FeedEndpointHandler.GetHttpHandler(GetHandlerParams args) in C:\Users\builds\AppData\Local\Temp\InedoAgent\BuildMaster\192.168.44.60\Temp\_E340542\Src\ProGet.WebApplication\FeedEndpoints\FeedEndpointHandler.cs:line 43
    at 
    Inedo.Web.Handlers.DynamicHttpHandling.GetHandler(AhHttpContext context, String requestType, String url, String pathTranslated)
    at Inedo.Web.AhWebMiddleware.InvokeAsync(HttpContext context)
    
    

  • inedo-engineer

    Thanks @cshipley_6136 ; they're both the same "error 40" - which is a hostname/dns issue.

    I doubt its a bug in the SQL Server driver (but it could be)... the most likely culprit is some kind of networking/configuration issues.

    I would try to identify what's happening around that timestamp, network changes, etc. This is where our expertise in troubleshooting is not so strong. It's a really easy error to reproduce (just enter a wrong host name in the connection string, disable dns, unplug a network cable, etc), but tracing the underlying cause requires lots of poking around on the network.



  • @atripp So what do you guys recommend for HA Proxy to do a health check? What is causing the connection error to SQL seems to be a separate issue than the /health endpoint not reporting accurately as everything in the /health endpoint object is OK. The code you posed seems to do it's own call to the the get_feeds proc, that leads me to think that the connection its getting from the connection pool for the health check is working fine but somewhere in the application it's getting an error. Can something be added to the health check to account for this scenario? Like maybe a flag that gets set on these errors that the health check endpoint also looks at?


  • inedo-engineer

    Hi @cshipley_6136 ,

    We recommend using the /health monitoring endpoint. As you can see, it makes a connection to the database by calling that stored procedure.

    We do not recommend calling other URLs; and there is no reason another page would work. Calling other pages will impact performance and likely yield incorrect results.

    You're right, it might have to do with a "connection pool" problem -- but that's all managed at the driver level. You can control connection pool settings by changing connection string values, but to be honest we have no idea what the impact will be.

    Your best bet is to open a case with Microsoft, who has the expertise to know how to troubleshoot these intermittent "error 40's". They may have specialized tools or "secret debugging commands" you can use.

    This is all happening at the operating-system or driver-level, and while we're doing our absolute best to help you troubleshoot, but we have never seen this happen before and our main resource is searching google. And this is not an easy error to search for,..as you've probably noticed.

    Please keep us in the loop.

    Best,
    Alana


Log in to reply
 

Inedo Website HomeSupport HomeCode of ConductForums GuideDocumentation