ProGet database connection issues when the primary server in a SQL Server Always On availability group changes

cshipley_6136

We are having another connection issue, and we figured out it is the same one that was reported before.

We run four instances of ProGet in containers for HA. Two of the four instances showed errors during the switchover and sync, then recovered fine. The whole cluster seemed healthy after that. Upon further investigation, the other two instances were still showing errors hours after the fact. One was getting a login error, and the other reported that the server was in an inaccessible state due to the availability group. It's as if they held open connections and wouldn't switch over when the listener was pointed back to the primary SQL server. I have logs showing this.
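To illustrate the "held open connections" theory above, here is a minimal sketch of a client that throws away its connection whenever a query fails and opens a fresh one on the retry, so the next attempt goes back through the listener. This is not ProGet's code; `ReconnectingClient` and the `connect` factory are hypothetical names, and the factory is assumed to return a DB-API style connection (for example from a SQL Server driver).

```python
from typing import Any, Callable, List, Optional


class ReconnectingClient:
    """Holds one connection and replaces it whenever a query fails.

    Hypothetical helper for illustration only. `connect` is assumed to be a
    zero-argument factory returning a DB-API style connection.
    """

    def __init__(self, connect: Callable[[], Any]) -> None:
        self._connect = connect
        self._conn: Optional[Any] = None

    def query(self, sql: str, retries: int = 1) -> List[Any]:
        attempt = 0
        while True:
            try:
                if self._conn is None:
                    # A new connection resolves the listener again, so it
                    # should land on whichever server is primary now.
                    self._conn = self._connect()
                cursor = self._conn.cursor()
                cursor.execute(sql)
                return cursor.fetchall()
            except Exception:
                # Broad catch for brevity; real code would catch the
                # driver's own connection and login error types.
                self._discard()
                if attempt >= retries:
                    raise
                attempt += 1

    def _discard(self) -> None:
        """Drop the current connection so it is never reused."""
        conn, self._conn = self._conn, None
        if conn is not None:
            try:
                conn.close()
            except Exception:
                pass  # The connection is already unusable; ignore.
```

The point of the design is that a failed query never leaves the same connection in place for the next caller, which is the behavior the two stuck instances did not seem to have.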

This exposes another large issue with the ProGet health check and cluster health. When I logged into one of the working instances, it reported that all nodes were healthy and the cluster had no issues. Also, on the two instances that could not connect to the database and were getting SQL errors, the /health endpoint was still reporting database: OK. This takes the whole service down: because the health check passes, nothing can pull those two instances out of the load balancer pool. If the health check had returned a 500 or reported a database error, this problem wouldn't have taken our cluster down.
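For reference, this is roughly the behavior I would expect from a health check, sketched in Python. It is not how ProGet implements /health; `check_health` and the `connect` factory are hypothetical, and the factory is assumed to return a DB-API style connection. The idea is that every call opens a fresh connection and runs a trivial query, and any failure turns into a 500 that a load balancer can act on.

```python
from typing import Any, Callable, Dict, Tuple


def check_health(connect: Callable[[], Any]) -> Tuple[int, Dict[str, str]]:
    """Probe the database live and return (HTTP status, response body).

    Hypothetical sketch: nothing is cached between calls, so a node that
    cannot reach the database cannot keep reporting "OK".
    """
    conn = None
    try:
        conn = connect()          # fresh connection on every probe
        cursor = conn.cursor()
        cursor.execute("SELECT 1")  # cheap round trip to the server
        cursor.fetchone()
    except Exception as exc:
        # Any connection, login, or query failure marks the node unhealthy.
        return 500, {"database": "Error: " + type(exc).__name__}
    finally:
        if conn is not None:
            try:
                conn.close()
            except Exception:
                pass  # Closing a broken connection may itself fail.
    return 200, {"database": "OK"}
```

With a probe like this, the load balancer would see the 500 from the two broken instances and stop routing traffic to them.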

Here are some log entries of the errors. It's odd because all four instances use the same connection string, etc., but only two have a problem, and each showed a slightly different error.

I am not able to post logs here because of company privacy rules around hostnames and URLs. If a support ticket is opened, I can send them there.

Thanks in advance for the support!

stevedennis

Hi @cshipley_6136 ,

Since it sounds like there are some logs and sensitive info involved, I just submitted a ticket on your behalf (EDO-9257), so we'll work to troubleshoot from there!

Cheers,
Steve

reincarnator247_4909

Hello @stevedennis ,

I am experiencing similar issues after upgrading from 5.3 to the 2022 version. We run a two-instance cluster, and the errors show up hours afterwards, same as above.

We also see the same behavior: the /health endpoints were showing OK and the monitoring was showing the nodes as healthy when in fact they were not, so our load balancer did not work properly.

What is the solution to this problem?

stevedennis

Hi @reincarnator247_4909 , please submit a ticket for this with as many details as you can (types of error messages, configuration, etc.), so we can properly review and give advice. It's very case-by-case.

cshipley_6136

@reincarnator247_4909 If you open a ticket, can you post it here so I can follow along as well? Or do we have a solution? To me, a call to the /health endpoint should always actually check whether things are okay, not assume it's okay because something is cached.