Otter server receives thousands of connections from agent after reboot
-
I have an Otter server and a managed device configured using the Listen for Inedo Agent (i.e. a Pull agent).
When the agent is first installed and the server object is first added to Otter, everything works as it should:
    16/01/2025 04:38:17: Starting agent connector for otter.lab.local:46336...
    16/01/2025 04:38:17 DEBUG: Attempting to establish connection with otter.lab.local:46336...
    16/01/2025 04:38:17 DEBUG: Connection established with otter.lab.local:46336.
However, if I reboot the machine that is running the Agent, it never reliably recovers:
    16/01/2025 04:50:20: Starting agent connector for otter.lab.local:46336...
    16/01/2025 04:50:20 DEBUG: Attempting to establish connection with otter.lab.local:46336...
    16/01/2025 04:50:20 DEBUG: Connection established with otter.lab.local:46336.
    16/01/2025 04:50:20: Connection to otter.lab.local:46336 dropped.
    16/01/2025 04:50:20 DEBUG: Attempting to establish connection with otter.lab.local:46336...
    16/01/2025 04:50:20 DEBUG: Connection established with otter.lab.local:46336.
    16/01/2025 04:50:20: Connection to otter.lab.local:46336 dropped.
    16/01/2025 04:50:20 DEBUG: Attempting to establish connection with otter.lab.local:46336...
    16/01/2025 04:50:20 DEBUG: Connection established with otter.lab.local:46336.
    16/01/2025 04:50:20: Connection to otter.lab:46336 dropped.
    ...
If I look at the Agent Listener Dashboard I can see thousands of connections (a SELECT COUNT(1) FROM AgentConnections; is now up to 33,058 in approx 15 mins), and the Diagnostics Center shows numerous messages:
    Error sending handshake response to 172.31.15.123:51530: System.ArgumentException: An item with the same key has already been added. Key: Inedo.Agents.PullAgentHostIdentifier
       at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
       at Inedo.Agents.InedoPullAgentServer.ConnectionEstablishedAsync(PullAgentHostIdentifier hostIdentifier, PullServerConnection connection, CancellationToken cancellationToken)
       at Inedo.Agents.AgentListener`1.ProcessIncomingConnection(TConnection channel)
If I restart the Otter Server service (optionally issuing TRUNCATE TABLE AgentConnections; beforehand), everything calms down again -- at least until the machine running the Agent reboots again (or the Agent service restarts).
Looking at a decompilation of the .NET, the Server service code appears to be trying to add to a private, in-memory collection of open connections (Inedo.Agents.InedoPullAgentServer.openConnections), ultimately indexed by what I think is the Agent's secret key (Inedo.Agents.PullAgentHostIdentifier.UniqueKey). There is some logic that tries to remove from this collection when it detects a disconnection, but I don't know if that logic is actually firing, or if it is subtly incorrect.
It seems to me that, when a connection is re-established after an agent is restarted, it would still always send the same secret key, so this add operation can never succeed while the stale entry remains in the collection.
I am not familiar enough with the logic of why the Server service needs to do this; I can only see that collection being manipulated, never actually queried (but I might not have a complete decompilation).
If there is a tangible reason for why the Server service needs to maintain this list, should it instead be added by openConnections[key] = value, which would overwrite any existing entry for that key (as opposed to .Add(key, value), which throws if the key exists); or should PullAgentHostIdentifier combine and compare more information (e.g. source IP and TCP port) to increase its uniqueness?
(I only have the one agent machine defined in my Otter lab at the moment, so it shouldn't be the case that two agents might exist with the same secret key value.)
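For anyone following along, the difference between the two insertion styles can be shown with a plain Dictionary. This is only a minimal sketch using string keys and values as stand-ins; AddVersusIndexerDemo and its method names are made up for illustration and are not part of the Inedo assemblies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class; the keys and values are stand-ins for
// PullAgentHostIdentifier and PullServerConnection.
public static class AddVersusIndexerDemo
{
    // Returns true if Dictionary.Add throws ArgumentException for a duplicate key.
    public static bool AddThrowsOnDuplicate()
    {
        var open = new Dictionary<string, string> { ["agent-key"] = "connection-1" };
        try
        {
            open.Add("agent-key", "connection-2");
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    // Returns the value stored after assigning through the indexer,
    // which replaces the existing entry instead of throwing.
    public static string IndexerOverwrites()
    {
        var open = new Dictionary<string, string> { ["agent-key"] = "connection-1" };
        open["agent-key"] = "connection-2";
        return open["agent-key"];
    }

    public static void Main()
    {
        Console.WriteLine(AddThrowsOnDuplicate());
        Console.WriteLine(IndexerOverwrites());
    }
}
```

Note that overwriting on its own would leave the replaced connection object un-disposed, so any real fix would presumably also need to clean up the old entry.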
Also, during my diagnosis, I can see that there is the intention of a (potentially-configurable) 30-second delay between reconnection attempts (Inedo.Agents.AgentConnectionConfig.ReconnectDelay), but I can't see it being used anywhere (again, it may be an incomplete decompilation).
I was seeing reconnections as much as 1,000 times per second, so there clearly isn't a delay enforced anywhere else.
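To illustrate what I would expect a delay like that to do, here is a rough sketch of a reconnect loop that waits between attempts. Everything in it is an assumption for illustration: ReconnectLoopSketch, connectOnce and maxAttempts are hypothetical names, and I have no idea how the real agent structures its loop.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; not the Inedo agent's actual reconnect logic.
public static class ReconnectLoopSketch
{
    // connectOnce stands in for a single connection attempt and is assumed
    // to complete when the connection drops. Returns the number of attempts made.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task> connectOnce,
        TimeSpan reconnectDelay,
        int maxAttempts,
        CancellationToken cancellationToken)
    {
        int attempts = 0;
        while (attempts < maxAttempts && !cancellationToken.IsCancellationRequested)
        {
            attempts++;
            try
            {
                await connectOnce(cancellationToken).ConfigureAwait(false);
            }
            catch (Exception ex) when (!(ex is OperationCanceledException))
            {
                Console.WriteLine($"Connection attempt {attempts} failed: {ex.Message}");
            }

            // Wait before the next attempt so a dropped connection
            // cannot turn into a tight reconnect loop.
            if (attempts < maxAttempts)
                await Task.Delay(reconnectDelay, cancellationToken).ConfigureAwait(false);
        }
        return attempts;
    }
}
```

With a 30-second delay, a connection that drops immediately would retry about twice a minute rather than hundreds of times per second.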
As an aside, I also note that the /administration/agent-listener page in the web application does not let me delete stale connections -- a per-row button exists, but throws a JavaScript error.
(Not that I want to manually click each connection when I have thousands of them to clean up -- a bulk-delete button would be a useful feature here! Some paging on that screen would be a bonus, too.)
Is issuing DELETE FROM AgentConnections WHERE ConnectionStatus_Code = 'D'; sufficient for me to clean these up from the view?
-
Having attached a debugger to the Otter.Service.exe process in my lab, I am fairly certain that InedoPullAgentServer.HandleDisconnected (see Inedo.Agents.Client.dll) is never actually fired. This method is the only one I can see that is responsible for removing active connections from the openConnections collection.
It is supposed to be fired by the PullServerConnection.Disconnected event (inherited from AgentConnection<T>) -- at least, when the connection is created by InedoAgentClientListener.CreateConnection, an event handler is bound which would call HandleDisconnected, but I can't see anywhere where the base event is ever actually triggered.
To restore stability to my lab system, I have monkey-patched the InedoPullAgentServer.ConnectionEstablishedAsync method, so that the new connection always overwrites any existing one in the collection. I have tried to clean up the existing connection, but asynchronous C# is not my strongest suit, and I can't see through the multiple layers of indirection to determine if there are any major reasons why Otter shouldn't do this.
The patched method is below, for your review; feel free to use it if it suits your needs:
    internal async Task ConnectionEstablishedAsync(PullAgentHostIdentifier hostIdentifier, PullServerConnection connection, CancellationToken cancellationToken)
    {
        await this.ValidateConnectionAsync(hostIdentifier, connection, cancellationToken).ConfigureAwait(false);

        Dictionary<PullAgentHostIdentifier, PullServerConnection> dictionary = this.openConnections;
        lock (dictionary)
        {
            PullServerConnection existingConnection;
            if (this.openConnections.TryGetValue(hostIdentifier, out existingConnection) && existingConnection != null)
            {
                using (existingConnection)
                {
                    try
                    {
                        // HACK: provides an exception with valid stack trace to HandleDroppedConnection below
                        // ASSUMPTION: HandleDroppedConnection records the exception somewhere, e.g. the Diagnostics Centre, so
                        //             needs to be filled with useful info; if not, we can avoid the try/catch
                        // TODO: is there any more suitable exception here...?
                        throw new Exception("An existing pull agent connection was abandoned");
                    }
                    catch (Exception ex)
                    {
                        // NOTE: HandleDisconnected also locks on (dictionary = openConnections), so this has to be
                        //       done within the current thread to avoid deadlocking
                        // TODO: are there any other side-effects...?

                        // update the back-end database AgentConnections and clean up the in-memory collection
                        existingConnection.HandleDroppedConnection(ex);
                        this.HandleDisconnected(existingConnection);
                    }
                }
            }

            // ...regardless, we always overwrite any existing entry in the in-memory collection
            // (dictionary's indexed setter using InsertionBehavior.Overwrite)
            this.openConnections[hostIdentifier] = connection;
        }
    }
(Note that the above was derived from a decompilation tool, so may not exactly match your existing naming, etc.)
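To make the suspected failure mode concrete, here is a minimal sketch of the pattern as I understand it: a handler is subscribed to a Disconnected event and removes the entry from the collection, but that only helps if something actually raises the event. SketchConnection, SketchRegistry and RaiseDisconnected are hypothetical names for illustration only, not the Inedo types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a connection type; not the Inedo class.
public class SketchConnection
{
    public event EventHandler Disconnected;

    // Unless something calls this, subscribers are never notified
    // and the registry below never removes the entry.
    public void RaiseDisconnected() => Disconnected?.Invoke(this, EventArgs.Empty);
}

// Hypothetical stand-in for the server-side collection of open connections.
public class SketchRegistry
{
    private readonly Dictionary<string, SketchConnection> open = new Dictionary<string, SketchConnection>();

    public int Count { get { lock (open) return open.Count; } }

    public void Register(string key, SketchConnection connection)
    {
        connection.Disconnected += (s, e) =>
        {
            lock (open)
            {
                // Remove only if the entry still refers to this connection,
                // so a late event cannot evict a newer replacement.
                if (open.TryGetValue(key, out var current) && ReferenceEquals(current, connection))
                    open.Remove(key);
            }
        };

        // The indexer overwrites any stale entry for the same key.
        lock (open) open[key] = connection;
    }
}
```

If RaiseDisconnected is never called, Count stays at 1 forever, which matches what I think I am seeing with openConnections.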
-
@jimbobmcgee thanks again for the detailed analysis & debugging, definitely not easy doing so "blind" like that :)
Anyway we will investigate/patch via OT-516 - I haven't looked but what you're describing sounds like a reasonable conclusion, i.e. the incoming agent isn't getting matched up.