Buildmaster - High CPU database since 6.2.22

philippe.camelio_3885 · 6 Jan 2021, 09:35

Hello
The SQL Serve is anormally overloaded since release 6.2.22.
One request :
DELETE [Executions]
WHERE [Execution_Id] = @Execution_Id
seems to impact a lot the server

I have seen two big table (7 Go and 12 Go):
dbo.ScopedExecutionLogs / dbo.ScopedExecutionLogEntries

any advice is welcome
Best regards
Philippe

atripp · 7 Jan 2021, 06:20

Hi Philippe,

I don't think there's anything in particular that would have changed that would cause this slowness.

Those tables are big, and I would definitely recommend purging old data from them.

In any case, the problem is likely index fragmentation. Deleting an execution will cascade to those tables, so I'm thinking there might be some kind of indexing fragmentation or problem.

Here's some information about how to detect and repair index fragmentation: https://inedo.com/support/kb/1167/sql-server-recommendations

Can you try that, and see if it helps?

Thanks, Alana

philippe.camelio_3885 · 8 Jan 2021, 14:28

Hello @atripp

I repaired the index, and it seems to help :)
I let you know what I still have the problem

Best Regards
PhilippeC.

philippe.camelio_3885 · 11 Jan 2021, 18:13

Hi @atripp
The problem is stil present.
Do you have any magic sql scripts to purge my huge tables ?
Thanks in advance

atripp · 14 Jan 2021, 03:52

Typically these are cleaned up via retention policies... but where is the DELETE [Executions] WHERE [Execution_Id] = @Execution_Id being called?

philippe.camelio_3885 · 21 Jan 2021, 16:17

Hello @atripp
This the Manual Execution Cleanup.

This process is really annoying.

You have no SQL scripts to cleanup the table ?

atripp · 22 Jan 2021, 10:54

We can definitely try to diagnose what's going on. What are the types of Executions in that table? Manual? Build? Etc?

philippe.camelio_3885 · 23 Jan 2021, 16:10

@atripp said in Buildmaster - High CPU database since 6.2.22:

what are the types of Executions in that table? Manual? Build? Etc?

Sorry, I am not sure to understand what you expect.
I guess every log - manual / build / deployement.

I have setup this in the Administration part:

Diagnostics.MinimumLogLevel : 20
Retention.KeepLastManualExecutionCount: 1000

I have 3 Retention policies:

Delete deployed release Retain 180 days
Delete artifacts for rejected build : Retain 30 days
Delete execution logs : retain 30 days

Therefore, logs haven't been removed for more than 36 month !!!
I have the same in ScopedExecutionsLogEntries

SELECT TOP (10) [Execution_Id]
      ,[LogEntry_Sequence]
      ,[Scope_Sequence]
      ,[LogEntry_Level]
      ,[LogEntry_Date]
  FROM [Otter].[dbo].[ScopedExecutionLogEntries]

Execution_Id	LogEntry_Sequence	Scope_Sequence	LogEntry_Level	LogEntry_Date
14	1	1	0	2019-04-02 09:09:23.500
14	2	1	0	2019-04-02 09:09:23.500
14	3	1	10	2019-04-02 09:09:23.500

atripp · 25 Jan 2021, 06:38

Hey @philippe-camelio_3885

You may be looking at the wrong database... based on Otter in your query.

Here's a way to see executions by execution type: SELECT [ExecutionMode_Code], COUNT(*) FROM [BuildMaster]..[Executions] GROUP BY [ExecutionMode_Code]

As for retention policies, you should be able to see the logs of those., and see what's being purged.

Anwyays, w'ell figure it out... hang tight!

philippe.camelio_3885 · 25 Jan 2021, 07:39

@atripp
Houps !

About the retention policies, when I start them manually, they remove few lines.

Here is the result of the SQL:

R 91369
M 164051
S 1560
T 377
B 999

Thanks
Philippe

atripp · 25 Jan 2021, 08:33

Thanks Philippe!! That's helpful.

Can you try this sql?

USE [BuildMaster]

BEGIN TRANSACTION

DELETE [Executions]
 WHERE [ExecutionMode_Code] IN ('R', 'M', 'T')
   AND ROW_NUMBER() OVER(PARTITION BY [ExecutionMode_Code] ORDER BY [Execution_Id] DESC) > 1000

SELECT [ExecutionMode_Code], COUNT(*) FROM [Executions] GROUP BY [ExecutionMode_Code]

ROLLBACK

You should then see results like this:

You can further inspect the tables, but this should do the trick. If the results look okay, then please run only the DELETE statement and then it will be fine.

Can you let me know if it works? we will incorporate this logic in BM-3659

philippe.camelio_3885 · 26 Jan 2021, 08:50

I have this error msg from SQL Management Studio or from osql

Msg 4108, Level 15, State 1, Line 7
Les fonctions fenêtrées peuvent uniquement apparaître dans les clauses SELECT ou ORDER BY

I am not good enough in SQL to overcome this. Any tricks ?

atripp · 26 Jan 2021, 09:10

My bad, can you try this instead? Basically we are trying to delete all "R, M, T" executions except the most recent 1000 of each type.

This is what the code is doing now, just really inefficiently for some reason -- and the inefficiency seems to have caused a "backlog" of sorts.

USE [BuildMaster]

BEGIN TRANSACTION

DELETE [Executions]
  FROM [Executions] E,
     (SELECT [Execution_Id], 
             ROW_NUMBER() OVER(PARTITION BY [ExecutionMode_Code] ORDER BY [Execution_Id] DESC) [Row]
       FROM [Executions]
      WHERE [ExecutionMode_Code] IN ('R', 'M', 'T')) EE
WHERE E.[Execution_Id] = EE.[Execution_Id]
  AND EE.[Row] > 1000

SELECT [ExecutionMode_Code], COUNT(*) FROM [Executions] GROUP BY [ExecutionMode_Code]

ROLLBACK

philippe.camelio_3885 · 26 Jan 2021, 17:41

Buildmaster is broken
Time out when I try to get an access to BM.

I reboot unsuccessfully

but the Database is clean

philippe.camelio_3885 · 26 Jan 2021, 18:16

I had to stop SQL Instance, Restart BM Service
Everything looks fine

philippe.camelio_3885 · 29 Jan 2021, 08:07

The request still eating all my CPU :(

atripp · 29 Jan 2021, 09:35

How often is this happening? It shows 102 executions were purged, and based on the I/O there was a lot of logs deleted... this can be actually quite resource-intensive, as there are a lot of log data.

But this usually happen during off-hours, etc., so it shouldn't be disturbing.

philippe.camelio_3885 · 30 Jan 2021, 21:22

@atripp
Unfortunately, it is disturbing.

I join 2 screenshoot about the impact of the delete sql and it is happening during working hours.

I find that deleting several hours of CPU and IO for only 102 purged executions take too long

As I said, the problem exists since release 6.2.22.

I stopped all retention policies,

The problem occurs when the Manual Execution Cleanup is running

==============

I ran again your code:
Before the execution:

After

(255268 rows affected)

Completion time: 2021-01-30T22:06:02.9603752+01:00

SELECT [ExecutionMode_Code], COUNT(*) FROM [Executions] GROUP BY [ExecutionMode_Code]

I set Retention.KeepLastManualExecutionCount to 50.

I will see what is going on

Best Regards
Philippe

PS - Thanks for the help

philippe.camelio_3885 · 30 Jan 2021, 21:29

This is what is happening if I force Manual Execution Cleanup to run.
The stop of the BuildMaster Service solved the problem, but BM is down :(

To be more complete, here is the steps I followed

1 Execution of the delete script
2 Restart buildmaster
4 Start Manual Execution Cleanup ==> Problem occurs
4 Start Manual Execution Cleanup ==> Problem occurs

atripp · 1 Feb 2021, 10:28

If I'm understanding correctly, did your the Manual Execution records go from 1000 to 164,000 in just a few days? If so, that would explain a lot....

These are the types of so-called Manual Executions:

Importing or Exporting Applications
Cloning and Applying Template to Applications
Sync of Issue Sources
Deploying Configuration file
Upgrading Inedo Agents
Sync infrastructure

They are supposed to only occur on a manual basis, like when you trigger something from the UI so you can get logs. Or, in the case of sync infrastructure, whenever infrastructure changes.

Any idea what all the manual executions could be?

philippe.camelio_3885 · 1 Feb 2021, 10:51

@atripp said in Buildmaster - High CPU database since 6.2.22:

Importing or Exporting Applications **==> ** never done this **
Cloning and Applying Template to Applications **==> ** one time a month maximum ****
Sync of Issue Sources ==> Not sure what this is about ???
Deploying Configuration file **==> ** No ****
Upgrading Inedo Agents ==> done by Otter in my case
Sync infrastructure ==> Used to sync every 2 hours from Otter. I increased since the last post to one day

I have about 100-110 servers (Linux / windows) sync from Otter, whereas I am using a just a few for deploying application (about 25 actually).
I have plenty of Role

It could be Sync infrastructure, any way to check ?

atripp · 1 Feb 2021, 11:05

The Execution_Configuration column of the ManualExecutions table will give a clue; it's XML but if you expand the coluumn, you'll see the name of the manual execution.

It's only supposed to log if something changed, however...

If there's a bug, one way to check would be to disable infrastructure sync, for the time being.

philippe.camelio_3885 · 1 Feb 2021, 12:24

@atripp
I disabled the Infrastructure sync.
Then I clean up the table

Table size before

Disable Infrastructure sync
Stop buildmaster
Cleanup

Table size after

Restart buildmaster

I will check tomorrow about the size and let you know

Thanks

philippe.camelio_3885 · 1 Feb 2021, 20:35

I put some new screenshoots

and a lot of this kind of msg

In 10 hours:

ScopedExecutionLogentries x 20 (600k -> 12M)
ScopedExecutionLogs : x 34
ManualExecutions x 164

I set Service.AgentUpdaterThrottle to -1
I also increase to 1 hour in order to reduced log.

I will check tomorrow morning

atripp · 2 Feb 2021, 07:28

Thanks @philippe-camelio_3885

So, the good news is, we've identified the problem. There was just a huge number of manual executions happening, for some reason, and the manual execution purging routine could never catch up. Changing those throttles wouldn't make a difference I'm afraid, as none will trigger a manual execution...

At first, can you please share the results of this query, so we can see what made all those?

SELECT [ExecutionType_Name], COUNT(*) FROM [ManualExecutions] GROUP BY [ExecutionType_Name]

That will tell us what Manual Executions are around, mostly so we can understand what it is. I suspect, infrastructure sync.

That being said... the first thing I'm now seeing is that the report looks old. It's because the number of rows is 164,125, which is the exact same number as from before. So, I'm thinking, actually, you didn't commit the transaction in the query I posted before? It included a ROLLBACK statement as a safety measure... that's my fault, I should have said to only run DELETE if you were satisfied.

Since the query seems okay (it reduced rows from 164K down to 1k), please run this:

DELETE [Executions]
  FROM [Executions] E,
     (SELECT [Execution_Id], 
             ROW_NUMBER() OVER(PARTITION BY [ExecutionMode_Code] ORDER BY [Execution_Id] DESC) [Row]
       FROM [Executions]
      WHERE [ExecutionMode_Code] IN ('R', 'M', 'T')) EE
WHERE E.[Execution_Id] = EE.[Execution_Id]
  AND EE.[Row] > 1000

From here, it should actually be fine...

philippe.camelio_3885 · 2 Feb 2021, 21:52

@atripp

SELECT [ExecutionType_Name], COUNT(*) FROM [ManualExecutions] GROUP BY [ExecutionType_Name]

After my last post I purged again the table and it did not increase today. - no rollback I am pretty sure

I put again Sync Infra to 1 hour and ran the to following SQL

use BuildMaster
SELECT [ExecutionType_Name], COUNT(*) FROM [ManualExecutions] GROUP BY [ExecutionType_Name]

SELECT [ExecutionMode_Code], COUNT(*) FROM [BuildMaster]..[Executions] GROUP BY [ExecutionMode_Code]

we will see tomorrow

anyway thank you for your time :)

atripp · 3 Feb 2021, 01:54

@philippe-camelio_3885 thanks. please keep us in the loop!

As expected, that's a LOT of infrastructure sync executions. I wonder why. Are there frequent variable changes on your servers/roles?

There's probably something off, where it's logging when it shouldn't. We can investigate that another time, but in the meantime, the upcoming optimizations in pruning the manual executions should make this go a lot faster next time.

philippe.camelio_3885 · 3 Feb 2021, 07:08

I created on the average 2 servers a week and plenty of roles ...
I am using Otter for configuration management and script deployment so I need a lot of nested roles for the Linux and Windows servers.
I also add/change frequently role variables.

philippe.camelio_3885 · 7 Feb 2021, 13:55

Hello @atripp

Update ofthe day
Infra sync done every 3600 s

Few change in Otter this week - 12 new VM

I am not sure this the Infra sync the cause of the log ....
anyway, the Delete command worked fine specially after the commit

Cheers

atripp · 8 Feb 2021, 02:43

Thanks for the update; and the upcoming fixes will certainly make it so that purging is much more efficient on the manual execution side, just in case there's an "Explosion" of executions like this.

So right now, my concern is that it's "logging every sync" (once per hour), due to a sort of bug or something. Can you check what's getting logged in Infrastructure Sync? You should be able to see this under Admin > Executions, and see if you can spot a pattern?

No rush. The infra sync executions should clearly show a change history of what was updated on the infra side.