Issues
Issues copied to clipboard
SQL timeouts when searching events, publishing or loading a runbook snapshot - Execution Timeout Expired
Prerequisites
- [x] I have verified the problem exists in the latest version
- [x] I have searched open and closed issues to make sure it isn't already reported
- [x] I have written a descriptive issue title
- [x] I have linked the original source of this report
- [x] I have labelled the value stream (area/core, area/steps, ...)
- [x] I have tagged the issue appropriately (kind/bug, kind/enhancement, feature/ ...)
The bug
An Octopus instance with many events can experience SQL timeouts when searching for events in particular ways, like by user, project, environment, etc.
Steps to reproduce
This requires an Octopus instance with many millions of events.
- Go to Audit under Configuration
- Add filters for last 30 days and a user
- See error
Screen capture

Exception occurred while executing a reader for `SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
FROM [dbo].[Event]
WHERE ((([SpaceId] = 'Spaces-1') OR ([SpaceId] = 'Spaces-2') OR ([SpaceId] = 'Spaces-3')))
AND ([Occurred] >= @from)
AND ([Occurred] < @to)
AND (([UserId] in (@userids_1) OR [Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId in (@userids_1))))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]`
SQL Error -2 - Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Unknown error 258
Affected versions
Octopus Server: 2021.1 onwards
Workarounds for on-premise customers
You need to clear the cached execution plans:
DBCC FREEPROCCACHE
Workarounds for Cloud (Azure SQL) customers
There are a few things that can improve the performance of searching events:
- Add more compute power to the SQL Server instance, if possible
- Rebuild the indexes on the
EventRelatedDocumenttables, see example script below - Re-running the query several times can allow the query to complete within the timeout limit due to internal caching done by SQL Server
See detailed steps here.
Rebuild EventRelatedDocument indexes
ALTER INDEX [IX_EventRelatedDocument_RelatedDocumentId] ON [dbo].[EventRelatedDocument] REBUILD;
ALTER INDEX [IX_EventRelatedDocument_EventId] ON [dbo].[EventRelatedDocument] REBUILD;
Note: this can take a long time to execute, depending on the number of events.
Links
Internal report: https://trello.com/c/faJFrpEl/37-sql-execution-timeout-on-deploy
Customer reported experiencing this issue (internal ticket): https://octopus.zendesk.com/agent/tickets/80435
Another customer experiencing this issue: https://octopus.zendesk.com/agent/tickets/81992
Proof that SQL could have chosen a better existing index:
- Not specifying an index times out at 30s: https://cloud.octofront.com/background-tasks/BackgroundTasks-1843049
- Specifying an index finishes in under 1s: https://cloud.octofront.com/background-tasks/BackgroundTasks-1843052
- Not specifying an index but just adding a new comment so SQL will compile a new plan, under 1s: https://cloud.octofront.com/background-tasks/BackgroundTasks-1844526
Specifying an index in a query is an anti-pattern so we need to figure out what's causing SQL to make this poor decision.
Another thing to optimize: the endpoint below generates 2 queries. https://capptivate.octopus.app/api/events?regarding=RunbookSnapshots-8567&spaces=Spaces-1&includeSystem=true
- Counts the number of events.
- Grabs the data.
We should consider only grabbing the data until we can't find more data. We should not need to count. We could cut this to a single query instead of 2. In addition, COUNT is not cheap.
List of queries from a single endpoint request:
-- this is taking 2s
exec sp_executesql N'SELECT COUNT(*)
FROM [dbo].[Event]
WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))',N'@regarding0 nvarchar(100)',@regarding0=N'RunbookSnapshots-8567'
exec sp_executesql N'SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
FROM [dbo].[Event]
WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-8567',@_minrow=1,@_maxrow=30
Release Note: Fixed SQL timeout on the api/events endpoint.
:tada: The fix for this issue has been released in:
| Release stream | Release |
|---|---|
| 2021.3 | 2021.3.12042 |
| 2022.1 | 2022.1.359 |
We are still experiencing this issue (internal): https://octopusdeploy.slack.com/archives/C01HZFJRYSH/p1643844703973029
Additional Fixes supplied for the 2022.1 version from 2022.1.1286 onwards. This was not applied to the 2021.3 range
| Release stream | Release |
|---|---|
| 2022.2 | 2022.2.0 |
| 2022.1 | 2022.1.1286 |
Hi @zentron. I would like to keep this issue open for now since my last PR has a feature toggle that is not available to customers yet.
For now please rebuild the indexes as described in the workaround section.
Another affected customer according to this internal thread
It might be worth trying the Automatic plan correction.
The affected indexes listed above may be different depending on the version.
In 2023.x versions the index IX_EventRelatedDocument_RelatedDocumentIdPrefix may also be affected and the IX_EventRelatedDocument_EventId index no longer exists.
We are experiencing the same timeouts when loading a runbook snapshot (running Octopus version 2023.2 (Build 12998). I don't think this is an index issue. This is due to the way the select statement is written and how SQL server engine changes the execution plan when data volumes increase significantly. When executing the select statement directly it now takes over 1 minute to complete. By changing the select statement to either imbedding the select statement from the EventRelatedDocument table first (either in a temp table or table variable), the select statement runs in under 110 milliseconds (temp table runs in 50ms, table variable runs in 110ms). e.g.
exec sp_executesql N'DECLARE @EventRelatedDocument TABLE (EventId NVARCHAR(50), RelatedDocumentId NVARCHAR(250))
INSERT INTO @EventRelatedDocument SELECT EventId, RelatedDocumentId from EventRelatedDocument where RelatedDocumentId = @regarding0
SELECT * FROM ( SELECT E.Id,RelatedDocumentIds,ProjectId,EnvironmentId,Category,UserId,Username,Occurred,Message,TenantId,AutoId,DataVersion,UserAgent,SpaceId,JSONBlob,ChangeDetails,IpAddress,JSON, ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum FROM [dbo].[Event] E JOIN @EventRelatedDocument ER ON E.Id = ER.EventId WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null)))) and E.Id = ER.EventId ) ALIAS_GENERATED_1 WHERE ([RowNum] >= @_minrow) AND ([RowNum] <= @_maxrow) ORDER BY [RowNum];',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;
Further analysis and testing has provided another solution and that is to eliminate the key lookups caused by the index not having all columns in the INCLUDE. They way to do this is to join the Event table to itself and select the columns from the joined table. Tests have shown this to resolve the long duration of queries hitting runbook snapshots with large volumes of data.
exec sp_executesql N'SELECT * FROM ( SELECT E2.Id,E2.RelatedDocumentIds,E2.ProjectId,E2.EnvironmentId,E2.Category,E2.UserId,E2.Username,E2.Occurred,E2.Message, E2.TenantId,E2.AutoId,E2.DataVersion,E2.UserAgent,E2.SpaceId,E2.JSONBlob,E2.ChangeDetails,E2.IpAddress,E2.JSON, ROW_NUMBER() OVER (ORDER BY E2.[Occurred] DESC, E2.[AutoId] DESC) AS RowNum FROM [dbo].[Event] E JOIN [dbo].[Event] E2 ON E.Id = E2.Id WHERE ((((E.[SpaceId] in (''Spaces-1''))) OR ((E.[SpaceId] is null)))) AND (E.[Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0)) ) ALIAS_GENERATED_1 WHERE ([RowNum] >= @_minrow) AND ([RowNum] <= @_maxrow) ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;
@mildmanneredjon when the query was taking more than 1 minute was that always occurring or was it only occurring when mssql happens to chose a bad plan.
@LukeButters it was only occurring when a bad plan was chosen