Issues icon indicating copy to clipboard operation
Issues copied to clipboard

SQL timeouts when searching events, publishing or loading a runbook snapshot - Execution Timeout Expired

Open adam-mccoy opened this issue 4 years ago • 14 comments

Prerequisites

  • [x] I have verified the problem exists in the latest version
  • [x] I have searched open and closed issues to make sure it isn't already reported
  • [x] I have written a descriptive issue title
  • [x] I have linked the original source of this report
  • [x] I have labelled the value stream (area/core, area/steps, ...)
  • [x] I have tagged the issue appropriately (kind/bug, kind/enhancement, feature/ ...)

The bug

An Octopus instance with many events can experience SQL timeouts when searching for events in particular ways, like by user, project, environment, etc.

Steps to reproduce

This requires an Octopus instance with many millions of events.

  1. Go to Audit under Configuration
  2. Add filters for last 30 days and a user
  3. See error

Screen capture

image

Exception occurred while executing a reader for `SELECT *
FROM (
    SELECT *,
    ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
    FROM [dbo].[Event]
    WHERE ((([SpaceId] = 'Spaces-1') OR ([SpaceId] = 'Spaces-2') OR ([SpaceId] = 'Spaces-3')))
    AND ([Occurred] >= @from)
    AND ([Occurred] < @to)
    AND (([UserId] in (@userids_1) OR [Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId in (@userids_1))))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]`
SQL Error -2 - Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Unknown error 258

Affected versions

Octopus Server: 2021.1 onwards

Workarounds for on-premise customers

You need to clear the cached execution plans:

DBCC FREEPROCCACHE

Workarounds for Cloud (Azure SQL) customers

There are a few things that can improve the performance of searching events:

  • Add more compute power to the SQL Server instance, if possible
  • Rebuild the indexes on the EventRelatedDocument tables, see example script below
  • Re-running the query several times can allow the query to complete within the timeout limit due to internal caching done by SQL Server

See detailed steps here.

Rebuild EventRelatedDocument indexes

ALTER INDEX [IX_EventRelatedDocument_RelatedDocumentId] ON [dbo].[EventRelatedDocument] REBUILD;
ALTER INDEX [IX_EventRelatedDocument_EventId] ON [dbo].[EventRelatedDocument] REBUILD;

Note: this can take a long time to execute, depending on the number of events.

Links

Internal report: https://trello.com/c/faJFrpEl/37-sql-execution-timeout-on-deploy

adam-mccoy avatar Aug 09 '21 00:08 adam-mccoy

Customer reported experiencing this issue (internal ticket): https://octopus.zendesk.com/agent/tickets/80435

N-lson avatar Dec 09 '21 02:12 N-lson

Another customer experiencing this issue: https://octopus.zendesk.com/agent/tickets/81992

Proof that SQL could have chosen a better existing index:

  • Not specifying an index times out at 30s: https://cloud.octofront.com/background-tasks/BackgroundTasks-1843049
  • Specifying an index finishes in under 1s: https://cloud.octofront.com/background-tasks/BackgroundTasks-1843052
  • Not specifying an index but just adding a new comment so SQL will compile a new plan, under 1s: https://cloud.octofront.com/background-tasks/BackgroundTasks-1844526

Specifying an index in a query is an anti-pattern so we need to figure out what's causing SQL to make this poor decision.

Another thing to optimize: the endpoint below generates 2 queries. https://capptivate.octopus.app/api/events?regarding=RunbookSnapshots-8567&spaces=Spaces-1&includeSystem=true

  1. Counts the number of events.
  2. Grabs the data.

We should consider only grabbing the data until we can't find more data. We should not need to count. We could cut this to a single query instead of 2. In addition, COUNT is not cheap.

List of queries from a single endpoint request:

-- this is taking 2s
exec sp_executesql N'SELECT COUNT(*)
FROM [dbo].[Event]
WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))',N'@regarding0 nvarchar(100)',@regarding0=N'RunbookSnapshots-8567'
exec sp_executesql N'SELECT *
FROM (
    SELECT *,
    ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
    FROM [dbo].[Event]
    WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
    AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-8567',@_minrow=1,@_maxrow=30

akirayamamoto avatar Jan 14 '22 03:01 akirayamamoto

Release Note: Fixed SQL timeout on the api/events endpoint.

octoreleasebot avatar Feb 02 '22 05:02 octoreleasebot

:tada: The fix for this issue has been released in:

Release stream Release
2021.3 2021.3.12042
2022.1 2022.1.359

Octobob avatar Feb 02 '22 18:02 Octobob

We are still experiencing this issue (internal): https://octopusdeploy.slack.com/archives/C01HZFJRYSH/p1643844703973029

akirayamamoto avatar Feb 03 '22 03:02 akirayamamoto

Additional Fixes supplied for the 2022.1 version from 2022.1.1286 onwards. This was not applied to the 2021.3 range

Release stream Release
2022.2 2022.2.0
2022.1 2022.1.1286

zentron avatar Feb 22 '22 04:02 zentron

Hi @zentron. I would like to keep this issue open for now since my last PR has a feature toggle that is not available to customers yet.

For now please rebuild the indexes as described in the workaround section.

akirayamamoto avatar Feb 22 '22 07:02 akirayamamoto

It might be worth trying the Automatic plan correction.

akirayamamoto avatar Jul 15 '22 13:07 akirayamamoto

The affected indexes listed above may be different depending on the version. In 2023.x versions the index IX_EventRelatedDocument_RelatedDocumentIdPrefix may also be affected and the IX_EventRelatedDocument_EventId index no longer exists.

benPearce1 avatar Jul 14 '23 04:07 benPearce1

We are experiencing the same timeouts when loading a runbook snapshot (running Octopus version 2023.2 (Build 12998). I don't think this is an index issue. This is due to the way the select statement is written and how SQL server engine changes the execution plan when data volumes increase significantly. When executing the select statement directly it now takes over 1 minute to complete. By changing the select statement to either imbedding the select statement from the EventRelatedDocument table first (either in a temp table or table variable), the select statement runs in under 110 milliseconds (temp table runs in 50ms, table variable runs in 110ms). e.g.

exec sp_executesql N'DECLARE @EventRelatedDocument TABLE (EventId NVARCHAR(50), RelatedDocumentId NVARCHAR(250))

INSERT INTO @EventRelatedDocument SELECT EventId, RelatedDocumentId from EventRelatedDocument where RelatedDocumentId = @regarding0

SELECT * FROM ( SELECT E.Id,RelatedDocumentIds,ProjectId,EnvironmentId,Category,UserId,Username,Occurred,Message,TenantId,AutoId,DataVersion,UserAgent,SpaceId,JSONBlob,ChangeDetails,IpAddress,JSON, ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum FROM [dbo].[Event] E JOIN @EventRelatedDocument ER ON E.Id = ER.EventId WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null)))) and E.Id = ER.EventId ) ALIAS_GENERATED_1 WHERE ([RowNum] >= @_minrow) AND ([RowNum] <= @_maxrow) ORDER BY [RowNum];',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;

mildmanneredjon avatar Apr 18 '24 07:04 mildmanneredjon

Further analysis and testing has provided another solution and that is to eliminate the key lookups caused by the index not having all columns in the INCLUDE. They way to do this is to join the Event table to itself and select the columns from the joined table. Tests have shown this to resolve the long duration of queries hitting runbook snapshots with large volumes of data.

exec sp_executesql N'SELECT * FROM ( SELECT E2.Id,E2.RelatedDocumentIds,E2.ProjectId,E2.EnvironmentId,E2.Category,E2.UserId,E2.Username,E2.Occurred,E2.Message, E2.TenantId,E2.AutoId,E2.DataVersion,E2.UserAgent,E2.SpaceId,E2.JSONBlob,E2.ChangeDetails,E2.IpAddress,E2.JSON, ROW_NUMBER() OVER (ORDER BY E2.[Occurred] DESC, E2.[AutoId] DESC) AS RowNum FROM [dbo].[Event] E JOIN [dbo].[Event] E2 ON E.Id = E2.Id WHERE ((((E.[SpaceId] in (''Spaces-1''))) OR ((E.[SpaceId] is null)))) AND (E.[Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0)) ) ALIAS_GENERATED_1 WHERE ([RowNum] >= @_minrow) AND ([RowNum] <= @_maxrow) ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;

mildmanneredjon avatar May 02 '24 11:05 mildmanneredjon

@mildmanneredjon when the query was taking more than 1 minute was that always occurring or was it only occurring when mssql happens to chose a bad plan.

LukeButters avatar Jun 04 '24 06:06 LukeButters

@LukeButters it was only occurring when a bad plan was chosen

mildmanneredjon avatar Jun 04 '24 10:06 mildmanneredjon