Issues icon indicating copy to clipboard operation
Issues copied to clipboard

Tasks queued behind other interrupted tasks can run in the wrong order when resolving an interruption.

Open borland opened this issue 2 years ago • 2 comments

Team

  • [X] I've assigned a team label to this issue

Severity

Low. It is unlikely a customer would experience this, and if they did they may not notice

Version

Found in latest git master (nov 2022). Code which causes the bug dates to June 2021 or earlier.

Latest Version

I could reproduce the problem in the latest build

What happened?

If there is an interrupted task A (probably a deployment), with another task B behind it in the queue:

When resolving the interruption for task A, task A is supposed to resume. B should not start until A is complete.

There is a bug in the server that can cause task B to sometimes start instead; Task A does not resume until after B completes.

Reproduction

Pre-requisites:

Create a project which has a manual intervention step in its process (BlockConcurrentDeployments set to true) Configure it so a deployment is possible (create a release, etc)

Repro:

  • Start two deployments; this will create two server tasks, referred to as A and B hereafter
  • wait for task A to reach the manual intervention step

At this point, observe the task queue; you should see that task A is queued(interrupted) and task B is queued(not yet started)

  • resolve the interruption on task A

What should happen:

  • task A resumes and continues on with its deployment process until it reaches completion (whether it passes or fails doesn't matter here)
  • Only once task A completes should task B start. Task B should reach its own interruption state, and we can resolve it as normal.

However, this can occasionally happen instead:

  • rather than task A resuming, task B starts instead.
  • Task B reaches its own interruption

At this point, observe the task queue; task B is queued(interrupted) and task A is queued(stuck behind task A) Teamcity integration test logs suggest that this happens maybe one in every 10-20 times, depending on timing/luck in the server.

NOTE: I've attached a file 7930-repro.cs which contains an integration test which does all of the above that should be useful in reproducing the bug. You'll need to put a sleep in SubmitInterruptionResponder to trigger it; see below.

More Information

@droyad and I did some investigation, and believe we have identified the source of the problem

There are two actors (threads) here:

Thread 1: Resolving the interruption via a call to /api/interruptions/submit Thread 2: Task queue looking for the next piece of work to do.

Note: As at time of writing, thread 1 is the SubmitInterruptionResponder nancy responder. This is going to ported to MVC soon so it will change. The behaviour itself is not in the responder, it's just the thing that triggers it, so the port won't alter this.

Thread 1 creates a transaction, in which it

  • does an ordinary write to the Interruption table; updating a row to indicate the interruption has been resolved
  • writes an audit event
  • does a write to the ServerTask table, updating the task.HasPendingInterruptions field to false ***
  • checks for linked interruptions, and updates those (and any associated tasks) too
  • commits the transaction. +++

This is interesting because there is now a time window (starting at *** and ending at +++) where the "committed" state of the server task is HasPendingInterruptions=true, but if another thread were to do a database "dirty read" with NOLOCK or similar, it would observe HasPendingInterruptions=false

Thread 2

Inside TaskQueue.TryPopTask

  • First this calls GetActiveConcurrencyTags. GetActiveConcurrencyTags uses a custom SQL query with NOLOCK, so it can observe this "dirty" state where HasPendingInterruptions is different.

  • The next part of TryPopTask is where it attempts to read a batch of potential tasks with another custom SQL query. This query takes into account concurrency tags, and when things work correctly, this causes task B to be excluded from the candidate set of tasks. However, when GetActiveConcurrencyTags lookup happens to run in the time window where it observes the uncommitted transaction from thread 1, then its concurrency tag is not evaluated. Task B is incorrectly calculated to be the next candidate to run, and so the task queue starts it.

Reproduction

We can test this hypothesis by inserting a Sleep in SubmitInterruptionResponder, causing it to hold its transaction open for a long period of time. This increases the time window where GetActiveConcurrencyTags can observe a dirty value and makes this bug reliably reproducible.

In SubmitInterruptionResponder in the ExecuteRegistered function, on around 103 you should see transaction.Commit(). Just above this, insert a new line with Thread.Sleep(TimeSpan.FromSeconds(15)).

This should now cause the integration test in 7930-repro.cs to fail reliably. The test will normally pass in around 20-30 seconds. When the bug triggers it will instead fail with a timeout error after approx 4 minutes.

borland avatar Nov 27 '22 20:11 borland

7930-repro.zip

borland avatar Nov 27 '22 20:11 borland

Thoughts from Rob and I: The NOLOCK is the cause of the bug, and it does seem odd to be doing dirty reads in an area like the Task Queue where correctness is important.

We couldn't see any note or reason as to why the NOLOCK was put there though.

Usually things like this happen because people are attempting to fix database deadlocks, or perhaps for performance reasons; so please tread carefully when fixing this bug.

borland avatar Nov 27 '22 20:11 borland