isolator icon indicating copy to clipboard operation
isolator copied to clipboard

Tricky heisenbug related to database cleaner

Open viralpraxis opened this issue 9 months ago • 5 comments

We are experiencing a very rare CI failure related to isolator (approximately once every three days, with hundreds of suites run daily).

Here's what happens: when we run a specific test suite, at some point (usually closer to the end), all tests start failing with an isolation violation error:

Isolator::BackgroundJobError:
          You are trying to enqueue background job inside db transaction. In case of transaction failure, this may lead to data inconsistency and unexpected bugs
          Details: Site::PurgeCacheJob (#<Site:0x00007f9ecddf8360>, {:variant=>"pages"})

I don't have an MRE due to the nature of the bug, but here's what I found:

[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 1, threshold: 2)
[ISOLATOR DEBUG] Transaction opened for connection 31140 (total: 2, threshold: 2)
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 1, threshold: 2)
[ISOLATOR DEBUG] Transaction opened for connection 31140 (total: 2, threshold: 2)
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 1, threshold: 2)
[ISOLATOR DEBUG] Transaction opened for connection 31140 (total: 2, threshold: 2)
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 1, threshold: 2)
[ISOLATOR DEBUG] Threshold value was changed for connection 229240: 0
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 0, threshold: 2) !!!!!! <<- threshold - total > 1
        example at ./spec/requests/disk/sharings_controller_spec.rb:80 (FAILED - 1)
      with two different IP addresses
[ISOLATOR DEBUG] Threshold value was changed for connection 229240: 1
[ISOLATOR DEBUG] Transaction opened for connection 229240 (total: 1, threshold: 1)
[ISOLATOR DEBUG] Transaction opened for connection 229240 (total: 2, threshold: 1)
[ISOLATOR DEBUG] Transaction closed for connection 229240 (total: 1, threshold: 1)
[ISOLATOR DEBUG] Transaction opened for connection 229240 (total: 2, threshold: 1)

At some point, for some reason, the threshold is reduced by one for a wrong connection. This made me suspicious that the issue might be related to database-cleaner support (https://github.com/palkan/isolator/blob/master/lib/isolator/database_cleaner_support.rb). I managed to confirm it by logging connection_id in both start and clean methods -- and under specific conditions they actually differ, leading to inconsistency in the threshold hashmap.

Perhaps we shouldn't rely on object_id here? Or use database-cleaner's approach?

Ruby Version:

3.3.7

Framework Version (Rails, whatever):

rails (8.0.2) database_cleaner (2.1.0) database_cleaner-active_record (2.2.0) database_cleaner-core (2.0.1)

Isolator Version:

isolator (1.1.1)

viralpraxis avatar Mar 18 '25 23:03 viralpraxis

Looks like when the #clean is called the connection_class.connection already points to a new connection, no the one that was used in the #start.

Is there a test that reset Active Record connections or re-establish the connection?

I think, in Rails 7.2, we should use the "transaction.active_record" instrumentation event to update threshold only for the affected connections during the #start / #clean calls.

palkan avatar Mar 19 '25 00:03 palkan

Looks like when the #clean is called the connection_class.connection already points to a new connection, no the one that was used in the #start.

Yep, that's definitely true, given that their #__id__ differs.

Is there a test that reset Active Record connections or re-establish the connection?

There are some tests leasing connections with with_connection.

viralpraxis avatar Mar 19 '25 00:03 viralpraxis

Alright, I took a look at out test suite once again and there was actually a disconnect! call; however, after removal we still encounter these errors

viralpraxis avatar Mar 24 '25 15:03 viralpraxis

@viralpraxis Let's add this issue on the TODO list for the end of May 😉

palkan avatar May 09 '25 18:05 palkan

BTW, the easiest way to reproduce this issue is to temporary stop and start PostgreSQL again during running (DB-hitting) specs

viralpraxis avatar Jun 03 '25 18:06 viralpraxis