Tricky heisenbug related to database cleaner
We are experiencing a very rare CI failure related to isolator (approximately once every three days, with hundreds of suites run daily).
Here's what happens: when we run a specific test suite, at some point (usually closer to the end), all tests start failing with an isolation violation error:
Isolator::BackgroundJobError:
You are trying to enqueue background job inside db transaction. In case of transaction failure, this may lead to data inconsistency and unexpected bugs
Details: Site::PurgeCacheJob (#<Site:0x00007f9ecddf8360>, {:variant=>"pages"})
I don't have an MRE due to the nature of the bug, but here's what I found:
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 1, threshold: 2)
[ISOLATOR DEBUG] Transaction opened for connection 31140 (total: 2, threshold: 2)
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 1, threshold: 2)
[ISOLATOR DEBUG] Transaction opened for connection 31140 (total: 2, threshold: 2)
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 1, threshold: 2)
[ISOLATOR DEBUG] Transaction opened for connection 31140 (total: 2, threshold: 2)
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 1, threshold: 2)
[ISOLATOR DEBUG] Threshold value was changed for connection 229240: 0
[ISOLATOR DEBUG] Transaction closed for connection 31140 (total: 0, threshold: 2) !!!!!! <<- threshold - total > 1
example at ./spec/requests/disk/sharings_controller_spec.rb:80 (FAILED - 1)
with two different IP addresses
[ISOLATOR DEBUG] Threshold value was changed for connection 229240: 1
[ISOLATOR DEBUG] Transaction opened for connection 229240 (total: 1, threshold: 1)
[ISOLATOR DEBUG] Transaction opened for connection 229240 (total: 2, threshold: 1)
[ISOLATOR DEBUG] Transaction closed for connection 229240 (total: 1, threshold: 1)
[ISOLATOR DEBUG] Transaction opened for connection 229240 (total: 2, threshold: 1)
At some point, for some reason, the threshold is reduced by one for a wrong connection. This made me suspicious that the issue might be related to database-cleaner support (https://github.com/palkan/isolator/blob/master/lib/isolator/database_cleaner_support.rb). I managed to confirm it by logging connection_id in both start and clean methods -- and under specific conditions they actually differ, leading to inconsistency in the threshold hashmap.
Perhaps we shouldn't rely on object_id here? Or use database-cleaner's approach?
Ruby Version:
3.3.7
Framework Version (Rails, whatever):
rails (8.0.2) database_cleaner (2.1.0) database_cleaner-active_record (2.2.0) database_cleaner-core (2.0.1)
Isolator Version:
isolator (1.1.1)
Looks like when the #clean is called the connection_class.connection already points to a new connection, no the one that was used in the #start.
Is there a test that reset Active Record connections or re-establish the connection?
I think, in Rails 7.2, we should use the "transaction.active_record" instrumentation event to update threshold only for the affected connections during the #start / #clean calls.
Looks like when the #clean is called the connection_class.connection already points to a new connection, no the one that was used in the #start.
Yep, that's definitely true, given that their #__id__ differs.
Is there a test that reset Active Record connections or re-establish the connection?
There are some tests leasing connections with with_connection.
Alright, I took a look at out test suite once again and there was actually a disconnect! call; however, after removal we still encounter these errors
@viralpraxis Let's add this issue on the TODO list for the end of May 😉
BTW, the easiest way to reproduce this issue is to temporary stop and start PostgreSQL again during running (DB-hitting) specs