Python test failed: test_replication_timeout_on_full_sync_heartbeat_expiry
https://github.com/dragonflydb/dragonfly/actions/runs/14570706790/job/40867358098
Error log:
E20250421 09:18:50.380826 21841 rdb_load.cc:2154] Error while calling HandleCompressedBlob(type): Software caused connection abort
E20250421 09:19:49.676060 21841 rdb_load.cc:2154] Error while calling HandleCompressedBlob(type): Software caused connection abort
Warning log:
W20250421 09:18:35.911154 21838 dfly_main.cc:808] Got memory limit 8.00GiB, however only 6.80GiB was found.
E20250421 09:18:50.380826 21841 rdb_load.cc:2154] Error while calling HandleCompressedBlob(type): Software caused connection abort
W20250421 09:18:50.381072 21841 common.cc:413] ReportError: Software caused connection abort: Error loading rdb format
W20250421 09:18:50.381518 21841 replica.cc:240] Error syncing with localhost:43259 generic:103 Software caused connection abort
E20250421 09:19:49.676060 21841 rdb_load.cc:2154] Error while calling HandleCompressedBlob(type): Software caused connection abort
W20250421 09:19:49.676409 21841 common.cc:413] ReportError: Software caused connection abort: Error loading rdb format
W20250421 09:19:49.677003 21841 replica.cc:240] Error syncing with localhost:43259 generic:103 Software caused connection abort
W20250421 09:19:51.641038 21841 common.cc:413] ReportError: Software caused connection abort: Error loading rdb format
W20250421 09:19:51.641430 21841 replica.cc:240] Error syncing with localhost:43259 generic:103 Software caused connection abort
W20250421 09:19:51.751673 21841 common.cc:413] ReportError: Operation canceled: ExecutionState cancelled
W20250421 09:19:52.142002 21841 replica.cc:211] Error connecting to localhost:43259 generic:125
https://github.com/dragonflydb/dragonfly/actions/runs/14624925511/job/41034027496
same error (with same warnings and error logs) seen in https://github.com/dragonflydb/dragonfly/actions/runs/15112449382/job/42474851347#step:6:912
same error, same error log https://github.com/dragonflydb/dragonfly/actions/runs/15380909241/job/43271509282#step:6:914
https://github.com/dragonflydb/dragonfly/actions/runs/15511633655/job/43673327321
trying to repro this locally but it seems to always pass
A few things:
- This is epoll test. Do we really care about those for now ? CC @adiholden
- Test fails because
await check_all_replicas_finished([c_replica], c_master, 60)fails
@abhijat For tests that fail like this I usually either use the CI and edit the workflow file to run the test in the loop or I create a similar aws instance as of our gh runner and run them there. These two work really well 😄
I will try creating an AWS instance.
https://github.com/dragonflydb/dragonfly/actions/runs/16561377556/job/46831618196
https://github.com/dragonflydb/dragonfly/actions/runs/16591633344/job/46928489694#step:6:880
trying a repeat of this test in ci
gh workflow run "Repeat Tests" --ref abhijat/chore/repeat-action-improvements -f branch=abhijat/chore/repeat-action-improvements -f count=100 -f expression="-xv dragonfly/replication_test.py -k test_replication_timeout_on_full_sync_heartbeat_expiry" -f timeout=120m -f use_release=yes -f epoll=epoll
there was something odd going on in the logs for the last failure:
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24182.log I20250729 09:24:57.121618 24183 dflycmd.cc:392] Started sync with replica 127.0.0.1:40543
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24187.log I20250729 09:24:57.121876 24188 replica.cc:613] Started full sync with localhost:41011
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24187.log I20250729 09:24:57.121892 24188 replica.cc:617] Waiting for all full sync cut confirmations
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24182.log I20250729 09:25:15.090881 24183 accept_server.cc:60] Exiting on signal Terminated
24182 is master and 24187 is the replica. There is an 18 second gap between the last two logs, the last log line is when the replication finished check times out, which is ~20~ 60 seconds.
Well, we did not reach stable sync (we should see in the LOGs "sync finished in"
also this is epoll right ?
I do not see the sync finished log entry, this is epoll yes. I also see the pausing replication message twice which is odd, in the test we only pause once, unfortunately debug logs are not available
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24187.log I20250729 09:24:07.834460 24188 replica.cc:181] Pausing replication
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24187.log I20250729 09:24:13.839296 24188 replica.cc:181] Pausing replication
I have run this for several hundred iterations now. latest https://github.com/dragonflydb/dragonfly/actions/runs/17758605373/job/50466118910 500 iterations, cannot see failure anymore. Note that while the last run was using prebuilt binary, with debug build also https://github.com/dragonflydb/dragonfly/actions/runs/17723521321/job/50359858628 there were no errors seen. I will close this one and we can reopen if it reoccurs.