dragonfly icon indicating copy to clipboard operation
dragonfly copied to clipboard

Python test failed: test_replication_timeout_on_full_sync_heartbeat_expiry

Open vyavdoshenko opened this issue 8 months ago • 7 comments

https://github.com/dragonflydb/dragonfly/actions/runs/14570706790/job/40867358098

Error log:

E20250421 09:18:50.380826 21841 rdb_load.cc:2154] Error while calling HandleCompressedBlob(type): Software caused connection abort
E20250421 09:19:49.676060 21841 rdb_load.cc:2154] Error while calling HandleCompressedBlob(type): Software caused connection abort

Warning log:

W20250421 09:18:35.911154 21838 dfly_main.cc:808] Got memory limit 8.00GiB, however only 6.80GiB was found.
E20250421 09:18:50.380826 21841 rdb_load.cc:2154] Error while calling HandleCompressedBlob(type): Software caused connection abort
W20250421 09:18:50.381072 21841 common.cc:413] ReportError: Software caused connection abort: Error loading rdb format
W20250421 09:18:50.381518 21841 replica.cc:240] Error syncing with localhost:43259 generic:103 Software caused connection abort
E20250421 09:19:49.676060 21841 rdb_load.cc:2154] Error while calling HandleCompressedBlob(type): Software caused connection abort
W20250421 09:19:49.676409 21841 common.cc:413] ReportError: Software caused connection abort: Error loading rdb format
W20250421 09:19:49.677003 21841 replica.cc:240] Error syncing with localhost:43259 generic:103 Software caused connection abort
W20250421 09:19:51.641038 21841 common.cc:413] ReportError: Software caused connection abort: Error loading rdb format
W20250421 09:19:51.641430 21841 replica.cc:240] Error syncing with localhost:43259 generic:103 Software caused connection abort
W20250421 09:19:51.751673 21841 common.cc:413] ReportError: Operation canceled: ExecutionState cancelled
W20250421 09:19:52.142002 21841 replica.cc:211] Error connecting to localhost:43259 generic:125

vyavdoshenko avatar Apr 21 '25 09:04 vyavdoshenko

https://github.com/dragonflydb/dragonfly/actions/runs/14624925511/job/41034027496

vyavdoshenko avatar Apr 23 '25 18:04 vyavdoshenko

same error (with same warnings and error logs) seen in https://github.com/dragonflydb/dragonfly/actions/runs/15112449382/job/42474851347#step:6:912

abhijat avatar May 20 '25 06:05 abhijat

same error, same error log https://github.com/dragonflydb/dragonfly/actions/runs/15380909241/job/43271509282#step:6:914

vyavdoshenko avatar Jun 03 '25 05:06 vyavdoshenko

https://github.com/dragonflydb/dragonfly/actions/runs/15511633655/job/43673327321

vyavdoshenko avatar Jun 08 '25 05:06 vyavdoshenko

trying to repro this locally but it seems to always pass

abhijat avatar Jun 18 '25 05:06 abhijat

A few things:

  1. This is epoll test. Do we really care about those for now ? CC @adiholden
  2. Test fails because await check_all_replicas_finished([c_replica], c_master, 60) fails

@abhijat For tests that fail like this I usually either use the CI and edit the workflow file to run the test in the loop or I create a similar aws instance as of our gh runner and run them there. These two work really well 😄

kostasrim avatar Jun 18 '25 06:06 kostasrim

I will try creating an AWS instance.

abhijat avatar Jun 18 '25 06:06 abhijat

https://github.com/dragonflydb/dragonfly/actions/runs/16561377556/job/46831618196

vyavdoshenko avatar Jul 28 '25 07:07 vyavdoshenko

https://github.com/dragonflydb/dragonfly/actions/runs/16591633344/job/46928489694#step:6:880

abhijat avatar Jul 30 '25 03:07 abhijat

trying a repeat of this test in ci

gh workflow run "Repeat Tests" --ref abhijat/chore/repeat-action-improvements -f branch=abhijat/chore/repeat-action-improvements -f count=100 -f expression="-xv dragonfly/replication_test.py -k test_replication_timeout_on_full_sync_heartbeat_expiry" -f timeout=120m -f use_release=yes -f epoll=epoll

abhijat avatar Sep 15 '25 06:09 abhijat

there was something odd going on in the logs for the last failure:

dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24182.log I20250729 09:24:57.121618 24183 dflycmd.cc:392] Started sync with replica 127.0.0.1:40543
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24187.log I20250729 09:24:57.121876 24188 replica.cc:613] Started full sync with localhost:41011
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24187.log I20250729 09:24:57.121892 24188 replica.cc:617] Waiting for all full sync cut confirmations
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24182.log I20250729 09:25:15.090881 24183 accept_server.cc:60] Exiting on signal Terminated

24182 is master and 24187 is the replica. There is an 18 second gap between the last two logs, the last log line is when the replication finished check times out, which is ~20~ 60 seconds.

abhijat avatar Sep 15 '25 06:09 abhijat

Well, we did not reach stable sync (we should see in the LOGs "sync finished in"

kostasrim avatar Sep 15 '25 06:09 kostasrim

also this is epoll right ?

kostasrim avatar Sep 15 '25 06:09 kostasrim

I do not see the sync finished log entry, this is epoll yes. I also see the pausing replication message twice which is odd, in the test we only pause once, unfortunately debug logs are not available

dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24187.log I20250729 09:24:07.834460 24188 replica.cc:181] Pausing replication
dragonfly.0e89ba509a80.root.log.INFO.20250729-092400.24187.log I20250729 09:24:13.839296 24188 replica.cc:181] Pausing replication

abhijat avatar Sep 15 '25 07:09 abhijat

I have run this for several hundred iterations now. latest https://github.com/dragonflydb/dragonfly/actions/runs/17758605373/job/50466118910 500 iterations, cannot see failure anymore. Note that while the last run was using prebuilt binary, with debug build also https://github.com/dragonflydb/dragonfly/actions/runs/17723521321/job/50359858628 there were no errors seen. I will close this one and we can reopen if it reoccurs.

abhijat avatar Sep 16 '25 12:09 abhijat