Worker joining never completes migration
Is this a bug, feature request, or feedback?
Bug.
What is the current behavior?
See https://circleci.com/gh/WallarooLabs/wallaroo/18867
- Worker joins
- Worker never completes migration
- Data never resumes
- Test fails because "Sender failed to complete"
What is the expected behavior?
Worker should complete migration, source should resume, and sender should complete.
What OS and version of Wallaroo are you using?
wallaroolabs/wallaroo-ci:2019.04.02.1
Steps to reproduce?
make integration-tests-testing-correctness-tests-autoscale debug=true pytest_exp='-k test_autoscale_pony_alo_1_Grow1'
Additional comments
This is possibly a confluence of two bugs:
- The worker did not complete migration
- The signal used by the test harness to determine that migration is complete gave a false positive
This test should have failed in the migration completion test, rather than by waiting on the sender.
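One way to make the harness fail at the right step is to wait on two independent signals, each with its own deadline, before starting to wait on the sender. The sketch below is hypothetical and does not use the real Wallaroo test harness API: `wait_for`, `assert_migration_complete`, `MigrationTimeout`, and the two predicate arguments are assumed names for illustration only.

```python
import time


class MigrationTimeout(AssertionError):
    """Raised when an expected condition is not observed before the deadline."""


def wait_for(predicate, description, timeout=30.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise MigrationTimeout(
                "timed out after %ss waiting for: %s" % (timeout, description)
            )
        time.sleep(interval)


def assert_migration_complete(join_finished, source_resumed,
                              timeout=30.0, interval=0.5):
    """Require both signals, so one false positive cannot mask a stalled migration.

    `join_finished` and `source_resumed` are hypothetical zero-argument
    callables supplied by the test (for example, checks against worker logs).
    """
    wait_for(join_finished, "joining worker to finish migration",
             timeout, interval)
    wait_for(source_resumed, "source to resume sending data",
             timeout, interval)
```

With a check like this, a run where the join signal fires but data never resumes would fail with a message naming the source, instead of the later "Sender failed to complete".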
Depends on #2870
I believe I've seen this problem again, this time in tests/system_events.py::test_autoscale_python2_MultiPartitionDetector_alo_1_Shrink1_Wait2_Shrink4 rather than the simpler Pony Grow1 test. The test hangs after the second shrink.
See the test log details in http://wallaroolabs-dev.s3.amazonaws.com/logs/test-artifacts.c.20190710.tar.gz. After the INFO,ConnectorSource,Successfully removed 6834946543119556632 from _active message, the initializer keeps trying to reconnect its outgoing boundaries to worker1 through worker5, without success (because they're all stopped!), until the test times out.