wally
wally copied to clipboard
2 worker log rotation only rotates 1 worker's log file
Is this a bug, feature request, or feedback?
Bug
What is the current behavior?
Intermittent integration test failure
What is the expected behavior?
Test always passes
What OS and version of Wallaroo are you using?
Ubuntu Xenial 16.04.6 LTS, Wallaroo master branch at commit f748f92efe3d3ee08dc9a454ce6dd66d6a9c56fe
Steps to reproduce?
while [ 1 ]; do make integration-tests-testing-conformance-all debug=true resilience=on PONYCFLAGS="--verbose=1"; if [ $? -ne 0 ]; then echo STOP; exit 1; fi; done
See log files at http://wallaroolabs-dev.s3.amazonaws.com/logs/test-artifacts.a.20190708.tar.gz
If I massage the initializer and worker1 logs to add INIT and WRK1 fields to their logs, then also ignore GlobalConnectorStreamRegistry and ConnectorSource,Sending acks lines, then here's the timeline. Log rotation takes place on worker1 but not on initializer. The test harness never progresses to the later steps of the test case because the expectation that the log rotation step would complete promptly was false.
Note also that the initializer does not initiate new checkpoints after checkpoint #3 had finished and the log rotation started.
1562375768.721872,INIT,|~~ INIT PHASE IV: Cluster is ready to work! ~~|
1562375768.721904,INIT,Sent control message to worker1
1562375768.721911,INIT,initializer reported topology ready!
1562375768.721916,INIT,All 2 workers reporting Topology ready!
1562375768.721954,INIT,Application has successfully initialized.
1562375768.722111,WRK1,|~~ INIT PHASE IV: Cluster is ready to work! ~~|
1562375769.021142,INIT,ExternalChannelConnectNotifier (initializer): server closed
1562375769.032358,WRK1,ExternalChannelConnectNotifier (worker1): server closed
1562375769.046096,INIT,ExternalChannelConnectNotifier (initializer): server closed
1562375769.176012,INIT,INFO,ConnectorSource,Detector source: accepted a connection 0x7f7657e55000
1562375769.723003,INIT,Sent control message to worker1
1562375769.723042,INIT,Initiating barrier protocol for CheckpointBarrierToken(1).
1562375769.723081,INIT,Sent control message to worker1
1562375769.725715,INIT,Checkpoint_Initiator: Event Log CheckpointId 1 complete for worker initializer
1562375769.725754,WRK1,Sent control message to initializer
1562375769.725771,WRK1,Sent control message to initializer
1562375769.725902,INIT,Checkpoint_Initiator: Event Log CheckpointId 1 complete for worker worker1
1562375769.725928,INIT,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(1) Complete
1562375769.726023,INIT,Sent control message to worker1
1562375769.726191,WRK1,Sent control message to initializer
1562375769.726946,INIT,Sent control message to worker1
1562375770.742000,INIT,Sent control message to worker1
1562375770.742020,INIT,Initiating barrier protocol for CheckpointBarrierToken(2).
1562375770.747770,INIT,Sent control message to worker1
1562375770.793775,WRK1,Sent control message to initializer
1562375770.793789,WRK1,Sent control message to initializer
1562375770.797057,INIT,Checkpoint_Initiator: Event Log CheckpointId 2 complete for worker worker1
1562375770.803078,INIT,Checkpoint_Initiator: Event Log CheckpointId 2 complete for worker initializer
1562375770.816201,INIT,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(2) Complete
1562375770.823021,INIT,Sent control message to worker1
1562375770.834917,WRK1,Sent control message to initializer
1562375770.846263,INIT,Sent control message to worker1
1562375771.864440,INIT,Sent control message to worker1
1562375771.864470,INIT,Initiating barrier protocol for CheckpointBarrierToken(3).
1562375771.877879,INIT,Sent control message to worker1
1562375771.964803,WRK1,Sent control message to initializer
1562375771.964818,WRK1,Sent control message to initializer
1562375771.977032,INIT,Checkpoint_Initiator: Event Log CheckpointId 3 complete for worker worker1
1562375771.990213,INIT,Checkpoint_Initiator: Event Log CheckpointId 3 complete for worker initializer
1562375772.038154,INIT,Checkpoint_Initiator: Checkpoint Barrier CheckpointBarrierToken(3) Complete
1562375772.054010,INIT,Sent control message to worker1
1562375772.085572,WRK1,Sent control message to initializer
1562375772.115868,INIT,Sent control message to worker1
1562375772.211507,INIT,ExternalChannelConnectNotifier (initializer): server closed
1562375772.231450,WRK1,ExternalChannelConnectNotifier (worker1): server closed
1562375772.231470,WRK1,Starting event log rotation.
1562375772.231510,WRK1,Sent control message to initializer
1562375772.242760,INIT,Starting event log rotation.
1562375772.245348,INIT,Initiating barrier protocol for LogRotationBarrierToken(worker1->1).
1562375772.245428,INIT,Initiating barrier protocol for LogRotationBarrierToken(initializer->1).
1562375772.258559,INIT,Sent control message to worker1
1562375772.258578,INIT,Sent control message to worker1
1562375772.384603,WRK1,Sent control message to initializer
1562375772.444582,INIT,Sent control message to worker1
1562375772.467723,WRK1,EventLog: Rotating log file.
1562375772.467861,WRK1,EMERGENCY,none,START of rotation: finished writing to /tmp/res-data.wzd07f_t/Multi Partition Detector-worker1-0000000000000000.evlog
1562375772.468755,WRK1,EMERGENCY,none,FileBackend: create: filepath = /tmp/res-data.wzd07f_t/Multi Partition Detector-worker1-0000000000007D89.evlog
1562375772.469244,WRK1,EMERGENCY,none,END of rotation: starting writing to /tmp/res-data.wzd07f_t/Multi Partition Detector-worker1-0000000000007D89.evlog
1562375772.469766,WRK1,EventLog: Rotating log file complete.
1562375772.470939,WRK1,SimpleJournalNoop: dispose_journal
1562375772.471060,WRK1,Sent control message to initializer
1562375773.127091,INIT,Sent control message to worker1
1562375802.192390,INIT,INFO,TCPSink,TCPSink connection closed
This is a bad sequence: the LogRotationResumeBarrierToken doesn't appear in this bad case, but ForwardedInjectBarrierFullyAckedMsg does:
1562619174.718816,INIT,Checkpoint_Initiator: Event Log CheckpointId 3 complete for worker initializer
1562619174.848105,INIT,YO: ExternalMsgDecoder._RotateLog
1562619174.848110,INIT,Received ExternalRotateLogFilesMsg on External Channel
1562619174.861099,WRK1,YO: ExternalMsgDecoder._RotateLog
1562619174.861104,WRK1,Received ExternalRotateLogFilesMsg on External Channel
1562619174.861116,WRK1,Starting event log rotation.
1562619174.861133,WRK1,YO: create ChannelMsg: ForwardInjectBlockingBarrierMsg
1562619174.868901,INIT,Starting event log rotation.
1562619174.873522,INIT,Initiating barrier protocol for LogRotationBarrierToken(worker1->1).
1562619174.873560,INIT,Initiating barrier protocol for LogRotationBarrierToken(initializer->1).
1562619174.969267,INIT,YO: create ChannelMsg: ForwardedInjectBarrierFullyAckedMsg
1562619174.998492,WRK1,EventLog: Rotating log file.
1562619174.998553,WRK1,EMERGENCY,none,START of rotation: finished writing to /tmp/res-data.wgukmh_8/Multi Partition Detector-worker1-0000000000000000.evlog
1562619174.999201,WRK1,EMERGENCY,none,FileBackend: create: filepath = /tmp/res-data.wgukmh_8/Multi Partition Detector-worker1-0000000000007D89.evlog
1562619174.999315,WRK1,EMERGENCY,none,END of rotation: starting writing to /tmp/res-data.wgukmh_8/Multi Partition Detector-worker1-0000000000007D89.evlog
1562619174.999761,WRK1,EventLog: Rotating log file complete.
1562619174.999798,WRK1,YO: create ChannelMsg: ForwardInjectBarrierMsg
1562619204.838838,WRK1,...Removing /tmp/res-data.wgukmh_8/Multi Partition Detector-worker1.evlog...
This is a good sequence:
1562620243.429599,INIT,Checkpoint_Initiator: Event Log CheckpointId 3 complete for worker initializer
1562620243.779193,INIT,YO: ExternalMsgDecoder._RotateLog
1562620243.779198,INIT,Received ExternalRotateLogFilesMsg on External Channel
1562620243.785590,INIT,Starting event log rotation.
1562620243.788272,INIT,Initiating barrier protocol for LogRotationBarrierToken(initializer->1).
1562620243.801342,WRK1,YO: ExternalMsgDecoder._RotateLog
1562620243.801347,WRK1,Received ExternalRotateLogFilesMsg on External Channel
1562620243.806350,WRK1,Starting event log rotation.
1562620243.809208,WRK1,YO: create ChannelMsg: ForwardInjectBlockingBarrierMsg
1562620243.818256,INIT,Initiating barrier protocol for LogRotationBarrierToken(worker1->1).
1562620243.835769,INIT,EventLog: Rotating log file.
1562620243.835828,INIT,EMERGENCY,none,START of rotation: finished writing to /tmp/res-data.s8g9wk2n/multi_partition_nitializer-0000000000000000.evlog
1562620243.836992,INIT,EMERGENCY,none,FileBackend: create: filepath = /tmp/res-data.s8g9wk2n/multi_partition_detector-initializer-0000000000007E5D.evlog
1562620243.837078,INIT,EMERGENCY,none,END of rotation: starting writing to /tmp/res-data.s8g9wk2n/multi_partition_detector-initializer-0000000000007E5D.evlog
1562620243.837353,INIT,EventLog: Rotating log file complete.
1562620243.860118,INIT,YO: create ChannelMsg: ForwardedInjectBarrierFullyAckedMsg
1562620243.870269,WRK1,EventLog: Rotating log file.
1562620243.870325,WRK1,EMERGENCY,none,START of rotation: finished writing to /tmp/res-data.s8g9wk2n/multi_partition_detector-worker1-0000000000000000.evlog
1562620243.871209,WRK1,EMERGENCY,none,FileBackend: create: filepath = /tmp/res-data.s8g9wk2n/multi_partition_detector-worker1-0000000000006ACD.evlog
1562620243.871354,WRK1,EMERGENCY,none,END of rotation: starting writing to /tmp/res-data.s8g9wk2n/multi_partition_detector-worker1-0000000000006ACD.evlog
1562620243.871980,WRK1,EventLog: Rotating log file complete.
1562620243.872535,WRK1,YO: create ChannelMsg: ForwardInjectBarrierMsg
1562620243.875750,INIT,Initiating barrier protocol for LogRotationResumeBarrierToken(worker1->1).
1562620243.909724,INIT,YO: create ChannelMsg: ForwardedInjectBarrierFullyAckedMsg
1562620244.520454,INIT,Checkpoint_Initiator: Event Log CheckpointId 4 complete for worker initializer
This patch "eliminates" the test failures, until the bug can be found & fixed.
diff --git a/testing/tools/integration/cluster.py b/testing/tools/integration/cluster.py
index 174043c72..479db7ae1 100644
--- a/testing/tools/integration/cluster.py
+++ b/testing/tools/integration/cluster.py
@@ -631,6 +631,7 @@ class Cluster(object):
# send a log rotate command directly to each worker's external channel
for w in to_rotate:
send_rotate_command(w.external, w.name)
+ time.sleep(2.0)
# Wait for log rotation watcher to return
wflr.join()
@slfritchie should this be closed now?
No, this ticket should remain open. PR #2960 didn't fix the underlying problem.