quickfixj icon indicating copy to clipboard operation
quickfixj copied to clipboard

analyze flaky TimerTest

Open chrjohn opened this issue 4 years ago • 6 comments
trafficstars

With additional logging we can see that there is no heartbeat sent for three seconds. Added generation of thread dump after two seconds...

<20201217-13:18:19, FIX.4.4:ISLD->TW, event> (Accepting session FIX.4.4:ISLD->TW from /127.0.0.1:50618)
<20201217-13:18:19, FIX.4.4:ISLD->TW, event> (Acceptor heartbeat set to 2 seconds)
<20201217-13:18:19, FIX.4.4:ISLD->TW, event> (Received logon)
<20201217-13:18:19, FIX.4.4:ISLD->TW, event> (Responding to Logon request)
<20201217-13:18:19, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.49=6035=A34=149=ISLD52=20201217-13:18:19.87156=TW98=0108=210=168)
<20201217-13:18:19, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.49=6035=A34=149=ISLD52=20201217-13:18:19.87156=TW98=0108=210=168)
<20201217-13:18:19, FIX.4.4:TW->ISLD, event> (Received logon)
<20201217-13:18:22, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.49=5835=134=249=ISLD52=20201217-13:18:22.94856=TW112=TEST10=201)
<20201217-13:18:22, FIX.4.4:ISLD->TW, event> (Sent test request TEST)

chrjohn avatar Dec 17 '20 13:12 chrjohn

Error:  Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.991 s <<< FAILURE! - in quickfix.SocketAcceptorTest
Error:  testQuickRestartOfAcceptor(quickfix.SocketAcceptorTest)  Time elapsed: 3.455 s  <<< FAILURE!
java.lang.AssertionError: Exactly 1 'QFJ Message Processor' thread(s) expected expected:<1> but was:<2>

chrjohn avatar Dec 28 '20 23:12 chrjohn

Current findings:

  • IIRC I only observed this on MacOS.
  • Strange thing is that no Heartbeat is generated for either client or server but the TestRequest because of the missing Heartbeat is generated on client and server although both are triggered by the QFJ Timer thread.
2020-12-29T00:51:00.6306730Z <20201229-00:51:00, FIX.4.4:ISLD->TW, event> (Accepting session FIX.4.4:ISLD->TW from /127.0.0.1:50680)
2020-12-29T00:51:00.6307980Z <20201229-00:51:00, FIX.4.4:ISLD->TW, event> (Acceptor heartbeat set to 2 seconds)
2020-12-29T00:51:00.6309220Z <20201229-00:51:00, FIX.4.4:ISLD->TW, event> (Received logon)
2020-12-29T00:51:00.6310180Z <20201229-00:51:00, FIX.4.4:ISLD->TW, event> (Responding to Logon request)
2020-12-29T00:51:00.6311230Z <20201229-00:51:00, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.49=6035=A34=149=ISLD52=20201229-00:51:00.28456=TW98=0108=210=152)
2020-12-29T00:51:00.6312240Z <20201229-00:51:00, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.49=6035=A34=149=ISLD52=20201229-00:51:00.28456=TW98=0108=210=152)
2020-12-29T00:51:00.6313170Z <20201229-00:51:00, FIX.4.4:TW->ISLD, event> (Received logon)

# nothing happening here.... for three seconds
# QFJ Timer should trigger generation of Heartbeat after 2 seconds but it doesn't
# but TestRequest after about 3 seconds is generated

2020-12-29T00:51:03.3023960Z <20201229-00:51:03, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.49=5835=134=249=ISLD52=20201229-00:51:03.30156=TW112=TEST10=179)
2020-12-29T00:51:03.3071120Z <20201229-00:51:03, FIX.4.4:ISLD->TW, event> (Sent test request TEST)
2020-12-29T00:51:03.3085660Z <20201229-00:51:03, FIX.4.4:TW->ISLD, outgoing> (8=FIX.4.49=5835=134=249=TW52=20201229-00:51:03.30156=ISLD112=TEST10=179)
2020-12-29T00:51:03.3089020Z <20201229-00:51:03, FIX.4.4:TW->ISLD, event> (Sent test request TEST)

Edit: sounds a little like https://www.quickfixj.org/jira/browse/QFJ-448 where Heartbeats are occassionally missing, especially with low Heartbeat intervals...

chrjohn avatar Dec 30 '20 23:12 chrjohn

After some more failures it seems indeed to happen on MacOS only. The timing seems to be a bit off, causing the heartbeat check to return false (normally isHeartBeatNeeded() should be called once per second):

2020-12-31T22:09:00.0061350Z <20201231-22:09:00, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.49=6035=A34=149=ISLD52=20201231-22:09:00.00556=TW98=0108=210=143)
2020-12-31T22:09:00.0062670Z XXX FIX.4.4:ISLD->TW isHeartBeatNeeded() = false
2020-12-31T22:09:00.0063950Z <20201231-22:09:00, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.49=6035=A34=149=ISLD52=20201231-22:09:00.00556=TW98=0108=210=143)
2020-12-31T22:09:00.0065060Z <20201231-22:09:00, FIX.4.4:TW->ISLD, event> (Received logon)
2020-12-31T22:09:00.0066220Z XXX FIX.4.4:TW->ISLD isHeartBeatNeeded() = false
# more than one second pause
2020-12-31T22:09:01.0259920Z XXX FIX.4.4:ISLD->TW isHeartBeatNeeded() = false
2020-12-31T22:09:01.0261070Z XXX FIX.4.4:TW->ISLD isHeartBeatNeeded() = false
# less than one second pause
2020-12-31T22:09:01.9287190Z XXX FIX.4.4:ISLD->TW isHeartBeatNeeded() = false
2020-12-31T22:09:01.9288290Z XXX FIX.4.4:TW->ISLD isHeartBeatNeeded() = false
# more than one second pause
2020-12-31T22:09:03.0341840Z <20201231-22:09:03, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.49=5835=134=249=ISLD52=20201231-22:09:03.02956=TW112=TEST10=186)
2020-12-31T22:09:03.0343040Z <20201231-22:09:03, FIX.4.4:ISLD->TW, event> (Sent test request TEST)
2020-12-31T22:09:03.0344100Z <20201231-22:09:03, FIX.4.4:TW->ISLD, outgoing> (8=FIX.4.49=5835=134=249=TW52=20201231-22:09:03.03056=ISLD112=TEST10=178)
2020-12-31T22:09:03.0345080Z <20201231-22:09:03, FIX.4.4:TW->ISLD, event> (Sent test request TEST)
2020-12-31T22:09:03.0346140Z <20201231-22:09:03, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.49=6135=M34=349=ISLD52=20201231-22:09:03.03056=TW66=somelist10=209)
2020-12-31T22:09:03.0347180Z <20201231-22:09:03, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.49=5835=134=249=ISLD52=20201231-22:09:03.02956=TW112=TEST10=186)
2020-12-31T22:09:03.0348230Z <20201231-22:09:03, FIX.4.4:ISLD->TW, incoming> (8=FIX.4.49=5835=134=249=TW52=20201231-22:09:03.03056=ISLD112=TEST10=178)
2020-12-31T22:09:03.0349280Z <20201231-22:09:03, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.49=6135=M34=349=ISLD52=20201231-22:09:03.03056=TW66=somelist10=209)
2020-12-31T22:09:03.0350330Z <20201231-22:09:03, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.49=5835=034=449=ISLD52=20201231-22:09:03.03156=TW112=TEST10=180)
2020-12-31T22:09:03.0351280Z XXX FIX.4.4:ISLD->TW isHeartBeatNeeded() = false

Maybe we should just ignore it, otherwise I am asking myself why it only happens on MacOS.

chrjohn avatar Jan 18 '21 13:01 chrjohn

How often does it happen? I tried to recreate it, but I modified the YAML script slightly to run "TimerTest" only and after 5 attempts it passed every time on MacOS Java 8.

Either it takes more attempts to fail or it makes me think that some other test case does not properly close/recycle a resource on MacOS and interferes with this test.

the-thing avatar Jan 19 '21 10:01 the-thing

In the complete test suite I think it failed about every two or three runs. I can imagine that it has to do something with resources but to me it looks like some kind of OS-resource since the timer seems to fire very unreliably as if there is a higher CPU load.

chrjohn avatar Jan 19 '21 11:01 chrjohn

The test uses a java.util.Timer which is sensitive to system clock changes (which seem to happen from time to time on virtualized systems). Will test if converting to use ScheduledExecutorService fixes the issues.

References: https://stackoverflow.com/questions/409932/java-timer-vs-executorservice https://stackoverflow.com/questions/17588167/what-should-timertask-scheduleatfixedrate-do-if-the-clock-changes https://stackoverflow.com/questions/18803695/how-is-the-timer-class-in-java-sensitive-to-the-system-clock https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4290274

chrjohn avatar Apr 18 '21 23:04 chrjohn