br
br copied to clipboard
Test pauseSuite.TestPause fails
- What did you do? If possible, provide a recipe for reproducing the error.
[2021-03-02T11:32:19.313Z] FAIL: pause_test.go:62: pauseSuite.TestPause
[2021-03-02T11:32:19.313Z]
[2021-03-02T11:32:19.313Z] pause_test.go:99:
[2021-03-02T11:32:19.313Z] c.Assert(&wg, unblocksBetween, 500*time.Millisecond, 520*time.Millisecond)
[2021-03-02T11:32:19.313Z] ... waitGroupPtr *sync.WaitGroup = &sync.WaitGroup{noCopy:sync.noCopy{}, state1:[3]uint32{0x0, 0x0, 0x0}}
[2021-03-02T11:32:19.313Z] ... min time.Duration = 500000000 ("500ms")
[2021-03-02T11:32:19.313Z] ... max time.Duration = 520000000 ("520ms")
[2021-03-02T11:32:19.313Z] ... WaitGroup did not unblock after maximum duration
https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5062/pipeline/69#step-360-log-118
I'm afraid this was caused by random pause of virtual machines / containers...(Or, GC?) Anyway, let's try to reproduce this.
For making sure there isn't a deadlock or something, I set max unblock time to 2s and run it about 4,000 times, but no WaitGroup did not unblock after maximum duration
found.
(However, there is a WaitGroup unblocked before minimum duration
found...🤔)
With this script:
#!/bin/bash
i=0
success=0
while ((i<10)); do
if ! go test -count=1 github.com/pingcap/br/pkg/lightning/common 2>&1 >test_result.txt; then
cp test_result.txt test_result_failed$i.txt
((i++))
else
((success++))
if (( success % 50 == 1)); then
echo -e "\rsuccess $success times..."
fi
if (( success > 1000000 )); then
break
fi
fi
done
Other failed tests:
- https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5142/pipeline#step-1011-log-44
- https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5191/pipeline#step-477-log-117
- https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5202/pipeline#step-282-log-186
- https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5197/pipeline/69#step-1067-log-42
- https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5251/pipeline
This time WaitGroup unblocked before minimum duration
...
Well, I changed the checker unblocksBetween
, when it failed, it would print the real duration it cost to got the waitgroup
unblocked.
With this version, I will keep running the script above, if we got real wait time like 498ms
, 523ms
, etc., I think this problem would probably caused by container time drifting, or something likewise(say, GC). Then maybe we can loose the condition of our test case.
Or, would there be other magic made this all? Let me check those code again.
3/800 times failed.
Info:
WaitGroup unblocked before minimum duration, it was 499.294605ms
WaitGroup unblocked before minimum duration, it was 499.971467ms
WaitGroup unblocked before minimum duration, it was 499.114981ms
Thanks to @3pointer, who found that there are two goroutines racing: https://github.com/pingcap/br/blob/a1e4b0154711dd285ae57bffcf4777f3ad9a72db/pkg/lightning/common/pause_test.go#L128-L131 ..and https://github.com/pingcap/br/blob/a1e4b0154711dd285ae57bffcf4777f3ad9a72db/pkg/lightning/common/pause_test.go#L42-L46
If the former one scheduled firstly, the time.Since(start)
would be less than 500ms.
How about WaitGroup did not unblock after maximum duration
? GC, I guess. This case is rarer than the former(i.e. WaitGroup unblocked before minimum duration
), and wait time exceeds only milliseconds. Set the upper bound to 530ms probably works.
Well, I changed the checker
unblocksBetween
, when it failed, it would print the real duration it cost to got thewaitgroup
unblocked.With this version, I will keep running the script above, if we got real wait time like
498ms
,523ms
, etc., I think this problem would probably caused by container time drifting, or something likewise(say, GC). Then maybe we can loose the condition of our test case.Or, would there be other magic made this all? Let me check those code again.
I have looked at the test code again: https://github.com/pingcap/br/blob/87ed62884b0f5df0e951d7ee9cd5fb5d0aaaa41c/pkg/lightning/common/pause_test.go#L94-L99
the time difference is actually the time delay of L99 start after L94~97 start. In theory, they should start at the same time, but in practice, they are likely not. So I think start the timer for unblocksChecker
before L94 should solve this.
@glorv I would prefer to move start := time.Now()
at L43 outside the go block. Because the c.Assert(...)
statement would block.
Let the script run for more time, if all right, I think a PR can be provided tomorrow.
New failed tests:
- https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5343/pipeline
Another failed case:
https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_build/detail/br_ghpr_build/3316/pipeline
Seem waiting for 800ms
still can't avoid this issue.
[2021-03-10T10:51:05.341Z] ----------------------------------------------------------------------
[2021-03-10T10:51:05.341Z] FAIL: pause_test.go:67: pauseSuite.TestPause
[2021-03-10T10:51:05.341Z]
[2021-03-10T10:51:05.341Z] pause_test.go:106:
[2021-03-10T10:51:05.341Z] // Give them more time to unblock in case of time exceeding due to high pressure of CI.
[2021-03-10T10:51:05.341Z] c.Assert(&wg, unblocksBetween, 500*time.Millisecond, 800*time.Millisecond)
[2021-03-10T10:51:05.341Z] ... waitGroupPtr *sync.WaitGroup = &sync.WaitGroup{noCopy:sync.noCopy{}, state1:[3]uint32{0x0, 0x0, 0x0}}
[2021-03-10T10:51:05.341Z] ... min time.Duration = 500000000 ("500ms")
[2021-03-10T10:51:05.341Z] ... max time.Duration = 800000000 ("800ms")
[2021-03-10T10:51:05.341Z] ... WaitGroup did not unblock after maximum duration, it was 814.985517ms
[2021-03-10T10:51:05.341Z]
[2021-03-10T10:51:05.341Z] OOPS: 16 passed, 1 FAILED
[2021-03-10T10:51:05.341Z] --- FAIL: TestCommon (6.99s)
Another failed case: https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/6030/pipeline#step-308-log-123
I think the problem here is the upper bound of wait time should be infinite(since CPU can pause a process for arbitrary time), but we must specify an upper bound for making sure there isn't a dead lock...