br Test pauseSuite.TestPause fails

What did you do? If possible, provide a recipe for reproducing the error.

[2021-03-02T11:32:19.313Z] FAIL: pause_test.go:62: pauseSuite.TestPause
[2021-03-02T11:32:19.313Z] 
[2021-03-02T11:32:19.313Z] pause_test.go:99:
[2021-03-02T11:32:19.313Z]     c.Assert(&wg, unblocksBetween, 500*time.Millisecond, 520*time.Millisecond)
[2021-03-02T11:32:19.313Z] ... waitGroupPtr *sync.WaitGroup = &sync.WaitGroup{noCopy:sync.noCopy{}, state1:[3]uint32{0x0, 0x0, 0x0}}
[2021-03-02T11:32:19.313Z] ... min time.Duration = 500000000 ("500ms")
[2021-03-02T11:32:19.313Z] ... max time.Duration = 520000000 ("520ms")
[2021-03-02T11:32:19.313Z] ... WaitGroup did not unblock after maximum duration

https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5062/pipeline/69#step-360-log-118

Mar 02 '21 11:03 overvenus

I'm afraid this was caused by random pause of virtual machines / containers...(Or, GC?) Anyway, let's try to reproduce this.

Mar 02 '21 15:03 YuJuncen

For making sure there isn't a deadlock or something, I set max unblock time to 2s and run it about 4,000 times, but no WaitGroup did not unblock after maximum duration found.

(However, there is a WaitGroup unblocked before minimum duration found...🤔)

With this script:

#!/bin/bash
i=0
success=0
while ((i<10)); do
if ! go test -count=1 github.com/pingcap/br/pkg/lightning/common 2>&1 >test_result.txt; then
        cp test_result.txt test_result_failed$i.txt
        ((i++))
else
        ((success++))
        if (( success % 50 == 1)); then
                echo -e "\rsuccess $success times..."
        fi
        if (( success > 1000000 )); then
                break
        fi
fi
done

Mar 03 '21 01:03 YuJuncen

Other failed tests:

https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5142/pipeline#step-1011-log-44
https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5191/pipeline#step-477-log-117
https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5202/pipeline#step-282-log-186
https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5197/pipeline/69#step-1067-log-42
https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5251/pipeline

Mar 03 '21 06:03 overvenus

This time WaitGroup unblocked before minimum duration...

Mar 03 '21 07:03 YuJuncen

Well, I changed the checker unblocksBetween, when it failed, it would print the real duration it cost to got the waitgroup unblocked.

With this version, I will keep running the script above, if we got real wait time like 498ms, 523ms, etc., I think this problem would probably caused by container time drifting, or something likewise(say, GC). Then maybe we can loose the condition of our test case.

Or, would there be other magic made this all? Let me check those code again.

Mar 03 '21 07:03 YuJuncen

3/800 times failed.

Info:

WaitGroup unblocked before minimum duration, it was 499.294605ms
WaitGroup unblocked before minimum duration, it was 499.971467ms
WaitGroup unblocked before minimum duration, it was 499.114981ms

Mar 03 '21 07:03 YuJuncen

Thanks to @3pointer, who found that there are two goroutines racing: https://github.com/pingcap/br/blob/a1e4b0154711dd285ae57bffcf4777f3ad9a72db/pkg/lightning/common/pause_test.go#L128-L131 ..and https://github.com/pingcap/br/blob/a1e4b0154711dd285ae57bffcf4777f3ad9a72db/pkg/lightning/common/pause_test.go#L42-L46

If the former one scheduled firstly, the time.Since(start) would be less than 500ms.

How about WaitGroup did not unblock after maximum duration? GC, I guess. This case is rarer than the former(i.e. WaitGroup unblocked before minimum duration), and wait time exceeds only milliseconds. Set the upper bound to 530ms probably works.

Mar 03 '21 09:03 YuJuncen

Well, I changed the checker unblocksBetween, when it failed, it would print the real duration it cost to got the waitgroup unblocked.

With this version, I will keep running the script above, if we got real wait time like 498ms, 523ms, etc., I think this problem would probably caused by container time drifting, or something likewise(say, GC). Then maybe we can loose the condition of our test case.

Or, would there be other magic made this all? Let me check those code again.

I have looked at the test code again: https://github.com/pingcap/br/blob/87ed62884b0f5df0e951d7ee9cd5fb5d0aaaa41c/pkg/lightning/common/pause_test.go#L94-L99

the time difference is actually the time delay of L99 start after L94~97 start. In theory, they should start at the same time, but in practice, they are likely not. So I think start the timer for unblocksChecker before L94 should solve this.

Mar 03 '21 11:03 glorv

@glorv I would prefer to move start := time.Now() at L43 outside the go block. Because the c.Assert(...) statement would block.

Let the script run for more time, if all right, I think a PR can be provided tomorrow.

Mar 03 '21 11:03 YuJuncen

New failed tests:

https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5343/pipeline

Mar 04 '21 08:03 overvenus

Another failed case:

https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_build/detail/br_ghpr_build/3316/pipeline

Seem waiting for 800ms still can't avoid this issue.

[2021-03-10T10:51:05.341Z] ----------------------------------------------------------------------

[2021-03-10T10:51:05.341Z] FAIL: pause_test.go:67: pauseSuite.TestPause

[2021-03-10T10:51:05.341Z] 

[2021-03-10T10:51:05.341Z] pause_test.go:106:

[2021-03-10T10:51:05.341Z]     // Give them more time to unblock in case of time exceeding due to high pressure of CI.

[2021-03-10T10:51:05.341Z]     c.Assert(&wg, unblocksBetween, 500*time.Millisecond, 800*time.Millisecond)

[2021-03-10T10:51:05.341Z] ... waitGroupPtr *sync.WaitGroup = &sync.WaitGroup{noCopy:sync.noCopy{}, state1:[3]uint32{0x0, 0x0, 0x0}}

[2021-03-10T10:51:05.341Z] ... min time.Duration = 500000000 ("500ms")

[2021-03-10T10:51:05.341Z] ... max time.Duration = 800000000 ("800ms")

[2021-03-10T10:51:05.341Z] ... WaitGroup did not unblock after maximum duration, it was 814.985517ms

[2021-03-10T10:51:05.341Z] 

[2021-03-10T10:51:05.341Z] OOPS: 16 passed, 1 FAILED

[2021-03-10T10:51:05.341Z] --- FAIL: TestCommon (6.99s)

Mar 10 '21 11:03 glorv

Another failed case: https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/6030/pipeline#step-308-log-123

Mar 11 '21 11:03 overvenus

I think the problem here is the upper bound of wait time should be infinite(since CPU can pause a process for arbitrary time), but we must specify an upper bound for making sure there isn't a dead lock...

Mar 18 '21 08:03 YuJuncen

br br copied to clipboard

Test pauseSuite.TestPause fails

br
br copied to clipboard