br icon indicating copy to clipboard operation
br copied to clipboard

Test pauseSuite.TestPause fails

Open overvenus opened this issue 4 years ago • 13 comments

  1. What did you do? If possible, provide a recipe for reproducing the error.
[2021-03-02T11:32:19.313Z] FAIL: pause_test.go:62: pauseSuite.TestPause
[2021-03-02T11:32:19.313Z] 
[2021-03-02T11:32:19.313Z] pause_test.go:99:
[2021-03-02T11:32:19.313Z]     c.Assert(&wg, unblocksBetween, 500*time.Millisecond, 520*time.Millisecond)
[2021-03-02T11:32:19.313Z] ... waitGroupPtr *sync.WaitGroup = &sync.WaitGroup{noCopy:sync.noCopy{}, state1:[3]uint32{0x0, 0x0, 0x0}}
[2021-03-02T11:32:19.313Z] ... min time.Duration = 500000000 ("500ms")
[2021-03-02T11:32:19.313Z] ... max time.Duration = 520000000 ("520ms")
[2021-03-02T11:32:19.313Z] ... WaitGroup did not unblock after maximum duration

https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5062/pipeline/69#step-360-log-118

overvenus avatar Mar 02 '21 11:03 overvenus

I'm afraid this was caused by random pause of virtual machines / containers...(Or, GC?) Anyway, let's try to reproduce this.

YuJuncen avatar Mar 02 '21 15:03 YuJuncen

For making sure there isn't a deadlock or something, I set max unblock time to 2s and run it about 4,000 times, but no WaitGroup did not unblock after maximum duration found.

(However, there is a WaitGroup unblocked before minimum duration found...🤔)

With this script:
#!/bin/bash
i=0
success=0
while ((i<10)); do
if ! go test -count=1 github.com/pingcap/br/pkg/lightning/common 2>&1 >test_result.txt; then
        cp test_result.txt test_result_failed$i.txt
        ((i++))
else
        ((success++))
        if (( success % 50 == 1)); then
                echo -e "\rsuccess $success times..."
        fi
        if (( success > 1000000 )); then
                break
        fi
fi
done

YuJuncen avatar Mar 03 '21 01:03 YuJuncen

Other failed tests:

  • https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5142/pipeline#step-1011-log-44
  • https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5191/pipeline#step-477-log-117
  • https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5202/pipeline#step-282-log-186
  • https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5197/pipeline/69#step-1067-log-42
  • https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5251/pipeline

overvenus avatar Mar 03 '21 06:03 overvenus

This time WaitGroup unblocked before minimum duration...

YuJuncen avatar Mar 03 '21 07:03 YuJuncen

Well, I changed the checker unblocksBetween, when it failed, it would print the real duration it cost to got the waitgroup unblocked.

With this version, I will keep running the script above, if we got real wait time like 498ms, 523ms, etc., I think this problem would probably caused by container time drifting, or something likewise(say, GC). Then maybe we can loose the condition of our test case.

Or, would there be other magic made this all? Let me check those code again.

YuJuncen avatar Mar 03 '21 07:03 YuJuncen

3/800 times failed.

Info:

WaitGroup unblocked before minimum duration, it was 499.294605ms
WaitGroup unblocked before minimum duration, it was 499.971467ms
WaitGroup unblocked before minimum duration, it was 499.114981ms

YuJuncen avatar Mar 03 '21 07:03 YuJuncen

Thanks to @3pointer, who found that there are two goroutines racing: https://github.com/pingcap/br/blob/a1e4b0154711dd285ae57bffcf4777f3ad9a72db/pkg/lightning/common/pause_test.go#L128-L131 ..and https://github.com/pingcap/br/blob/a1e4b0154711dd285ae57bffcf4777f3ad9a72db/pkg/lightning/common/pause_test.go#L42-L46

If the former one scheduled firstly, the time.Since(start) would be less than 500ms.

How about WaitGroup did not unblock after maximum duration? GC, I guess. This case is rarer than the former(i.e. WaitGroup unblocked before minimum duration), and wait time exceeds only milliseconds. Set the upper bound to 530ms probably works.

YuJuncen avatar Mar 03 '21 09:03 YuJuncen

Well, I changed the checker unblocksBetween, when it failed, it would print the real duration it cost to got the waitgroup unblocked.

With this version, I will keep running the script above, if we got real wait time like 498ms, 523ms, etc., I think this problem would probably caused by container time drifting, or something likewise(say, GC). Then maybe we can loose the condition of our test case.

Or, would there be other magic made this all? Let me check those code again.

I have looked at the test code again: https://github.com/pingcap/br/blob/87ed62884b0f5df0e951d7ee9cd5fb5d0aaaa41c/pkg/lightning/common/pause_test.go#L94-L99

the time difference is actually the time delay of L99 start after L94~97 start. In theory, they should start at the same time, but in practice, they are likely not. So I think start the timer for unblocksChecker before L94 should solve this.

glorv avatar Mar 03 '21 11:03 glorv

@glorv I would prefer to move start := time.Now() at L43 outside the go block. Because the c.Assert(...) statement would block.

Let the script run for more time, if all right, I think a PR can be provided tomorrow.

YuJuncen avatar Mar 03 '21 11:03 YuJuncen

New failed tests:

  • https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/5343/pipeline

overvenus avatar Mar 04 '21 08:03 overvenus

Another failed case:

https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_build/detail/br_ghpr_build/3316/pipeline

Seem waiting for 800ms still can't avoid this issue.

[2021-03-10T10:51:05.341Z] ----------------------------------------------------------------------

[2021-03-10T10:51:05.341Z] FAIL: pause_test.go:67: pauseSuite.TestPause

[2021-03-10T10:51:05.341Z] 

[2021-03-10T10:51:05.341Z] pause_test.go:106:

[2021-03-10T10:51:05.341Z]     // Give them more time to unblock in case of time exceeding due to high pressure of CI.

[2021-03-10T10:51:05.341Z]     c.Assert(&wg, unblocksBetween, 500*time.Millisecond, 800*time.Millisecond)

[2021-03-10T10:51:05.341Z] ... waitGroupPtr *sync.WaitGroup = &sync.WaitGroup{noCopy:sync.noCopy{}, state1:[3]uint32{0x0, 0x0, 0x0}}

[2021-03-10T10:51:05.341Z] ... min time.Duration = 500000000 ("500ms")

[2021-03-10T10:51:05.341Z] ... max time.Duration = 800000000 ("800ms")

[2021-03-10T10:51:05.341Z] ... WaitGroup did not unblock after maximum duration, it was 814.985517ms

[2021-03-10T10:51:05.341Z] 

[2021-03-10T10:51:05.341Z] OOPS: 16 passed, 1 FAILED

[2021-03-10T10:51:05.341Z] --- FAIL: TestCommon (6.99s)

glorv avatar Mar 10 '21 11:03 glorv

Another failed case: https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/br_ghpr_unit_and_integration_test/detail/br_ghpr_unit_and_integration_test/6030/pipeline#step-308-log-123

overvenus avatar Mar 11 '21 11:03 overvenus

I think the problem here is the upper bound of wait time should be infinite(since CPU can pause a process for arbitrary time), but we must specify an upper bound for making sure there isn't a dead lock...

YuJuncen avatar Mar 18 '21 08:03 YuJuncen