cli flaky test: TestPromptForConfirmation/case=reader

Description

test was added on https://github.com/docker/cli/commit/10bf91a02d2abd4dec90b79bd10b91f6fbb8e05d / https://github.com/docker/cli/pull/4888

This test looks to be flaky on macOS at least; seen it fail a couple of times.

ok  	github.com/docker/cli/cli	0.242s	coverage: 28.2% of statements
--- FAIL: TestPromptForConfirmation (0.38s)
    --- FAIL: TestPromptForConfirmation/case=reader_closed (0.19s)
        utils_test.go:222: PromptForConfirmation did not return after promptReader was closed
FAIL
panic: send on closed channel

goroutine 111 [running]:
github.com/docker/cli/cli/command_test.TestPromptForConfirmation.func8.2()
	/Users/runner/work/cli/cli/src/github.com/docker/cli/cli/command/utils_test.go:209 +0x6d
created by github.com/docker/cli/cli/command_test.TestPromptForConfirmation.func8 in goroutine 110
	/Users/runner/work/cli/cli/src/github.com/docker/cli/cli/command/utils_test.go:207 +0x3aa

Mar 17 '24 13:03 thaJeztah

cc @Benehiko

Mar 17 '24 13:03 thaJeztah

Good point from @tonistiigi here – https://github.com/docker/cli/pull/4888#discussion_r1525644870.

A more holistic approach would be to always set up signal handling in the CLI, and use it to cancel the command's context, and just migrate most places in the codebase that are setting up their own signal handling to instead handle context cancellation – funny that we do this for plugins but not for anything else.

Mar 18 '24 11:03 laurazard

Ah, yes, I saw that comment, and wanted to ask @krissetto about it as well.

A more holistic approach would be to always set up signal handling in the CLI, and use it to cancel the command's context,

Sounds like a sensible approach yes (or at least, that's my first response; contexts are always fun)

Mar 18 '24 11:03 thaJeztah

Good point from @tonistiigi here – #4888 (comment).

A more holistic approach would be to always set up signal handling in the CLI, and use it to cancel the command's context, and just migrate most places in the codebase that are setting up their own signal handling to instead handle context cancellation – funny that we do this for plugins but not for anything else.

I think this can only work once we have wired up context.Context. There are some PRs I've already made for this to support otel tracing. So I would think we could move the signal termination out from the prompt function to the main function running the cobra command once these PRs are merged.

Mar 18 '24 11:03 Benehiko

Good catch. I agree with always setting up a signal handler in the CLI and cancelling with the context, but i'm not sure how many changes it'd require in the codebase ATM to pass the context everywhere it might be needed.

I think what confused me was this part of the signal.NotifyContext docs:

The stop function unregisters the signal behavior, which, like signal.Reset, may restore the default behavior for a given signal. For example, the default behavior of a Go program receiving os.Interrupt is to exit. Calling NotifyContext(parent, os.Interrupt) will change the behavior to cancel the returned context.

I thought that once a signal got consumed, the original behavior would be restored (so the default signal handling behavior of the application).

That said, I think this specific panic is only related to the test, as the code itself doesn't explicitly close any channels (the result chan in this case) in a way that could result in this issue.

Particularly this part at lines 205-210 of utils_test.go might be at fault

result := make(chan promptResult, 1)
defer close(result)
go func() {
    r, err := command.PromptForConfirmation(ctx, promptReader, promptOut, "")
    result <- promptResult{r, err}
}()

Mar 18 '24 12:03 krissetto

Oh! I think I meant to @ @Benehiko, and for some reason pinged @krissetto 🙈 - well, both of you are here now, so all good 😂

Mar 18 '24 12:03 thaJeztah

Looking at this error, I think the timeout might be too little for some machines. I've not been able to reproduce this flake locally on my laptop so the flake might be dependent on the machine running it.

I'm guessing the timeout for the context is reached (100 milliseconds) and then the test is ended with t.Fatal() in the process of this, the function executing the test exits closing the result channel. However, the goroutine is still running waiting for a result from the prompt function. The prompt does return eventually and tries to send on the closed result channel causing the program to panic.

This is a tricky one since there is no way to guarantee that closing the reader will exit the prompt in a given time frame since this would be dependent on the OS and machine hardware. I could prevent the panic by not closing the channels inside the test and leaving it up to the garbage collector and hope that an increase to say 500ms or 1 second timeout is enough for most cases.

Mar 21 '24 09:03 Benehiko

flaky test: TestPromptForConfirmation/case=reader_closed

Description