actions-runner-controller icon indicating copy to clipboard operation
actions-runner-controller copied to clipboard

Fix startup.sh of runner from exiting `Bad file descriptor` error

Open kyontan opened this issue 3 years ago • 15 comments

This may fix https://github.com/actions-runner-controller/actions-runner-controller/issues/1557.

The original line sudo tee $dockerd_config_path 1>&- <<<"$dockerd_config" may intent to write string "$dockerd_config"to$dockerd_config_path without echoing to terminal (1>&-). But it doesn't work in the runner pod's bash (bash 5.0.17(1)) because of Bad file descriptorerror for the part of command:1>&-`.

The 1>&- is a special use of bash's redirection that closes 1 (stdout) and I thought it causes the error. ref: https://www.gnu.org/software/bash/manual/html_node/Redirections.html

To fix the error, I rewrite it to 1>/dev/null.

kyontan avatar Jun 23 '22 08:06 kyontan

Is it possible to add a test so that this doesn't happen again?

BeyondEvil avatar Jun 23 '22 09:06 BeyondEvil

I think it's difficult now. To prevent such a degradation, we need E2E testing that run built image and shutdown.

kyontan avatar Jun 23 '22 10:06 kyontan

Wouldn't it be sufficient to just call the startup script in this case?

BeyondEvil avatar Jun 23 '22 11:06 BeyondEvil

@kyontan FYI, we've reverted the original change in #1561 but we'd certainly revisit this when we redo #1454. Thanks a lot for the detailed report and your fix! Good job 👍

mumoshu avatar Jun 28 '22 23:06 mumoshu

@mumoshu this is the proper fix for the problem encountered in https://github.com/actions-runner-controller/actions-runner-controller/issues/1557. The 1>&- closes the file handle, but if a program treats the file handle as a file and it is closed it fails. Switching from 1>&- to 1>/dev/null ensures that there is a file handle that can be stated.

What I don't understand is why this was not a problem for anyone of us while testing. But, we all knew that our test coverage for this is not sufficient. We need proper integration tests, nothing else can help here in the future.

Fleshgrinder avatar Jun 30 '22 06:06 Fleshgrinder

@Fleshgrinder Hey! Thanks for sharing your insight. I did try reproducing it with a bash snippet and it did reproduce on my macOS with bash 3.x, but it didn't on my “bash” version 3/4/5 containers on linux(I thought our runner images are also based on ubuntu 20.04 so perhaps it's something host-os specific...? hard to believe) . So, I don't really know what kind of integration tests could have caught it. Before coming up with a perfect integration test that covers issues like that, perhaps we'd better have a guideline to write a portable bash script in the first place... or rewrite it in Go which should be more portable(and easy to test).

mumoshu avatar Jun 30 '22 07:06 mumoshu

The proper test would be to spin it up in Kubernetes. The problem is not with Bash, the problem is with tee. We have tee 8.30 inside our image and the following reproducer directly fails:

$ docker run -it --rm --entrypoint bash summerwind/actions-runner -c 'cd /tmp; tee test 1>&- <<<test'
Unable to find image 'summerwind/actions-runner:latest' locally
latest: Pulling from summerwind/actions-runner
Digest: sha256:f587d9b1ffb2e8ca3e28b42c956d498f066678b142b836cad2687b435f306bd1
Status: Downloaded newer image for summerwind/actions-runner:latest
tee: 'standard output': Bad file descriptor

Why nobody of us caught this is beyond me, as it insta fails.

I don't see how rewriting it in Go would help. Unless the entire Go program is executed as super user which would spare us the tee call, and allow us to write to files owned by root. However, the same could be done with Bash, but then every action is done with the super user, which we don't want.

Using >&- is proper Bash, but obviously it needs to be tested if the target program can handle closed fds, or not. Just bad luck, really.

Fleshgrinder avatar Jun 30 '22 08:06 Fleshgrinder

Btw. it is very common that programs expect stdout to be a proper fd, but are lenient in regards to stderr. So, I would refrain from actually stating that tee is broken, wrong, or anything. It's my fault, I should have known better, and it should have been caught while testing, even if just locally (as illustrated by the simple reproducer).

Fleshgrinder avatar Jun 30 '22 09:06 Fleshgrinder

Thanks for your response!

I don't see how rewriting it in Go would help

The only reason is Go is seemingly more portable and easier for me. At least, we'd have no chance to make it fail like this when we coded a Go program to not forward stdout from child processes.

I had been sticking with bash-based entrypoint and startup scripts just because I thought it might be easier to maintain, but if it started to look untrue.

Even if we had a comprehensive(it can't be comprehensive btw) integration test, it doesn't justify maintaining advanced bash scripts. We'd better architect it correctly and then integration-test it, rather than integration-testing it to make bad arch ok.

mumoshu avatar Jun 30 '22 09:06 mumoshu

Oh, don't get me wrong, I'm not trying to convince you to not do it in Go. Quite the contrary, the only reason I'm working on the Bash scripts is because they are, well, sorry to be so frank, bad. Hence, if you (the maintainers) feel like you can do it better in Go, please go ahead.

My only concern is that it is always very hard with normal programming languages to do this kind of stuff, without shelling out all the time. We are interacting with a lot of programs here, switching between current and super user, etc. pp. This is simply what Bash was intended for.

Also, imho, there is no reason to panic just because of one bug. There is absolutely no guarantee to not have bugs in any other program without tests. And, well, we do not have tests for this, like, none at all. This particular issue would have been found right away if we would have had a test suite. I am much more concerned about all the MTU stuff we are doing all over the place, this seems more complicated to verify.

😊

Fleshgrinder avatar Jun 30 '22 10:06 Fleshgrinder

This particular issue would have been found right away if we would have had a test suite

True! So the most needed contribution to this area would be to add tests, rather than to make it better... It's sad that I had been short on time to prioritize that. If you think we can keep entrypoint.sh and startup.sh maintainable and testable, that's good, if anyone could contribute tests for startup.sh, that would also be good (because then I don't need to rewrite it in Go)

It's just that I had observed no positive signals to convince me that those scripts are maintainable these days.

mumoshu avatar Jun 30 '22 11:06 mumoshu

I don't want to convince you. It's a question for you, the core maintainers. Happy to think about ways of testing and contribute them, also happy to continue supporting with the Bash stuff. But, no matter what, we have to keep it to the absolute minimum. Bash isn't simple. We also need the ability to actually deploy.

The entire goal of the improvements is that we can introduce shellcheck and shfmt. These are going to help every contributor to write better scripts.

Fleshgrinder avatar Jun 30 '22 11:06 Fleshgrinder

@Fleshgrinder Thanks for the suggestion and your clarification! Let me say I do agree with the goal. I just wanted to make it crystal clear that I do take the reliability and maintainability of ARC seriously and that I'm willing to do whatever I can to improve it. Bash vs Go is not my goal but an implementation detail/idea.

mumoshu avatar Jun 30 '22 12:06 mumoshu

Couldn't agree more. But, if Bash, good Bash. 😉

Fleshgrinder avatar Jun 30 '22 12:06 Fleshgrinder

+1 on shellcheck and shfmt.

BeyondEvil avatar Jul 09 '22 14:07 BeyondEvil

Thank you for raising a PR to fix the broken image, I opted to just do a revert of the PR as that was quicker and sure to "fix" the problem. Since then the scripting has been overhauled so this PR is no longer needed. Thanks again for your work though!.

toast-gear avatar Nov 26 '22 09:11 toast-gear