test-infra
test-infra copied to clipboard
Nova Linux build job runs as root inside its Docker container
The container in question https://hub.docker.com/r/pytorch/manylinux-builder/tags.
There are reports from ExecuTorch and Torchtune about the problematic way of running Nova Linux build job as root inside the container.
- On ExecuTorch, buck2 refuses to run as root https://github.com/pytorch/executorch/actions/runs/8655199802/job/23733768589#step:14:125, which blocks ET from using the workflow. For the context, ET use buck2 to gather the source files before passing them to cmake.
- On TorchTune, running the build as root leaves some artifacts like the conda folder on the runner, which could prevent subsequent jobs to clean it up as @clee2000 discovered it https://github.com/pytorch/torchtune/actions/runs/8639310064/job/23685409883?pr=688.
Usually, running thing as root is bad, and we should figure out a way to not do it anymore
cc @seemethere @atalman @kit1980 @clee2000 @dbort @malfet
cc @atalman for NOVA workflows
@malfet: This sort of runner clean up shoud be done by the runner and not in the workflows, ARC should be able to do this cc @DanilBaibak @jeanschmidt
AI: To figure out a POC going forward for Nova workflow.
cc @atalman
https://github.com/pytorch/executorch/issues/3502 is caused by the hack I added to executorch to work around this "running as root" issue: buck2 does not like running as root.
This issue has been partially resolved now that we have a hook to chown everything back to the correct runner user.
The files are now chowned, but does the job still run with root as the user?
Checking with @atalman on this, and we have no plan to fix the run by root part atm. But, the files are now chowned to the correct user before and after the job finishes. I close the issue because there is already the buck workaround on ET. However, if you plan to remove that patch eventually, I could keep the issue open for a future fix.