moon
[bug] sporadic fs::create error on high concurrency
Describe the bug
Our solution includes around 20 Moon projects, and occasionally the build fails with the following error:
app349:build | Error: fs::create
app349:build | × Failed to create C:\Development\stressmoon\.moon\cache\locks\sync-project-
app349:build | │ root.lock.
app349:build | ╰─▶ Access is denied. (os error 5)
So far, we've only observed this error on a build machine (Windows 10, 64 CPU cores). It's much harder to reproduce locally, which makes it more challenging to diagnose and fix.
To help reproduce the issue, I created a synthetic Moon solution: https://github.com/azabluda/stress-moon. It contains a few dozen trivial projects. Hopefully, you have access to a Windows machine to test this on?
Steps to reproduce
- Clone https://github.com/azabluda/stress-moon
- Run moon run :build
- Repeat if it doesn't fail the first time
Expected behavior
A clean build that exits with code 0.
Screenshots
Environment
System:
OS: Windows 10 10.0.19045
CPU: (28) x64 13th Gen Intel(R) Core(TM) i7-13850HX
Memory: 16.28 GB / 31.69 GB
Binaries:
Node: 22.16.0 - ~\AppData\Local\nvs\node\22.16.0\x64\node.EXE
npm: 10.9.2 - ~\AppData\Local\nvs\node\22.16.0\x64\npm.CMD
Managers:
Cargo: 1.82.0 - ~\.cargo\bin\cargo.EXE
pip3: 24.3.1 - ~\AppData\Local\Programs\Python\Python311\Scripts\pip3.EXE
Utilities:
Git: 2.49.0.
Curl: 8.9.1 - C:\Windows\system32\curl.EXE
Virtualization:
Docker: 27.3.1 - C:\Program Files\Docker\Docker\resources\bin\docker.EXE
Docker Compose: 2.29.7 - C:\Program Files\Docker\Docker\resources\bin\docker-compose.EXE
IDEs:
VSCode: 1.100.3 - C:\Users\zablale\AppData\Local\Programs\Microsoft VS Code\bin\code.CMD
Visual Studio: 17.14.36109.1 (Visual Studio Enterprise 2022)
Languages:
Python: 3.11.9
Rust: 1.82.0
Browsers:
Edge: Chromium (136.0.3240.76)
Internet Explorer: 11.0.19041.5794
Additional context
I looked into the code for a possible root cause. It seems the file-based locking mechanism used here https://github.com/moonrepo/starbase/blob/756c9b8756d3eaf78a56f1b5ae8e0684fb5d4ac0/crates/utils/src/fs_lock.rs#L22-L30 is not watertight. Two processes might reach the file creation step at the same time, in which case one of them could fail.
Some LLMs suggest using Named Mutexes on Windows for interprocess synchronization instead of file-based locks.
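To make the suspected race concrete, here is a minimal sketch of a lock file that is acquired with an atomic create-and-fail-if-exists call and retried for a short time. It is not the starbase implementation: `LockGuard`, `acquire_lock`, the 10 ms retry interval, and the decision to retry on `PermissionDenied` are all assumptions made for illustration. The retry on `PermissionDenied` reflects one plausible cause of "Access is denied. (os error 5)": on Windows, creating a file that another process is in the middle of deleting can fail with that error.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, ErrorKind};
use std::path::{Path, PathBuf};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical guard: closes and removes the lock file when dropped.
pub struct LockGuard {
    path: PathBuf,
    file: Option<File>,
}

impl Drop for LockGuard {
    fn drop(&mut self) {
        // Close our handle first, then remove the file (best effort).
        drop(self.file.take());
        let _ = fs::remove_file(&self.path);
    }
}

/// Hypothetical helper: take the lock by creating `path` atomically,
/// retrying until `timeout` has elapsed.
pub fn acquire_lock(path: &Path, timeout: Duration) -> io::Result<LockGuard> {
    let deadline = Instant::now() + timeout;
    loop {
        // `create_new(true)` fails if the file already exists, so the
        // existence check and the creation happen in a single step.
        match OpenOptions::new().write(true).create_new(true).open(path) {
            Ok(file) => {
                return Ok(LockGuard {
                    path: path.to_path_buf(),
                    file: Some(file),
                })
            }
            // AlreadyExists: another holder owns the lock right now.
            // PermissionDenied: assumed to be transient here (see lead-in).
            Err(err)
                if matches!(
                    err.kind(),
                    ErrorKind::AlreadyExists | ErrorKind::PermissionDenied
                ) =>
            {
                if Instant::now() >= deadline {
                    return Err(err);
                }
                thread::sleep(Duration::from_millis(10));
            }
            // Anything else is treated as a real failure.
            Err(err) => return Err(err),
        }
    }
}
```

The trade-off in this sketch is that a genuine permission problem is only reported after the timeout, because it cannot be told apart from the transient case.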
We use file locks because named mutexes in the same process don't work when multiple moon commands have been spawned and are running in parallel. This is something that happens quite often in CI environments.
As for this issue, I'll try and get around to it sometime this week but I'll be out of town over the weekend. Regardless, is the error always for sync-project-root.lock? Or does the project always change?
This is weird because each project should have its own lock, and it only ever runs once, so I'm not sure how the collision is happening.
This morning I didn’t observe any failures related to sync-project-root.lock. Instead, all the errors were for individual appXXX.lock files. In one of the runs, I even saw a whole batch of them fail concurrently:
app166:build | Error: fs::create
app166:build | × Failed to create C:\Development\stressmoon\.moon\cache\locks\sync-project-
app166:build | │ app355.lock.
app166:build | ╰─▶ Access is denied. (os error 5)
▪▪▪▪ app166:build (11s 952ms)
▪▪▪▪ app317:build
app258:build | Error: fs::create
app258:build | × Failed to create C:\Development\stressmoon\.moon\cache\locks\sync-project-
app258:build | │ app233.lock.
app258:build | ╰─▶ Access is denied. (os error 5)
▪▪▪▪ app258:build (12s 212ms)
▪▪▪▪ app282:build
app215:build | Error: fs::create
app215:build | × Failed to create C:\Development\stressmoon\.moon\cache\locks\sync-project-
app215:build | │ app292.lock.
app215:build | ╰─▶ Access is denied. (os error 5)
▪▪▪▪ app215:build (13s 206ms)
▪▪▪▪ app287:build
app232:build | Error: fs::create
app232:build | × Failed to create C:\Development\stressmoon\.moon\cache\locks\sync-project-
app232:build | │ app272.lock.
app232:build | ╰─▶ Access is denied. (os error 5)
Error: task_runner::run_failed
× Task app166:build failed to run.
╰─▶ Process moon failed: exit code 1
The problem might be explained by the fact that each appXXX:build runs moon run top:print in a separate process. My synthetic setup forms an ultra-wide diamond shape: all apps converge on the same top-level task.
Made some tweaks in v1.37.2, give that a try.
With that said, running nested moon commands has a chance of collisions. You should probably run moon run with --no-actions for the nested command.
Have you had a chance to try this?
Yes, we're testing this in stages. So far, we have only added the --no-actions flag without updating our Moon version.
This has significantly improved the stability of our CI/CD builds, just as you suggested. Across hundreds of runs, we've only observed one unexplained fs::create error, this time for proto-install.lock.
While this could still be a configuration issue on our end, could you please confirm if the proto-install.lock logic respects the --no-actions flag?
Our next step is to update to Moon v1.37.2+ in our pipelines. Given the significant improvement, please feel free to close this issue. If any problems arise after we update, I will open a new one.
The proto install does not happen in an action but happens during boot, so that one is unavoidable.
However, you could pre-install proto before running moon commands to avoid the checks.
I wonder if we even need proto in Tier 0?
@azabluda proto only gets installed if it's actually going to be used. what's your toolchain config look like?
@milesj we don't have a toolchain.yml. Basically we are just calling powershell commands to build our application.
Ah interesting, looks like a bug. We acquire the lock even if we don't need to install proto: https://github.com/moonrepo/moon/blob/master/crates/app/src/systems/analyze.rs#L81
Easy fix.
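The idea behind the fix can be sketched as follows: check whether an install is needed before touching the lock, and check again once the lock is held. This is not the code in analyze.rs. `install_if_needed` and its three closure parameters are hypothetical names chosen for illustration.

```rust
use std::io;

/// Hypothetical helper. Returns Ok(true) if `install` ran, Ok(false) if it
/// was skipped. `should_install`, `acquire_lock`, and `install` are
/// placeholders for whatever the real application does.
pub fn install_if_needed<S, L, G, I>(
    should_install: S,
    acquire_lock: L,
    install: I,
) -> io::Result<bool>
where
    S: Fn() -> bool,
    L: FnOnce() -> io::Result<G>,
    I: FnOnce() -> io::Result<()>,
{
    // Fast path: nothing to install, so the lock file is never created.
    if !should_install() {
        return Ok(false);
    }

    // The guard is held until the end of this function.
    let _guard = acquire_lock()?;

    // Re-check under the lock: another process may have completed the
    // install while this one was waiting.
    if !should_install() {
        return Ok(false);
    }

    install()?;
    Ok(true)
}
```

With this shape, a workspace that never needs the tool never contends on the lock file at all.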
Tested long enough with 1.38.6. No more locking issues! Good job @milesj 👍