cli
cli copied to clipboard
Fix "Access is Denied" error on windows (with lock file)
- What I did
Fix a bug where, if one process is reading a meta.json
on Windows, another CreateOrUpdate
will fail with Access is Denied
.
Example error:
rename C:\Users\docker\.docker\contexts\meta\<id>\.tmp-meta.json2945721760 C:\Users\docker\.docker\contexts\meta\<id>\meta.json: Access is denied.
- How I did it
Previously the code attempts to use moby's ioutils.AtomicWriteFile (called from here) to update meta.json
but this fails on Windows calling os.Rename if another process has the meta.json
open for reading (unlike on Unix OSes).
Avoid this problem by serialising accesses to the context filesystem db using a lock file via github.com/gofrs/flock which uses LockFileEx on Windows and syscall.Flock on Unix.
The lock is associated with a file descriptor / handle and released by calling Close()
or when the process exits. The Microsoft docs have:
If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system.
and man 2 flock has:
Furthermore, the lock is released either by an explicit LOCK_UN operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.
Note the Unlock() function returns an error which means that functions which could previously fail for one reason can now fail for two (the original + a lock/unlock failure). Use github.com/hashicorp/go-multierror to return all the errors. When we upgrade to go 1.20 we can use errors.Join.
- How to verify it
I saw this in CI where one process was running a docker
command which read a meta.json
while another was trying to update the meta.json
. I've just seen it in a Docker Desktop bug report this morning.
- Description for the changelog
- Fix race on Windows which could cause concurrent context access to fail with a transient
Access is Denied
.
- A picture of a cute animal (not mandatory but encouraged)
Codecov Report
Merging #4725 (ab429fe) into master (70d01b9) will decrease coverage by
0.12%
. Report is 10 commits behind head on master. The diff coverage is29.76%
.
Additional details and impacted files
@@ Coverage Diff @@
## master #4725 +/- ##
==========================================
- Coverage 59.66% 59.55% -0.12%
==========================================
Files 287 287
Lines 24760 24831 +71
==========================================
+ Hits 14774 14788 +14
- Misses 9100 9138 +38
- Partials 886 905 +19
Previously the code attempts to use moby's ioutils.AtomicWriteFile (called from here) to update meta.json but this fails on Windows calling os.Rename if another process has the meta.json open for reading (unlike on Unix OSes).
Didn't look at the code yet, but do you think this is something we should have as part of that package? (i.e., would other uses of the package have the same issue?)
Previously the code attempts to use moby's ioutils.AtomicWriteFile (called from here) to update meta.json but this fails on Windows calling os.Rename if another process has the meta.json open for reading (unlike on Unix OSes).
Didn't look at the code yet, but do you think this is something we should have as part of that package? (i.e., would other uses of the package have the same issue?)
I think other users of the package on Windows will definitely have the same issue. A colleague (@fredericdalleau ) separately suggested a possibility which is to use a different file sharing mode when opening the file on Windows, to permit the rename on top. This is the approach we use in Docker Desktop's "grpcfuse" filesharing to make Windows file I/O more like Unix file I/O. I'll take a look at that possibility too.
A colleague (@fredericdalleau ) separately suggested a possibility which is to use a different file sharing mode when opening the file on Windows, to permit the rename on top.
For others; this touched on the FILE_SHARE_DELETE
discussion (again); which was a long discussion with the Go maintainers where they disregarded the Windows kernel team recommendations, and went their own way, and suggested “Just fork the code in your project” :disappointed: https://github.com/golang/go/issues/32088 https://github.com/golang/go/issues/32088#issuecomment-538704393 (also https://github.com/golang/go/issues/34681)
Which ... is what we ended up doing in the docker daemon 😞 😞 😞 😞 ; https://github.com/moby/moby/blob/199793350807908c04b56741b11e6b943158ac51/daemon/logger/loggerutils/file_windows.go#L25-L72
Thanks for the FILE_SHARE_DELETE
links. To my surprise it doesn't seem to be enough for this case. In this branch I open the file with the delete flag, but os.Rename
still fails. The only new thing I can do is delete the file and then write a new one, but this won't be atomic so any readers will see a transient "file not found" exception.
Have a look at this unit test in that other branch: https://github.com/djs55/cli/commit/81b67b23a1817257445b5583f6abb565dd58cb9b
I think the only way to have clients see the Unix behaviour on Windows is to either serialise calls with a lock(file) or handle the transient errors with a sleep/retry loop. In the current code clients will see transient "Access is Denied" if there is a concurrent read; if we use FILE_SHARE_DELETE
+ os.Unlink
+ os.Rename
instead then clients will see transient "File not found".
IIUC, the general issue is (potentially silly comments ahead);
- this problem occurred when concurrent processes manipulate the context storage
- :point_up: this definitely is an issue in general; much of this code was written with the assumption that there's only a single, short-lived "cli" process, but it does not take concurrent instances into account
- :point_up: adding to this is the unfortunate situation where CLI plugins may also manipulate (or read) contexts. In this case "both" the parent process (the "CLI") and a plugin (say "buildx") could try to manipulate (or read) these locations on their own
- :point_up: for the above situation, we have some work to do, not just for that case, but also to prevent all plugins to have to re-implement the context-management logic; we've been contemplating the "main" CLI to act as a short-lived "daemon" (for the duration of the CLI), and expose a socket that plugins can connect to; this socket could provide an API (plugin asks CLI for context information, or to update/create/manipulate; same for things like
config
etc); Initial (very early/partial) building-blocks for this were added in @laurazard's PR for handling signals/termination; https://github.com/docker/cli/pull/4599, but no further work has been done (yet).
- I'm curious if this issue only happens when manipulating contexts, or also on concurrent read operations (two CLI instances running
docker context ls
in parallel)? - Having a "lock" file could definitely help here, but I also have some initial concerns;
- Should the CLI be able to run on a read-only filesystem? (I somewhat expect that to be an option); in that case we may need handling to prevent "failing to write lock-file" from causing a failiure (we could fallback to old behavior, and just go ahead)
- What's the risk of lock-files being left behind, and rendering the CLI broken?
- As we now need to write a lock-file for any operation (including "show context", "list contexts"); what's the additional overhead? (Assuming read operations to be a lot more common than write)
- Could we use the metadata-files themselves as "lock"? i.e. keep a lock on the file when starting a write-operation? (I guess this may be partially your comment on "retry")