Write downloaded model files atomically
Avoids scenarios where one process may be reading a file that another is still writing, by moving the file to its final location only after it has been fully written.
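A minimal sketch of the pattern (illustrative only, not the actual diff): write to a temporary file in the destination directory, then rename it into place, so a reader can never observe a half-written file.

```python
import os
import tempfile

def write_atomically(data: bytes, final_path: str) -> None:
    # The temp file must live on the same filesystem as the target,
    # otherwise the rename below is no longer atomic.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, final_path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```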
@naktinis, does this PR solve a currently known problem? In particular, it would be great to know if it solves any of the many, many open issues concerning NLTK download errors.
@ekaf this is an attempt to solve an issue we are having despite installing directly from develop branch (which already includes this unreleased fix: https://github.com/nltk/nltk/commit/69635b464f2adfa56e4f2ba2db7e2a72d11ac1ba).
The error we get almost every time is "Error with downloaded zip file".
Our scenario is that we launch multiple processes simultaneously in an environment where there are no models downloaded yet.
So it appears the following sequence is quite likely in such circumstances (and my guess is that it's causing our error; a rough code illustration follows the list):
- process A checks if the model is downloaded – it is not, so clears the file and starts the download
- process B checks if the model is downloaded – it still is not, so clears the file and starts the download too
- process A finishes the download
- process A opens the model file – but process B is writing to it
- process A tries reading the content of the model file and fails because it is not fully downloaded
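To make this concrete, here is a rough illustration (not the reproduction script from the issue linked below): several processes started against an empty data directory all see the package as missing and download it concurrently.

```python
import multiprocessing

import nltk

def fetch(download_dir):
    # Each process finds no local copy, so each starts its own download,
    # and one may read the zip while another is still writing it.
    nltk.download("punkt", download_dir=download_dir)

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=fetch, args=("/tmp/nltk_data",))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```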
I couldn't see any directly related issues currently reported, but if you have any that seem relevant, I'm happy to take a look. I can also report this issue separately to have it associated with this PR, if that's helpful.
@naktinis, your 5-point sequence above appears very plausible. However, if your assumption is correct, then we should expect this sequence to also have caused some of the many download- and zip-file-related issues that have been reported over the past years. Searching for "download" in issues yields a long list. Most are closed, but the problems keep returning mysteriously. It seems likely that some of them resemble your scenario, and it would be really great if your solution could finally fix them. Otherwise, it could indeed be quite helpful if you would consider opening a new issue describing your specific problem, mentioning in particular which package you tried to install.
@ekaf I went through all open issues mentioning "download" and couldn't find anything obviously matching this, so I created an issue that includes a script to reproduce this: https://github.com/nltk/nltk/issues/3248.
Thanks @naktinis for the issue and the test script. Your scenario seems plausible, because the NLTK downloader dates back to a time when running multiple parallel processes was less common than it is now.
But given the large numbers of users who have had problems with downloading NLTK data packages, it seems surprising that none of them had similar errors. This would mean that you are downloading in a different way from everybody else, and I wonder what that could be?
Hopefully, some users with download problems will try your solution, and see if it works for them.
@naktinis, sorry for my now deleted suggestion by Copilot about a lock mechanism. It proved useless, and further dialog with Copilot was frustrating and a complete waste of time.
Once this logic is in place (ensuring atomic file moves), we can guarantee that a file in the final destination was written after a complete download (and is not some interrupted half-download). Then an os.path.isfile(model_path) check could give better guarantees that we don't need to retry the download.
The only theoretical issue would be the complete download getting corrupted in transit (network or disk errors), but then we could use force=True or, in this hopefully rare scenario, simply clean up the model and retry.
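A sketch of that check-and-retry (ensure_zip_package and the zip-validity test are illustrative, not existing NLTK API; the zip path is assumed to match where nltk.download places the package):

```python
import os
import zipfile

import nltk

def ensure_zip_package(package_id: str, zip_path: str) -> None:
    # With atomic moves, an existing file implies the download completed.
    if os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path):
        return
    # Missing, or corrupted in transit: force a clean re-download.
    nltk.download(package_id, force=True)
```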
Finally, if we want to avoid double downloads (e.g. if we start a service with multiple nltk processes that all try to download), we could introduce locks on the temporary file. We'd probably need some platform-independent library like https://pypi.org/project/filelock/ (I haven't tried it myself yet, only Linux flock).
The real problem is to avoid multiple useless downloads of the same file, because script kiddies could use this vulnerability to launch a distributed attack on the download servers. Writing files atomically makes the problem worse, because it would be preferable that these useless downloads fail. So I am completely opposed to your solution, and would personally reject it. Maybe other NLTK developers have a different viewpoint, though.
What I tried to explain in my comment above is that atomic writing can go together with locking. It would work like this:
- try to obtain a lock on /model-dir/model.tmp
- if the lock can not be obtained, wait for the lock to expire (after we eventually get the lock, check if the final target already has the file, meaning another process properly downloaded it there)
- if the lock can be obtained (assuming the final target does not exist yet), download to /model-dir/model.tmp
- when the download finishes, atomically move /model-dir/model.tmp to /model-dir/model
If for some reason thousands of concurrent workers are launched, step 2 should prevent simultaneous downloads.
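A sketch of this protocol, assuming the filelock package (MODEL_DIR, the model paths, and the download callable are placeholders, not NLTK internals):

```python
import os

from filelock import FileLock

MODEL_DIR = "/model-dir"
FINAL_PATH = os.path.join(MODEL_DIR, "model")
TMP_PATH = os.path.join(MODEL_DIR, "model.tmp")

def fetch_model(download) -> None:
    lock = FileLock(TMP_PATH + ".lock")
    with lock.acquire(timeout=300):       # blocks while another process holds it
        if os.path.isfile(FINAL_PATH):
            return                        # another process already downloaded it
        download(TMP_PATH)                # write the payload to the shared tmp path
        os.replace(TMP_PATH, FINAL_PATH)  # atomic move into the final location
```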
Thanks @naktinis. Yes, I agree that once a working lock is in place to prevent useless multiple downloads, writing files atomically could be a nice complement. But the first goal is to ensure that users only download any package once. I have tried various approaches to locking, but none that worked across multiple processes or threads.
Have you tried https://pypi.org/project/filelock/ yet?
Documentation: https://py-filelock.readthedocs.io/en/latest/
Example (should work across processes; they even have a video demo in the documentation page):
```python
from filelock import Timeout, FileLock

lock = FileLock("high_ground.txt.lock")
with lock:
    with open("high_ground.txt", "a") as f:
        f.write("You were the chosen one.")
```
Btw, I'm guessing it's fine to have the lock on the tmp file itself, so we probably don't need an extra .lock file, as long as the tmp file name is shared across processes (and not uniquely generated by each process).
Thanks @naktinis, but I prefer not to add yet another dependency. Python already has an RLock mechanism that solves #3248. When you only download each file once, there is no race condition, and then I'm not sure that writing files atomically has any utility.
@naktinis, now #3327 uses a multiprocessing Manager dictionary of booleans to emulate locks. Would you consider that, to prevent your race conditions?
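A rough sketch of that idea (not the actual #3327 code; names are illustrative): a Manager-backed dict of booleans records which packages are done, and a Manager lock serializes the check-and-download, so each package is fetched at most once per parent process.

```python
import multiprocessing

import nltk

def ensure_download(pkg, done, guard):
    with guard:                # one downloader at a time
        if not done.get(pkg):
            nltk.download(pkg)
            done[pkg] = True   # later workers see the flag and skip

if __name__ == "__main__":
    with multiprocessing.Manager() as mgr:
        done, guard = mgr.dict(), mgr.Lock()
        procs = [multiprocessing.Process(target=ensure_download,
                                         args=("punkt_tab", done, guard))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
```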
We had to stop using nltk in production. However, if the locking prevents failures with the test script here https://github.com/nltk/nltk/issues/3248 (assuming you can reproduce the failure with the code version before the fix), then it should be fine.
@naktinis, in the current NLTK version 3.9.1, the tokenizer no longer uses pickles, so the data package to download is now punkt_tab.
So your test script in #3248 first needs to be updated to download punkt_tab instead of punkt.
But even then, without a fix, race conditions cause a variable number of LookupErrors.
After PR #3327 however, there is no race condition because the package is only downloaded once.
Great! Then I guess the only remaining case is when the downloading processes are not launched with multiprocessing from the same parent process, but independently by the operating system. But if this already solves a class of problems, then it may be good to go.
Thanks @naktinis! Indeed, in case the parallel processes are launched directly from the OS, it would be safer to write the files atomically. Or maybe some robust file lock could be implemented.