papermill icon indicating copy to clipboard operation
papermill copied to clipboard

Absolutely not thread-safe usage of cwd

Open CrafterKolyan opened this issue 5 years ago • 4 comments

This part of code is absolutely not thread-safe: https://github.com/nteract/papermill/blob/68580f79fd7cc037dd3908343e3a070c3140395b/papermill/execute.py#L91-L105

Why don't you use cwd parameter in nbclient.client.execute or at least copy this function? https://github.com/jupyter/nbclient/blob/8d5fc8caa0b27b308760e5a350369720c03efc8b/nbclient/client.py#L730-L750

CrafterKolyan avatar Feb 16 '20 21:02 CrafterKolyan

No need to get hostile. I happen to maintain both code bases, and at the time chdir was written in papermill the nbconvert (where nbclient came from), jupyter_client, and ipykernel dependencies were not thread or multi-process safe. Thus chdir here was written before any concurrency was possible and I haven't looked at it again since that was improved.

If I recall correctly the running dir setting in nbclient only sets the kernel's working directory (see https://github.com/jupyter/jupyter_client/blob/5.x/jupyter_client/launcher.py#L36-L37) and not the parent process where papermill has IO controls around fetching files. This chdir method at the time it was written was covering the input and output path directory management.

Given you are changing the path directory for the whole process, it would be challenging to both set the relative path for IO and make it thread safe for other IO operations. I see a couple options. 1) put a warning in the doc string for the cwd option indicating it's not thread safe, 2) perform all IO operations in isolated processes with their own working directories, or 3) compute relative paths for IO and rename IO paths using that. I'd vote for 1) and just let users be aware of the constraint. 2) has issues in that mixing multiprocessing and multithreading can lead to unrecoverable issues. 3) would technically work for components as written, but extensions would need to know that cwd isn't a actually changing the directory. I suppose one could also make a new option for recalculating IO paths without any process changes as a compromise for 3).

MSeal avatar Feb 16 '20 22:02 MSeal

I'm not sure that input_path, output_path, stdout_file, stderr_file should take in account cwd. I thought that the idea is that I get notebook from input_path, run it as if it is in cwd, save result to output_path and so on. Maybe I misunderstood you? Why is it so important to start parent process in cwd? What IO controls do you mean (is it something like stdout_file)?

CrafterKolyan avatar Feb 16 '20 22:02 CrafterKolyan

I think I understood what you mean. That's about this page of documentation: https://papermill.readthedocs.io/en/latest/extending-entry-points.html#developing-new-i-o-handlers

CrafterKolyan avatar Feb 17 '20 07:02 CrafterKolyan

Yes, though I re-looked at the code involved and I believe the cwd used to impact the developing-new-i-o-handlers but hasn't since a refactor a while back. This actually only leaves https://papermill.readthedocs.io/en/latest/extending-entry-points.html#developing-a-new-engine being affected by the cwd pattern. However, I doubt that is actually used enough in a manner where cwd would impact execution. Given that I'd actually be ok with changing the PapermillNotebookClient accept cwd and make execute_notebook_with_engine clearly take a cwd for the engine to try and respect. You are right that this would help with thread safety in papermill by allow us to remove all os.chdir calls (there's one other spot that doesn't need it as well).

MSeal avatar Feb 17 '20 19:02 MSeal