papermill
papermill copied to clipboard
Absolutely not thread-safe usage of cwd
This part of code is absolutely not thread-safe: https://github.com/nteract/papermill/blob/68580f79fd7cc037dd3908343e3a070c3140395b/papermill/execute.py#L91-L105
Why don't you use cwd
parameter in nbclient.client.execute
or at least copy this function?
https://github.com/jupyter/nbclient/blob/8d5fc8caa0b27b308760e5a350369720c03efc8b/nbclient/client.py#L730-L750
No need to get hostile. I happen to maintain both code bases, and at the time chdir was written in papermill the nbconvert (where nbclient came from), jupyter_client, and ipykernel dependencies were not thread or multi-process safe. Thus chdir here was written before any concurrency was possible and I haven't looked at it again since that was improved.
If I recall correctly the running dir setting in nbclient only sets the kernel's working directory (see https://github.com/jupyter/jupyter_client/blob/5.x/jupyter_client/launcher.py#L36-L37) and not the parent process where papermill has IO controls around fetching files. This chdir method at the time it was written was covering the input and output path directory management.
Given you are changing the path directory for the whole process, it would be challenging to both set the relative path for IO and make it thread safe for other IO operations. I see a couple options. 1) put a warning in the doc string for the cwd option indicating it's not thread safe, 2) perform all IO operations in isolated processes with their own working directories, or 3) compute relative paths for IO and rename IO paths using that. I'd vote for 1) and just let users be aware of the constraint. 2) has issues in that mixing multiprocessing and multithreading can lead to unrecoverable issues. 3) would technically work for components as written, but extensions would need to know that cwd isn't a actually changing the directory. I suppose one could also make a new option for recalculating IO paths without any process changes as a compromise for 3).
I'm not sure that input_path
, output_path
, stdout_file
, stderr_file
should take in account cwd
. I thought that the idea is that I get notebook from input_path
, run it as if it is in cwd
, save result to output_path
and so on. Maybe I misunderstood you? Why is it so important to start parent process in cwd
? What IO controls do you mean (is it something like stdout_file
)?
I think I understood what you mean. That's about this page of documentation: https://papermill.readthedocs.io/en/latest/extending-entry-points.html#developing-new-i-o-handlers
Yes, though I re-looked at the code involved and I believe the cwd used to impact the developing-new-i-o-handlers
but hasn't since a refactor a while back. This actually only leaves https://papermill.readthedocs.io/en/latest/extending-entry-points.html#developing-a-new-engine being affected by the cwd pattern. However, I doubt that is actually used enough in a manner where cwd would impact execution. Given that I'd actually be ok with changing the PapermillNotebookClient
accept cwd
and make execute_notebook_with_engine
clearly take a cwd for the engine to try and respect. You are right that this would help with thread safety in papermill by allow us to remove all os.chdir
calls (there's one other spot that doesn't need it as well).