Deadlocks happening with version `1.4.0`
After upgrading clearml to 1.4.0, our code started producing deadlocks that had not happened before. The part of our code caught in the deadlock was inside `multiprocessing.Pool()`.
We use Python 3.8 and TensorFlow 2.8, and we use ClearML for experiment tracking only: we create `Task` objects manually and report scalars using `clearml.Logger.report_scalar()`.
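Roughly, the usage pattern looks like the sketch below (names and structure are illustrative, not our actual code):

```python
# Illustrative sketch of the setup (not the actual project code): a manually
# created Task, a multiprocessing.Pool, and scalar reporting via the ClearML Logger.
import os
from multiprocessing import Pool

from clearml import Logger, Task


def work(step):
    # Runs in a forked worker process; reports a scalar back to the task.
    value = step * step
    Logger.current_logger().report_scalar(
        title="metrics", series="value", value=value, iteration=step
    )
    return value


if __name__ == "__main__":
    task = Task.init(project_name="example", task_name="pool-deadlock-sketch")
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(work, range(10))
```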
Disabling clearml solved the issue. We pinpointed https://github.com/allegroai/clearml/commit/7625de3f2fec0eb641024ce7ca70a7d31083fa23 as the culprit; downgrading to the commit before it solved the problem.
We noticed that you patch the OS forking mechanism 👇, which may cause weird issues like ours. https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/clearml/binding/environ_bind.py#L61
After commenting out this line, https://github.com/allegroai/clearml/blob/bca9a6de3095f411ae5b766d00967535a13e8401/clearml/task.py#L628-L629 the deadlock did not happen anymore. Disclaimer: we did not test this thoroughly; the purpose was only to get a quick feeling for whether this hypothesis has merit.
You probably have reasons for patching the OS forking mechanism; however, from reading the comments in that part of the code, it is not entirely clear what root cause you're solving.
Hi @alekpikl,
You're absolutely right - our apologies 🙏 🙁 We do have an internal test that should have caught it... I'll check why our automated testing pipeline failed to detect that.
We'll release an RC as soon as possible and will update here.
How important is os.fork patching for experiment tracking? Would it be feasible to have an option to disable it, perhaps at the cost of some functionality? Just a thought, but I feel like environments and tools like e.g. TensorFlow are already quite fragile, and patching low-level OS stuff is somewhat dangerous; we would like to avoid it unless it's really needed.
Hi @alekpikl, @vlad-ivanov-name,
Updating that a new ClearML SDK v1.4.1rc0 is out with a fix for this issue, please update and let us know if the issue is resolved 🙏
Hi @vlad-ivanov-name
> How important is os.fork patching for experiment tracking?
In short, very 🙂 The issue is that without addressing the fork, all the "original" threads and thread queues/locks are not actually available in the forked process, which means that, first, they must not be used, and second, data still needs to flow to the subprocess that actually does the reporting/uploading.
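To illustrate the underlying python behaviour with a generic sketch (not ClearML code): a background thread started in the parent simply does not exist in the forked child, so anything queued for it in the child is never consumed.

```python
# Generic illustration (not ClearML code): threads do not survive os.fork(),
# so a queue serviced by a parent-side background thread is silently dead
# in the forked child. Unix-only.
import os
import queue
import threading
import time

q = queue.Queue()

def consumer():
    while True:
        print("consumed:", q.get())

threading.Thread(target=consumer, daemon=True).start()
q.put("from parent")          # picked up by the background thread
time.sleep(0.5)

pid = os.fork()
if pid == 0:                  # child: the consumer thread was NOT copied here
    q.put("from child")       # nobody will ever consume this item
    time.sleep(0.5)
    os._exit(0)
os.waitpid(pid, 0)
```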
> Would it be feasible to have an option to disable it, perhaps at the cost of some functionality?
In theory yes, it would be possible, but then we lose all logging capabilities from sub-processes (for example, a sub-process writing to the console will not be logged).
> Just a thought, but I feel like environments and tools like e.g. TensorFlow are already quite fragile ...
Actually, from our experience the fragile part is python itself... but your point stands. Notice that the TF subprocesses are not affected by the patched fork (they are in most cases low-level C++ subprocesses, so no python code can make a difference there). PyTorch's DataLoader was the main reason we started optimizing for forked processes, because its workers are python subprocesses and you do want access to the logging capabilities from those subprocesses.
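For instance, something like this toy sketch (illustrative only, assumes PyTorch is installed): the DataLoader workers are forked python subprocesses, and their console output is exactly the kind of thing you want captured in the task log.

```python
# Toy sketch: with num_workers > 0 the DataLoader forks python worker
# processes (on Linux), and whatever they print should still reach the log.
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # On Linux this runs inside a forked python worker process.
        print(f"worker loading sample {idx}")
        return torch.tensor([idx], dtype=torch.float32)


if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=2, num_workers=2)
    for _batch in loader:
        pass
```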
Does that make sense?
@bmartinn Thank you for explaining! So would it be correct to say this patch exists to capture stdout/stderr of forked processes? I wonder if it could be achieved by creating a `pipe()` and calling `dup2()` on the stdout file descriptor (although it could be that python already does that).
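Roughly what I had in mind is something like this sketch (Unix-only, error handling omitted; not a proposal for the actual implementation):

```python
# Sketch of the pipe()/dup2() idea: point the forked child's stdout fd at a
# pipe so the parent can read whatever the child prints. Illustrative only.
import os

read_fd, write_fd = os.pipe()

pid = os.fork()
if pid == 0:                      # child
    os.close(read_fd)
    os.dup2(write_fd, 1)          # fd 1 (stdout) now writes into the pipe
    os.close(write_fd)
    print("hello from the forked process", flush=True)
    os._exit(0)

os.close(write_fd)                # parent keeps only the read end
with os.fdopen(read_fd) as pipe_reader:
    captured = pipe_reader.read()  # EOF once the child exits
print("captured:", captured, end="")
os.waitpid(pid, 0)
```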
What I'm not entirely sure about is the origin of something that might be a race condition. Before the change in question, this function had a `sleep()` call:
https://github.com/allegroai/clearml/blob/7b5c676ab270b959e61d072118e6f5d076001ee6/clearml/binding/environ_bind.py#L102
Now there's an exception catch-all clause and it just feels a bit dangerous.
I do, however, understand that this is already going a bit too deep into the implementation, so thanks again for the explanation; we'll see on our side how important that kind of logging is for us. Worst-case scenario -- maybe we can just "unpatch" the fork call :)
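(For the record, by "unpatch" I mean something along the lines of the sketch below; it assumes the SDK swaps out os.fork directly, which we have not verified.)

```python
# Hypothetical "unpatch" sketch, assuming ClearML replaces os.fork directly
# (not verified). Keep a reference to the original before ClearML runs,
# restore it once the Task has been created.
import os

_original_fork = os.fork          # grab the built-in before any patching

from clearml import Task          # noqa: E402

task = Task.init(project_name="example", task_name="no-fork-patch")

os.fork = _original_fork          # forked children now bypass the patched path
```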
> Updating that a new ClearML SDK v1.4.1rc0 is out with a fix for this issue, please update and let us know if the issue is resolved 🙏
This resolves the issue. Thank you for the fast response and the fix! 🙏
Quick update: v1.4.1 is out with a fix 😄 `pip install clearml==1.4.1`
@vlad-ivanov-name
> So would it be correct to say this patch exists to capture stdout/stderr of forked processes?
Actually it is there to support using the `Task` object from the forked process. Python is not very good with subprocesses; basically it's on the user to do any bookkeeping for background threads running on the main process, since those will not be replicated into the forked process. This means, for example, that the background-reporting thread needs to be re-created in the forked process. These kinds of things could be solved by lazy loading only when a new report is generated (i.e. console reporting or event reporting), see below for more on that.
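Conceptually, the lazy re-creation idea looks something like this simplified sketch (not the actual SDK code):

```python
# Simplified illustration (not the actual SDK code): the reporting thread is
# recreated on first use whenever the current PID differs from the PID that
# created it, so a forked child lazily gets its own working reporter.
import os
import queue
import threading


class LazyReporter:
    def __init__(self):
        self._pid = None
        self._queue = None
        self._thread = None

    def _ensure_started(self):
        # A forked child inherits these attributes but not the running thread,
        # so re-create both when the PID has changed.
        if self._pid != os.getpid():
            self._pid = os.getpid()
            self._queue = queue.Queue()
            self._thread = threading.Thread(target=self._worker, daemon=True)
            self._thread.start()

    def _worker(self):
        while True:
            event = self._queue.get()
            print(f"[pid {os.getpid()}] sending event: {event}")

    def report(self, event):
        self._ensure_started()
        self._queue.put(event)
```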
> I wonder if it could be achieved by creating a `pipe()` and calling `dup2()` ...
Actually this is not really needed, because the same "hook" on the main process will be replicated into the subprocess; the missing part is actually sending the data to the server (which is always done in a background thread/process, see the answer above).
I'm hoping that we will be able to remove the need for patching the fork call. We are now testing a new "deferred Task init" call that basically does the initial handshaking in the background; a by-product of that is that we will be able to "lazy load" the missing background-reporting thread (see above) on the forked process, without the need to patch the fork call.
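Once that lands, the idea is that usage would look roughly like the sketch below (the exact flag name may still change while this is in testing):

```python
# Sketch of the planned "deferred Task init" flow; the flag name is still
# subject to change while the feature is being tested.
from clearml import Task

task = Task.init(
    project_name="example",
    task_name="deferred-init-sketch",
    deferred_init=True,  # do the server handshake in the background
)

# ... run training / spawn worker processes ...

task.get_logger().report_scalar(
    title="metrics", series="loss", value=0.1, iteration=0
)
```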
I hope this helps; this is a complicated flow with lots of ins and outs, and I'm hoping this makes it at least a bit clearer (pun intended 😃)
@bmartinn I believe we got this issue again in version 1.6.3. Is that possible? (We removed the clearml functionality from the function and it works as expected.)
Hi @tomer-grin ! We weren't able to reproduce this issue yet. Do you have any code snippet that could cause a deadlock? Thanks!
@morsinai13 Yes:

```python
refocus_frame_and_save_patial = partial(refocus_frame_and_save, output_dump_subfolder, model_path, kwargs)
with Pool(min(len(summary_sharp_refocus_json['frames']), os.cpu_count())) as pool:
    results = pool.map(refocus_frame_and_save_patial, summary_sharp_refocus_json['frames'])
```

`refocus_frame_and_save` contains `Logger` class calls and methods (`report_vector`, `report_image`, etc.).
It is stateful, so we cannot actually reproduce it every time.
Closing this as it was already fixed in released versions. Please reopen if required.