daemon stops functioning
Guys,
At a certain point, the daemon() method just locks up and stops working in reporter.py:
    while not self._event.wait(0):
        self._flush_event.wait(self._wait_timeout)
        self._flush_event.clear()
        # lock state
        self._res_waiting.acquire()
        self._write()
        # wait for all reports
        if self.get_num_results() > 0:
            self.wait_for_results()
        # set empty flag only if we are not waiting for exit signal
        if not self._event.wait(0):
            self._empty_state_event.set()
        # unlock state
        self._res_waiting.release()
I've put breakpoints into this loop and attached a debugger to the process. My notebook writes lots of matplotlib plots, but at some plot something happens and this loop just stops running: no breakpoint is hit any more.
I spent a day trying to understand what the hell is going on there, but I really have no more time to struggle with this, so I am giving up and switching to another product.
Hi @ibobak,
What framework are you using? Is it multi-process?
I don't know what you mean by this. I clearly see two processes (one is the process of my notebook and the other is a fork of the notebook which appears after task initialization). I attached my debugger to both. Then I tried to understand what is happening and why, after the Nth report of a matplotlib plot, NOTHING is written to the ClearML server. Just NOTHING. By my observation, "while not self._event.wait(0)" is the thing that locks forever.
I really had to give up and set up MLflow, because I had already spent too much time debugging ClearML. I liked your product, but this instability makes it impossible to use in my company.
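When breakpoints stop being hit, one way to see where each thread is actually parked is to dump the stack of every live thread from inside the affected process. The snippet below is only a generic standard-library sketch, not part of ClearML; the helper name `dump_thread_stacks` and the `demo-daemon` thread are made up for illustration.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a {thread_name: stack_text} mapping for every live thread."""
    # Map thread identifiers to readable names where possible.
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, "unknown-%d" % ident)
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks


if __name__ == "__main__":
    # Hypothetical stand-in for a stuck background thread: it blocks on an Event.
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, name="demo-daemon", daemon=True)
    worker.start()
    for name, stack in dump_thread_stacks().items():
        print("--- %s ---\n%s" % (name, stack))
    stop.set()
    worker.join()
```

This only inspects the process it runs in, so for the forked process it would have to be executed there. On Unix, the standard `faulthandler.register` function can be set up in advance to dump all thread tracebacks when the process receives a chosen signal, which may help when the fork cannot be reached from the notebook.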
@ibobak, are you using PyTorch by any chance?
I am not using it. I am using LightGBM, but ClearML stops functioning before any LightGBM code even runs. It breaks on reporting a plot of this kind. And after this plot it doesn't report anything any more.
Do you have any sample code that can reproduce it? Up until now we've not been able to reproduce the issue. Any detail will help: OS, Python version, ClearML SDK version, etc.
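For the version details requested above, a small standard-library script along these lines could gather them in one go. It is just a sketch: the function name `collect_environment` and the default package list are assumptions chosen for this thread, not anything ClearML provides.

```python
import platform
import sys
from importlib import metadata


def collect_environment(packages=("clearml", "matplotlib", "lightgbm")):
    """Collect OS, Python and installed package versions into a dict."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print("%s: %s" % (key, value))
```

Running it in the same notebook kernel that shows the problem matters, since the kernel may use a different environment than the system Python.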