
daemon stops functioning

Open ibobak opened this issue 2 years ago • 5 comments

Guys,

At a certain point, the daemon() method in reporter.py just locks up and stops working:

    while not self._event.wait(0):
        self._flush_event.wait(self._wait_timeout)
        self._flush_event.clear()
        # lock state
        self._res_waiting.acquire()
        self._write()
        # wait for all reports
        if self.get_num_results() > 0:
            self.wait_for_results()
        # set empty flag only if we are not waiting for exit signal
        if not self._event.wait(0):
            self._empty_state_event.set()
        # unlock state
        self._res_waiting.release()

I've put breakpoints into this loop and attached a debugger to the process. My notebook writes lots of matplotlib plots, but at some plot something happens and this loop simply stops running: no breakpoint is hit any more.
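For anyone seeing the same symptom, when breakpoints stop firing it can help to check where each thread is actually blocked. Below is a minimal sketch that uses only the Python standard library; `dump_thread_stacks` is a hypothetical helper name, not part of ClearML, and it would have to be called from inside the affected process (for example from a notebook cell or a debugger console).

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a text dump of the current call stack of every live thread.

    Hypothetical debugging helper; it is not part of ClearML.
    """
    # Map thread identifiers to readable names where they are known.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):" % (names.get(ident, "unknown"), ident))
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)
```

Calling `print(dump_thread_stacks())` in each of the two processes shows whether a thread running the loop above still exists and, if it does, which call it is waiting in.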

I spent a day trying to understand what the hell is going on there, but I really have no more time to struggle with this, so I am giving up and switching to another product.

ibobak avatar Apr 05 '22 09:04 ibobak

Hi @ibobak,

What framework are you using? Is it multi-process?

jkhenning avatar Apr 05 '22 10:04 jkhenning

I don't know what you mean by this. I clearly see two processes (one is the process of my notebook, and the other is a fork of the notebook that appears after task initialization). I attached my debugger to both. Then I tried to understand what is happening and why, after the Nth report of a matplotlib plot, NOTHING is written to the ClearML server. Just NOTHING. By my observation, "while not self._event.wait(0)" locks forever.

I really had to give up and set up mlflow, because too much time was spent debugging ClearML. I liked your product, but this instability makes it impossible to use in my company.

ibobak avatar Apr 05 '22 16:04 ibobak

@ibobak are you using pytorch by any chance?

jkhenning avatar Apr 05 '22 19:04 jkhenning

I am not using it. I am using LightGBM, but clearml stops functioning before any LightGBM code even runs. It breaks on reporting a plot of this kind, and after this plot it doesn't report anything any more.

image

ibobak avatar Apr 07 '22 10:04 ibobak

Do you have any sample code that can reproduce it? Up until now we've not been able to reproduce the issue. Any detail will help: OS, Python version, ClearML SDK version, etc.
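One possible way to gather those details in a single step is sketched below, assuming Python 3.8 or newer (for `importlib.metadata`). The function name `environment_report` and the default package list are illustrative choices, not something the ClearML SDK provides.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("clearml", "matplotlib", "lightgbm")):
    """Return OS, Python, and installed package versions as plain text.

    Hypothetical helper; the package list is only an assumption based on
    the libraries mentioned in this thread.
    """
    lines = [
        "OS: " + platform.platform(),
        "Python: " + sys.version.split()[0],
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append("%s: %s" % (name, version))
    return "\n".join(lines)


print(environment_report())
```

Running it in the same notebook kernel that shows the problem, then pasting the output into the issue, covers the OS, Python version, and SDK version in one go.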

jkhenning avatar Apr 07 '22 19:04 jkhenning