Kevin Yin
> This might be because internet access is not available? c4_mini is in the repo, whereas for c4 it tries to download from the HF website. The node has connection to...
The C4 HuggingFace issues are related to multi-GPU jobs in some way.

Single GPU, works: [torchtitan_multi_node5885.txt](https://github.com/user-attachments/files/15979215/torchtitan_multi_node5885.txt)
Multi GPU, errors: [torchtitan_multi_node5886.txt](https://github.com/user-attachments/files/15979286/torchtitan_multi_node5886.txt)

I don't personally care about this HF issue, so it's up...
This issue is causing a failure to install on Ubuntu 24.04.
If I increase the metrics per step to 50-ish, ClearML becomes unusable:

```
Time: 0.0065 seconds
Time: 0.1066 seconds
Time: 0.1066 seconds
Time: 0.0063 seconds
Time: 0.1068 seconds
Time: 0.1069...
```
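For context, per-call numbers in that format can be collected with a small wrapper around each reporting call. This is an illustrative sketch, not ClearML code; `timed_call` and its usage are my own names:

```python
import time


def timed_call(fn, *args, **kwargs):
    """Time a single call and print it in the same format as the logs above.

    Hypothetical helper: `fn` would be the metric-reporting call under test.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"Time: {time.perf_counter() - start:.4f} seconds")
    return result
```

Spikes then show up as individual `Time:` lines that are an order of magnitude above the baseline.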
When I add an extra `print("flushing")` to `def add_event(self, ev):` in `reporter.py`, the time spikes go away:

```
if self._queue_size >= self._flush_threshold:
    print("flushing")
    self.flush()
```

```
Time: 0.0069 seconds
Time:...
```
```
def flush(self):
    while isinstance(self._queue, PrQueue) and self._queue.is_pending():
        sleep(0.1)
```

This explains the 0.1 sec delays. It's not from Python GC, it's ClearML's code. These delays block training synchronously, which...
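The quantized ~0.1 s spikes follow directly from that loop: polling rounds every wait up to a whole poll interval. A standalone sketch (not ClearML code) that reproduces the effect, and contrasts it with an event-based wait that wakes as soon as the work finishes:

```python
import threading
import time


def polled_wait(done, interval=0.1):
    # Mimics the flush loop above: poll a flag, sleeping 0.1 s between checks.
    start = time.perf_counter()
    while not done.is_set():
        time.sleep(interval)
    return time.perf_counter() - start


def event_wait(done):
    # Alternative: block on the event directly; no rounding to a poll interval.
    start = time.perf_counter()
    done.wait()
    return time.perf_counter() - start


def measure(waiter):
    done = threading.Event()
    # Simulate the queue draining ~5 ms later.
    threading.Timer(0.005, done.set).start()
    return waiter(done)
```

With the queue draining after ~5 ms, `measure(polled_wait)` still takes ~100 ms, while `measure(event_wait)` returns in ~5 ms.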
I bisected the overhead to these lines:

- https://github.com/clearml/clearml/blob/342e1b35f8be532acdc27d74402482a4d67a19cf/clearml/backend_interface/metrics/reporter.py#L379 costs 0.5 ms
- https://github.com/clearml/clearml/blob/342e1b35f8be532acdc27d74402482a4d67a19cf/clearml/backend_interface/metrics/reporter.py#L138 costs 2.4 ms
- https://github.com/clearml/clearml/blob/342e1b35f8be532acdc27d74402482a4d67a19cf/clearml/backend_interface/metrics/reporter.py#L160 costs 0.4 ms
- https://github.com/clearml/clearml/blob/342e1b35f8be532acdc27d74402482a4d67a19cf/clearml/backend_interface/metrics/reporter.py#L163 costs 0.12 ms, but sometimes 1000.12 ms

My method was...
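Whatever the exact method elided above, one common way to attribute cost to individual statements is to wrap each suspect line in a timing context manager. Purely illustrative; the names here are mine:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the wrapped statements took, in milliseconds.

    Hypothetical bisection helper: wrap each candidate line in
    `with timed("..."):` and compare the printed costs.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label}: {elapsed_ms:.2f} ms")
```

Usage would look like `with timed("serialize"): ...` around each candidate line, narrowing in on the ones whose printed cost dominates.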
```
sdk.development.report_use_subprocess = False
```

This works. Time went from 6.5 ms to 0.75 ms, roughly 9x faster. Skipping the `_fast_is_subprocess_alive` check is key. Overhead is down to 4%...
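For anyone applying the same mitigation: that setting lives in `clearml.conf`. A sketch of where it would sit, assuming the standard `sdk { development { ... } }` section layout:

```
sdk {
    development {
        # Report metrics from the main process instead of a forked
        # subprocess, avoiding the per-call subprocess-liveness check.
        report_use_subprocess: false
    }
}
```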
After Jake's mitigation, ClearML's overhead in a full-size transformer is 5.77 ms, which is 0.3%. This is much better. Checking some of the remaining time sources: 3.57 ms is from...
Similarly, I would like to hide a Title group for all runs in a project when I hide that Title group for one run, because I am looking for the...