Kevin Yin
> This might be because internet access is not available? c4_mini is in the repo, whereas for c4 it tries to download from the HF website. The node has connection to...
The C4 HuggingFace issues are related to multi-GPU jobs in some way.

Single GPU, works: [torchtitan_multi_node5885.txt](https://github.com/user-attachments/files/15979215/torchtitan_multi_node5885.txt)
Multi GPU, errors: [torchtitan_multi_node5886.txt](https://github.com/user-attachments/files/15979286/torchtitan_multi_node5886.txt)

I don't personally care about this HF issue, so it's up...
This issue is causing a failure to install on Ubuntu 24.04.
If I increase the metrics per step to 50-ish, ClearML becomes unusable:

```
Time: 0.0065 seconds
Time: 0.1066 seconds
Time: 0.1066 seconds
Time: 0.0063 seconds
Time: 0.1068 seconds
Time: 0.1069...
```
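For context, per-call numbers in that format can be collected with a small wrapper around each reporting call. This is an illustrative sketch, not ClearML code; `timed_call` and its usage are my own names:

```python
import time


def timed_call(fn, *args, **kwargs):
    """Time a single call and print it in the same format as the logs above.

    Hypothetical helper: `fn` would be the metric-reporting call under test.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"Time: {time.perf_counter() - start:.4f} seconds")
    return result
```

Spikes then show up as individual `Time:` lines that are an order of magnitude above the baseline.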
When I add an extra `print("flushing")` to `def add_event(self, ev):` in `reporter.py`, the time spikes go away:

```
if self._queue_size >= self._flush_threshold:
    print("flushing")
    self.flush()
```

```
Time: 0.0069 seconds
Time:...
```
```
def flush(self):
    while isinstance(self._queue, PrQueue) and self._queue.is_pending():
        sleep(0.1)
```

This explains the 0.1 sec delays. It's not from Python GC, it's ClearML's code. These delays block training synchronously, which...
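The quantized ~0.1 s spikes follow directly from that loop: polling rounds every wait up to a whole poll interval. A standalone sketch (not ClearML code) that reproduces the effect, and contrasts it with an event-based wait that wakes as soon as the work finishes:

```python
import threading
import time


def polled_wait(done, interval=0.1):
    # Mimics the flush loop above: poll a flag, sleeping 0.1 s between checks.
    start = time.perf_counter()
    while not done.is_set():
        time.sleep(interval)
    return time.perf_counter() - start


def event_wait(done):
    # Alternative: block on the event directly; no rounding to a poll interval.
    start = time.perf_counter()
    done.wait()
    return time.perf_counter() - start


def measure(waiter):
    done = threading.Event()
    # Simulate the queue draining ~5 ms later.
    threading.Timer(0.005, done.set).start()
    return waiter(done)
```

With the queue draining after ~5 ms, `measure(polled_wait)` still takes ~100 ms, while `measure(event_wait)` returns in ~5 ms.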
I bisected the overhead to these lines:

- https://github.com/clearml/clearml/blob/342e1b35f8be532acdc27d74402482a4d67a19cf/clearml/backend_interface/metrics/reporter.py#L379 costs 0.5 ms
- https://github.com/clearml/clearml/blob/342e1b35f8be532acdc27d74402482a4d67a19cf/clearml/backend_interface/metrics/reporter.py#L138 costs 2.4 ms
- https://github.com/clearml/clearml/blob/342e1b35f8be532acdc27d74402482a4d67a19cf/clearml/backend_interface/metrics/reporter.py#L160 costs 0.4 ms
- https://github.com/clearml/clearml/blob/342e1b35f8be532acdc27d74402482a4d67a19cf/clearml/backend_interface/metrics/reporter.py#L163 costs 0.12 ms, but sometimes 1000.12 ms

My method was...
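Whatever the exact method elided above, one common way to attribute cost to individual statements is to wrap each suspect line in a timing context manager. Purely illustrative; the names here are mine:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the wrapped statements took, in milliseconds.

    Hypothetical bisection helper: wrap each candidate line in
    `with timed("..."):` and compare the printed costs.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label}: {elapsed_ms:.2f} ms")
```

Usage would look like `with timed("serialize"): ...` around each candidate line, narrowing in on the ones whose printed cost dominates.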
```
sdk.development.report_use_subprocess = False
```

This works. Time went from 6.5 ms to 0.75 ms, roughly 9x faster. Skipping the `_fast_is_subprocess_alive` check is key. Overhead is down to 4%...
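For anyone applying the same mitigation: that setting lives in `clearml.conf`. A sketch of where it would sit, assuming the standard `sdk { development { ... } }` section layout:

```
sdk {
    development {
        # Report metrics from the main process instead of a forked
        # subprocess, avoiding the per-call subprocess-liveness check.
        report_use_subprocess: false
    }
}
```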
After Jake's mitigation, ClearML's overhead in a full-size transformer is 5.77 ms, which is 0.3%. This is much better. Checking some of the remaining time sources: 3.57 ms is from...
Similarly, I would like to hide a Title group for all runs in a project when I hide that Title group for one run, because I am looking for the...