
[OSError] no space left on device

Open Waerden001 opened this issue 2 years ago • 3 comments

After running a multi-step pipeline successfully, I reran it with zero code changes. This time one (training) step, which involves multiprocessing, throws the [OSError] no space left on device error. I solved the problem by deleting files generated by ClearML in the /tmp folder. If this is expected behavior for a ClearML Pipeline, is there a way to avoid this overhead?
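To see what is actually filling the temporary directory before deleting anything, a small script like the following can help. It is a minimal sketch using only the Python standard library; the helper names `dir_size` and `largest_entries` are made up for illustration and are not part of ClearML, and it assumes the step writes to the directory returned by `tempfile.gettempdir()`.

```python
import os
import shutil
import tempfile


def dir_size(path):
    """Return the total size in bytes of the files found under path."""
    total = 0
    # Unreadable directories are skipped silently instead of raising.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is not accessible
    return total


def largest_entries(base=None, top=10):
    """Return up to `top` (size_in_bytes, path) pairs for the direct children of base."""
    base = base or tempfile.gettempdir()
    sizes = []
    with os.scandir(base) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = dir_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
            sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    tmp = tempfile.gettempdir()
    usage = shutil.disk_usage(tmp)
    print(f"{tmp}: {usage.free / 2**30:.2f} GiB free of {usage.total / 2**30:.2f} GiB")
    for size, path in largest_entries(tmp):
        print(f"{size / 2**20:10.1f} MiB  {path}")
```

Running it before and after a pipeline run shows which entries grow between runs, which also answers the question of exactly which files are safe candidates for cleanup.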

Waerden001 avatar Mar 29 '22 13:03 Waerden001

Hi @Waerden001,

What files did you delete exactly?

jkhenning avatar Mar 31 '22 08:03 jkhenning

@jkhenning I am also facing a similar issue. I suspect that setting up the container and running the applications may have written to the pod's /tmp. From what I read, /tmp is on tmpfs by default and is thus limited by the node's memory.
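Whether /tmp is really backed by tmpfs can be checked from inside the pod rather than assumed. The sketch below parses the Linux mount table; `mount_for` is a hypothetical helper name, and it assumes `/proc/mounts` is readable and that mount points contain no escaped characters such as spaces.

```python
import os


def mount_for(path, mounts_text):
    """Return (mount_point, fs_type) of the most specific mount containing path."""
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Match on whole path components so "/tmp" does not match "/tmpfoo".
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same mount point override an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as fh:
            print(mount_for(os.path.realpath("/tmp"), fh.read()))
    else:
        print("/proc/mounts not available on this system")
```

If the reported type is `tmpfs`, writes to /tmp count against memory; if it is something like `overlay`, they land on the container's writable layer and count against the node's disk instead.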

Would mounting an emptyDir volume at /tmp help? So far, I have not found a way to add one to the clearml-agent running on k8s. Any advice?

okyspace avatar Sep 16 '23 20:09 okyspace

Hi @okyspace, yes, I assume mounting /tmp would help. I suspect this is the agent's log being written there.
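For reference, the emptyDir approach amounts to adding one volume and one volume mount to the pod spec. The sketch below builds that fragment as a plain Python dict using standard Kubernetes field names (`volumes`, `emptyDir`, `sizeLimit`, `volumeMounts`, `mountPath`). The helper `with_tmp_emptydir`, the volume name, the container name, and the image are illustrative assumptions, and how to pass the resulting spec to the clearml-agent on k8s is not covered here.

```python
import json


def with_tmp_emptydir(pod_spec, size_limit=None, volume_name="tmp-scratch"):
    """Return a copy of pod_spec with an emptyDir volume mounted at /tmp in every container."""
    spec = json.loads(json.dumps(pod_spec))  # deep copy of a JSON-like dict
    # An empty dict selects the default medium, which is backed by node storage.
    empty_dir = {"sizeLimit": size_limit} if size_limit else {}
    spec.setdefault("volumes", []).append({"name": volume_name, "emptyDir": empty_dir})
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": volume_name, "mountPath": "/tmp"}
        )
    return spec


if __name__ == "__main__":
    # Hypothetical minimal pod spec, for illustration only.
    base = {"containers": [{"name": "task", "image": "python:3.10"}]}
    print(json.dumps(with_tmp_emptydir(base, size_limit="20Gi"), indent=2))
```

With the default medium, an emptyDir lives on the node's disk rather than in memory, so it would lift a tmpfs-imposed limit, but it is still worth setting `sizeLimit` so a runaway log cannot fill the node.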

jkhenning avatar Sep 18 '23 07:09 jkhenning