
Data corruption on read/write disk


Hi everyone, I was wondering if anyone has encountered this behavior while working with ClearML over docker containers. This may not be directly related to ClearML.

Machine 1: A local server; I am the sole user. ClearML runs on the host machine, with a docker container used for GPU-related work.
After about a year of use I switched to ClearML for job management. Shortly after, the disk I use for read/write failed with block corruption. fsck solved this.

Machine 2: An AWS machine with four GPUs. The setup has 4 containers, each with visibility to its own GPU, each running its own clearml-agent listening to the same job queue. I am the sole user. This setup had been in use for a long time, apart from the occasional image update to the worker containers. After putting multiple experiments (~600) on the queue and letting it run overnight, I got the same corrupted file system error on the large external disk. xfs_repair solved this, but the problem recurred.
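For reference, the experiments were enqueued roughly along these lines (a minimal sketch; the template task ID, name pattern, and queue name are placeholders, not my actual values):

```python
from clearml import Task

# Placeholders -- not the actual IDs/names used on this machine.
TEMPLATE_TASK_ID = "<template-task-id>"
QUEUE_NAME = "default"

for i in range(600):
    # Clone a template experiment and push the copy onto the shared queue
    cloned = Task.clone(source_task=TEMPLATE_TASK_ID, name=f"sweep-run-{i:03d}")
    Task.enqueue(cloned, queue_name=QUEUE_NAME)
```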

Has anyone encountered this while using ClearML? While using docker containers?

Thanks.

ViisightsMoshe avatar Feb 15 '22 09:02 ViisightsMoshe

Hi @ViisightsMoshe ,

I have to admit I've never seen such behavior. This is with ClearML Agent, right (not ClearML Server, just making sure)? Are you using the ClearML Agent in docker mode? Also, do you know where the corruption occurred (i.e. what was corrupted and fixed)?

jkhenning avatar Feb 15 '22 11:02 jkhenning

Hi @jkhenning, thank you for the quick reply. The two machines that failed were both running clearml-agent. On machine 1, clearml-agent was running in docker mode. On machine 2, clearml-agent was running inside each docker container and using GPU 0 from the container's point of view (each container has visibility to one particular GPU out of {0..3}). I took care to give each agent unique folders in the clearml.conf file.
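For anyone checking a similar setup, that folder uniqueness can be verified with a naive scan of each container's config for quoted absolute paths, flagging any path that appears in more than one file (a rough sketch; the w0..w3 locations are placeholders matching my layout):

```python
import re
from collections import defaultdict
from pathlib import Path

# Placeholder config locations for the four per-container agents (w0..w3).
conf_files = [Path(f"/large_disk/.clearml/w{i}/clearml.conf") for i in range(4)]

paths_to_confs = defaultdict(set)
for conf in conf_files:
    # Naively collect every quoted absolute path mentioned in the config
    for match in re.finditer(r'["\'](/[^"\']+)["\']', conf.read_text()):
        paths_to_confs[match.group(1)].add(str(conf))

for path, confs in sorted(paths_to_confs.items()):
    if len(confs) > 1:
        print(f"shared path {path!r} appears in: {sorted(confs)}")
```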

On machine 1, fsck gave some indication of deleted/unused inodes in docker overlay2 paths. The disk was replaced (it was an old SSD drive, so it's for the best). On machine 2, xfs_repair gave no indication of what was wrong, and I do not see any sign of actual corruption after the repair.

ViisightsMoshe avatar Feb 15 '22 11:02 ViisightsMoshe

Well, in my experience this is something I've witnessed on occasion on AWS EC2 (under different workloads, not specific to ClearML server or agent), but certainly not in a repeatable fashion (usually when running lots of machines, which simply increases the likelihood of encountering a hardware error)...

jkhenning avatar Feb 15 '22 11:02 jkhenning

After leaving it for a while and returning to debug, I managed to make some progress. I'm writing down my investigation for anyone who encounters this in the future, along with some further issues left to solve.

  • In the first days of migrating to clearml I accidentally set two workers to use the same directory (i.e. /large_disk/.clearml/w0). After a few attempts at working in this configuration I caught the mistake and corrected it. This was, as far as I remember, before any disk corruption, but it is fair to assume that two workers writing to the same location may have corrupted files.
  • When I went back to running experiments en masse with clearml, I had a consistent failure: running multiple experiments results, after some large number of experiments, in disk corruption. The symptom is an "input/output error" when doing even the most basic operation on the disk, like ls, and in my case it was reversible with "xfs_repair".
  • This only happened when using the "bad" worker from the first bullet.
  • In the first experiment of the batch that failed, I noticed it always does so right after printing the clearml configuration, i.e. "docker_cmd, entry_point, working_dir = ."
  • Tracking the clearml_agent code took me to the FolderCache.copy_cached_entry() function, which copies a cached venv to the "venv-builds" folder.
  • Sure enough, I had a suspicious file there: a lock file that gave me "permission denied" when I tried to grep it. Maybe copying corrupted files throws the file system off?
  • Removing it appears to have solved my problem; see the sketch after this list for one way to scan for such stale locks.
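A rough sketch of how one might scan a worker's cache for lock files that can no longer be read (the cache root is a placeholder matching my w0 layout; run it only while no agent is active):

```python
from pathlib import Path

# Placeholder cache root for the "bad" worker mentioned above.
CACHE_ROOT = Path("/large_disk/.clearml/w0")

for lock in CACHE_ROOT.rglob("*.lock"):
    try:
        with open(lock, "rb") as f:
            f.read(1)  # a damaged entry typically fails right here (EIO / EACCES)
        print(f"ok       {lock}")
    except OSError as err:
        print(f"SUSPECT  {lock}  ({err})")
        # Uncomment to remove suspicious locks once verified:
        # lock.unlink()
```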

What baffles me now is why this happened on the ~100th experiment. Looking into it, I realized that after some experiments complete, execution hangs on the clearml configuration printout ("working_dir = ." has been echoed) for minutes, maybe tens of minutes, before continuing to echo "::: Using Cached environment" (i.e. after worker._session.print_configuration() and before worker.install_virtualenv() in worker.execute()). Watching the "du" of the /large_disk/.clearml/w0/venv-builds/3.8/ folder confirms that between experiments it removes the venv and then creates it again. The GPU utilization graph shows a long hang between experiments, starting only after some ~2h of running experiments from a large queue:

[screenshot: GPU utilization graph showing long idle gaps between experiments after ~2h]
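The "du" watching above can be scripted roughly like this (a sketch, not exactly what I ran; the path matches my w0 layout):

```python
import os
import time

# Per-worker venv build directory mentioned above (adjust to your layout).
VENV_BUILDS = "/large_disk/.clearml/w0/venv-builds/3.8"

def dir_size_mb(root: str) -> float:
    """Rough equivalent of `du -sm`, tolerating files that vanish mid-scan."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):  # walk errors are silently skipped
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file removed while the agent tears the venv down
    return total / 1e6

while True:
    print(f"{time.strftime('%H:%M:%S')}  {dir_size_mb(VENV_BUILDS):8.1f} MB")
    time.sleep(30)
```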

Currently it is working, though throughput is reduced, so what is left to discover is why the experiments after ~2h are "special".

ViisightsMoshe avatar Mar 09 '22 17:03 ViisightsMoshe