Martin.B
> and the failed pod keeps on trying to connect while the clearml task has failed: Can you verify that the pod cmd ended with `; exit 0`? This means...
Wait, now I'm confused: if it says `Reason: Completed`, why did it restart the pod? In other words, what's the difference between a completed run and an aborted/failed one from the...
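As a hedged sketch of the `; exit 0` idea discussed above (the inner command and the task id are placeholders, not the exact command the ClearML k8s glue generates):

```bash
# Illustration only: the real pod command is generated for you; this just shows the pattern.
# Because "; exit 0" is the last command, the container always finishes with exit code 0,
# so Kubernetes records the pod as Reason: Completed even if the inner task run failed.
bash -c "clearml-agent execute --id <task_id> ; exit 0"
```

Whether a pod that finished as Completed is then restarted depends on the restart policy of whatever created it (for example, `restartPolicy: Always` restarts the container regardless of its exit code), which is the distinction the follow-up question above is getting at.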
Hi @Shaked, apologies for the delayed reply here. Great news that it worked! > I do want to note one thing: this works because the scheduler creates a Kind: Pod....
Hi @klekass > This means that a GPU might be in use while an agent is trying to run the assigned task, resulting in task failure. Maybe the developers could...
> My suggestion is that instead of immediately running the experiment on GPU 0 when the previous task is finished, the agent should poll the GPU and wait until enough...
> Maybe a combination of both would be ideal? I.e. using a lock file, such that the developer can make sure his GPU is reserved, and using the GPU check,...
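As a rough wrapper-script sketch of the suggested "poll the GPU and use a lock file" combination (the GPU index, memory threshold and lock-file path below are made-up illustration values, not anything *trains-agent* provides):

```bash
#!/bin/bash
# Sketch only: wait until GPU 0 looks free (low memory use and no lock file),
# then reserve it with a lock file for the duration of the experiment.
GPU_ID=0
LOCKFILE="/tmp/gpu${GPU_ID}.lock"
MAX_USED_MB=1000   # treat the GPU as free below this much used memory

while true; do
    used_mb=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$GPU_ID")
    if [ ! -e "$LOCKFILE" ] && [ "$used_mb" -lt "$MAX_USED_MB" ]; then
        break
    fi
    sleep 30
done

touch "$LOCKFILE"
trap 'rm -f "$LOCKFILE"' EXIT   # release the reservation when the script exits
# ... launch the actual experiment here ...
```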
Hi @Mert-Ergin Yes, `horovod` is one of the special cases in *trains-agent*. Like git-based pip installs, `horovod` will be installed **last**, meaning after all the other packages are installed. The...
Hi @Mert-Ergin If you are running *trains-agent* in docker mode, the easiest is to build a docker with horovod (or take one of the pre-built ones; they have them for...
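As a hedged sketch of that route (the image tag, Dockerfile contents and image name below are examples only; pick a concrete tag from the horovod/horovod repository on Docker Hub):

```bash
# Example only: extend a pre-built Horovod image so the agent never has to compile horovod.
# The "latest" tag is a stand-in; use a specific CUDA/framework build that matches your setup.
cat > Dockerfile.horovod <<'EOF'
FROM horovod/horovod:latest
# plus whatever else your experiments need on top of the pre-built image
RUN pip install --no-cache-dir trains
EOF
docker build -t my-horovod-base -f Dockerfile.horovod .
```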
Hi @H4dr1en I can definitely feel you on this one :) So we used to use [venv_update](https://github.com/Yelp/venv-update); in theory you can still try to [use it](https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L75) (but I have...
Hi @H4dr1en I think that "Proposal 2" is something you can already achieve. This is basically building a docker, and using it as the base docker image.
```bash
trains-agent...
```
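The command above is cut off; as a hedged sketch of what such a setup can look like (the image name is an example, and the exact flags should be checked against `trains-agent daemon --help`):

```bash
# Build the environment into an image once, then run the agent in docker mode
# with that image as the default base for every task it pulls from the queue.
docker build -t my-base-image .
trains-agent daemon --queue default --docker my-base-image
```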