Martin.B
> and the failed pod keeps on trying to connect while the clearml task has failed: Can you verify that the pod cmd ended with `; exit 0`? This means...
Wait, now I'm confused: if it says `Reason: Completed`, why did it restart the pod? In other words, what's the difference between a completed run and an aborted/failed one from the...
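As a hedged sketch of the `; exit 0` idea discussed above (the inner command and the task id are placeholders, not the exact command the ClearML k8s glue generates):

```bash
# Illustration only: the real pod command is generated for you; this just shows the pattern.
# Because "; exit 0" is the last command, the container always finishes with exit code 0,
# so Kubernetes records the pod as Reason: Completed even if the inner task run failed.
bash -c "clearml-agent execute --id <task_id> ; exit 0"
```

Whether a pod that finished as Completed is then restarted depends on the restart policy of whatever created it (for example, `restartPolicy: Always` restarts the container regardless of its exit code), which is the distinction the follow-up question above is getting at.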
Hi @Shaked, apologies for the delayed reply here. Great news that it worked! > I do want to note one thing: this works because the scheduler creates a Kind: Pod....
Hi @klekass > This means that a GPU might be in use while an agent is trying to run the assigned task, resulting in task failure. Maybe the developers could...
> My suggestion is that instead of immediately running the experiment on GPU 0 when the previous task is finished, the agent should poll the GPU and wait until enough...
> Maybe a combination of both would be ideal? I.e. using a lock file, such that the developer can make sure his GPU is reserved, and using the GPU check,...
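As a rough wrapper-script sketch of the suggested "poll the GPU and use a lock file" combination (the GPU index, memory threshold and lock-file path below are made-up illustration values, not anything *trains-agent* provides):

```bash
#!/bin/bash
# Sketch only: wait until GPU 0 looks free (low memory use and no lock file),
# then reserve it with a lock file for the duration of the experiment.
GPU_ID=0
LOCKFILE="/tmp/gpu${GPU_ID}.lock"
MAX_USED_MB=1000   # treat the GPU as free below this much used memory

while true; do
    used_mb=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$GPU_ID")
    if [ ! -e "$LOCKFILE" ] && [ "$used_mb" -lt "$MAX_USED_MB" ]; then
        break
    fi
    sleep 30
done

touch "$LOCKFILE"
trap 'rm -f "$LOCKFILE"' EXIT   # release the reservation when the script exits
# ... launch the actual experiment here ...
```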
Hi @Mert-Ergin Yes, `horovod` is one of the special cases in *trains-agent*. Like git-based pip installs, `horovod` will be installed **last**, meaning after all the other packages are installed. The...
Hi @Mert-Ergin If you are running *trains-agent* in docker mode, the easiest is to build a docker with horovod (or take one of the pre-built ones; they have them for...
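As a hedged sketch of that route (the image tag, Dockerfile contents and image name below are examples only; pick a concrete tag from the horovod/horovod repository on Docker Hub):

```bash
# Example only: extend a pre-built Horovod image so the agent never has to compile horovod.
# The "latest" tag is a stand-in; use a specific CUDA/framework build that matches your setup.
cat > Dockerfile.horovod <<'EOF'
FROM horovod/horovod:latest
# plus whatever else your experiments need on top of the pre-built image
RUN pip install --no-cache-dir trains
EOF
docker build -t my-horovod-base -f Dockerfile.horovod .
```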
Hi @H4dr1en I can definitely feel you on this one :) So we used to use [venv_update](https://github.com/Yelp/venv-update); in theory you can still try to [use it](https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L75) (but I have...
Hi @H4dr1en I think that "Proposal 2" is something you can already achieve. This is basically building a docker, and using it as the base docker image.
```bash
trains-agent...
```
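The command above is cut off; as a hedged sketch of what such a setup can look like (the image name is an example, and the exact flags should be checked against `trains-agent daemon --help`):

```bash
# Build the environment into an image once, then run the agent in docker mode
# with that image as the default base for every task it pulls from the queue.
docker build -t my-base-image .
trains-agent daemon --queue default --docker my-base-image
```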