
Pull task only if GPU is free

klekass opened this issue 3 years ago · 6 comments

I am running several agents, with each agent assigned one GPU. When an agent is available, it pulls a task from the queue and runs the experiment, which works perfectly. In my working environment, however, we are using multi-GPU machines that are shared by multiple people, and we are not only using clearml for GPU orchestration. This means that a GPU might be in use while an agent is trying to run the assigned task, resulting in task failure. To use our GPU resources efficiently, it would be nice if the agent would pull a task only if its assigned GPU (or a minimum amount of memory) is free. Is it possible to configure the agent in such a way?

klekass avatar Jul 22 '21 13:07 klekass

Hi @klekass

This means that a GPU might be in use while an agent is trying to run the assigned task, resulting in task failure.

Maybe the developers could use clearml-session to allocate a development environment on the GPU? This will not only allow them to "lock" the GPU for development, but also helps with working remotely, with full JupyterLab/VSCode/SSH support into any container. wdyt?

it would be nice if the agent would pull a task only if its assigned GPU (or a minimum amount of memory) is free. Is it possible to configure the agent in such a way?

My main fear is that this is not stable. For example, if the agent tests for free GPU RAM and we just killed the development process (i.e. all GPU RAM is free), the agent will pick up the Task even though we are still using the GPU. How about, if you need a specific GPU, you just spin down the agent? (I mean, usually this is an interactive session, so a developer knows they need the GPU, no?)

bmartinn avatar Jul 22 '21 23:07 bmartinn

Hi @bmartinn, thanks for the reply!

Clearml-session is definitely a helpful tool. However, both options you describe require everyone in the team to use clearml. I am working in a research team where different people work on different projects, and not every project is configured to use clearml. A standard case would be someone in the team trying out some code from GitHub by ssh-ing into one of the GPU machines, cloning the code and running the experiment, without needing to configure clearml or anything else. I assume this is not uncommon in research teams. More mature projects, on the other hand, would use clearml and clearml-agent and all the features that come with them.

While it would certainly be possible to make everyone in the team turn off agents before they use a certain GPU, I think an optional agent configuration such as min_required_gpu_memory=6GB would add a layer of stability. In the example you mention, the agent currently picks up the experiment anyway; with an extra check before running the experiment, the agent would at least wait until the GPU is free, hence increasing stability. If the agent then still crashes, it would also have crashed without this additional configuration.

I am not sure if I explained it clearly enough, but I am not suggesting that the agent should take any free GPU. The agent will always use its dedicated GPU. For example, if I run an agent with clearml-agent daemon --gpus 0 --queue default, then the agent will have GPU 0 assigned to it. The agent will then pull a task from the default queue whenever one is available and run it on GPU 0. My suggestion is that, instead of immediately running the experiment on GPU 0 when the previous task is finished, the agent should poll the GPU and wait until enough GPU RAM is available.
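Something along these lines is what I have in mind. This is just a sketch using pynvml; min_required_gpu_memory is not an existing agent option, only the kind of knob I am imagining:

```python
# Sketch only: wait until the agent's assigned GPU has enough free memory
# before pulling the next task. Requires pynvml (pip install nvidia-ml-py).
import time
import pynvml

GPU_INDEX = 0                              # the GPU assigned to this agent
MIN_REQUIRED_GPU_MEMORY = 6 * 1024 ** 3    # hypothetical min_required_gpu_memory=6GB

def wait_for_free_gpu(poll_interval_sec=30):
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(GPU_INDEX)
        while pynvml.nvmlDeviceGetMemoryInfo(handle).free < MIN_REQUIRED_GPU_MEMORY:
            time.sleep(poll_interval_sec)  # GPU busy: keep waiting instead of failing the task
    finally:
        pynvml.nvmlShutdown()
```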

klekass avatar Jul 23 '21 07:07 klekass

My suggestion is that, instead of immediately running the experiment on GPU 0 when the previous task is finished, the agent should poll the GPU and wait until enough GPU RAM is available.

So the issue here is that this "check" is not "atomic". Let's assume the agent pulls a Task and is now checking for min_required_gpu_memory=6GB; if you just left for a coffee break and killed the debugger/process you were working on, it will just think you "went home" (freeing the GPU) and pull the Task... Maybe we need a more explicit way to manually "tell" the agent we are working on GPU 0. How about a "magic lock file": say you just do touch ~/clearml-agent-disable-gpu-0, and if the agent checks that the file exists (and, as a precaution, is not older than a day), the agent will just skip pulling from the queue? wdyt?
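On the agent side the check could be roughly this (illustration only; the file name is just the convention above, nothing like it exists in clearml-agent today):

```python
# Illustration of the proposed "magic lock file": if ~/clearml-agent-disable-gpu-<N>
# exists and is less than a day old, the agent skips pulling from the queue.
import os
import time

LOCK_MAX_AGE_SEC = 24 * 60 * 60  # treat lock files older than a day as stale

def gpu_locked(gpu_index):
    lock_path = os.path.expanduser("~/clearml-agent-disable-gpu-{}".format(gpu_index))
    if not os.path.isfile(lock_path):
        return False
    age = time.time() - os.path.getmtime(lock_path)
    return age < LOCK_MAX_AGE_SEC  # fresh lock file -> a developer reserved this GPU
```

The developer reserves the GPU with touch ~/clearml-agent-disable-gpu-0 and releases it by deleting the file (or just letting it go stale).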

bmartinn avatar Jul 25 '21 21:07 bmartinn

I have been thinking about this problem too, but I still think it is a good idea to poll the GPU before pulling the task (the GPU check should be done before pulling the task, otherwise the pulled task would hang until the GPU is free again, while another agent might have free resources in the meantime). Here are my two reasons:

1. Without the check, the agent would pull anyway, resulting in failure if the GPU is in use, as described in my previous response.

2. If the developer is leaving for a break and the agent steals the GPU, the developer is able to switch to a different GPU, if one is available. On the other hand, if the agent fails to run a task because a GPU is occupied, e.g. on weekends or overnight, there might be no one around to restart the experiments, and the Task is gone. Assuming multiple tasks are in the queue, the agent would keep pulling all of them, failing each one and removing them from the queue.

The solution you are suggesting is from the developer's perspective; I am thinking more from the agent's perspective. Maybe a combination of both would be ideal? I.e. using a lock file, so that the developer can make sure their GPU is reserved, and using the GPU check, so that the agent would not run experiments on an in-use GPU. wdyt?

klekass avatar Jul 26 '21 06:07 klekass

Maybe a combination of both would be ideal? I.e. using a lock file, so that the developer can make sure their GPU is reserved, and using the GPU check, so that the agent would not run experiments on an in-use GPU. wdyt?

Okay, I might be overthinking it, but could we maybe do "plugins"?! Maybe we should have a way to configure a Python callback that decides whether to pull or not? The reason I'm bringing up plugins is that there might be other logic someone would like to implement (for example, checking whether a user is logged in, or checking free system RAM instead of GPU memory). Thoughts?
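Something like this as the callback contract, purely as an illustration (the function name, its arguments and the way the agent would load it are all made up at this point):

```python
# Hypothetical pull-decision callback the agent would invoke before taking a task
# from the queue. Combines the lock-file idea with a free-GPU-memory check.
import os
import time
import pynvml

def should_pull_task(gpu_indices, min_free_bytes=6 * 1024 ** 3):
    """Return True only if every assigned GPU is unlocked and has enough free memory."""
    for idx in gpu_indices:
        lock_path = os.path.expanduser("~/clearml-agent-disable-gpu-{}".format(idx))
        if os.path.isfile(lock_path) and time.time() - os.path.getmtime(lock_path) < 86400:
            return False  # developer reserved this GPU via a fresh lock file
    pynvml.nvmlInit()
    try:
        return all(
            pynvml.nvmlDeviceGetMemoryInfo(
                pynvml.nvmlDeviceGetHandleByIndex(idx)
            ).free >= min_free_bytes
            for idx in gpu_indices
        )
    finally:
        pynvml.nvmlShutdown()
```

The agent configuration would just point at such a function, and any other logic (logged-in users, free RAM, etc.) could be plugged in the same way.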

bmartinn avatar Jul 29 '21 21:07 bmartinn

Plugins sounds great!

klekass avatar Jul 30 '21 09:07 klekass