
Specifying GPU rank on multi-GPU systems

Open afiaka87 opened this issue 2 years ago • 2 comments

Laion.ai was recently donated access to a few 8xA100 pods, which I'm able to use.

Tinkering with cog for inference on these is tricky because the instance is shared: whenever I log in, someone may already be using some of the GPUs. It's therefore useful to be able to specify the device globally, rather than edit every line that references a device rank in the framework of your choosing.

For me the framework is essentially always PyTorch, which respects the CUDA_VISIBLE_DEVICES environment variable, letting you specify the "world" of GPUs available for use. With CUDA_VISIBLE_DEVICES=6,7, physical GPUs 6 and 7 show up to PyTorch as torch.device("cuda:0") and torch.device("cuda:1"), respectively. This is most useful for multi-GPU training, but it also lets you override the rank-0 GPU (i.e. torch.device("cuda")), which inference scripts commonly use to move memory around.
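The remapping described above can be sketched without touching a GPU. The helper below (illustrative only, not part of cog or torch) maps the logical index PyTorch would see back to the physical GPU index:

```python
import os


def visible_to_physical(logical_index: int) -> int:
    """Map a logical CUDA device index (what PyTorch sees) back to the
    physical GPU index, based on CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration; not part of cog or torch.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No restriction: logical and physical indices coincide.
        return logical_index
    physical = [int(d) for d in visible.split(",") if d.strip()]
    return physical[logical_index]


os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"
print(visible_to_physical(0))  # physical GPU 6 appears as torch.device("cuda:0")
print(visible_to_physical(1))  # physical GPU 7 appears as torch.device("cuda:1")
```

Note that CUDA honors the variable at context creation, so it must be set before the process (or at least the CUDA runtime) initializes.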

In an ideal world, the device would be specifiable as a parameter to the prediction script. In practice, hardcoded calls to .to("cuda") and the equivalent .cuda() are common; people tend to develop on a single GPU, or on a multi-GPU machine where they have access to the rank-0 GPU, after all.
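One workaround inside the model code itself is to read the device from the environment with a fallback, instead of hardcoding it. A minimal sketch, where MODEL_DEVICE is an invented variable name rather than any cog or torch convention:

```python
import os


def get_device() -> str:
    # Fall back to the usual rank-0 default ("cuda") when nothing is set.
    # MODEL_DEVICE is an illustrative convention, not part of cog or torch.
    return os.environ.get("MODEL_DEVICE", "cuda")


# Replaces hardcoded calls like model.to("cuda"):
# model.to(get_device())
print(get_device())
```

This still requires editing the repo once, which is exactly what the proposal below tries to avoid.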

tl;dr:

Running cog containers in a multi-GPU environment is tricky if you don't want to change the repo's code.

Solutions?

We could expose the user's environment to the Docker container, though this could pose strange security issues.

We could allow users to manually set environment variables.

Perhaps the cog command could respect a cog.env file if one exists. Alternatively, a -e flag might work here (if it isn't already taken). For instance:

cog predict -i text="my prompt" -i batch_size=4 -e "CUDA_VISIBLE_DEVICES=4"
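The cog.env idea could be as simple as KEY=VALUE lines. A minimal parser, purely a sketch of the proposal above and not an existing cog feature:

```python
from pathlib import Path


def parse_cog_env(path: str) -> dict:
    """Parse a hypothetical cog.env file of KEY=VALUE lines.

    Sketch of the proposal only; cog does not currently support this.
    Blank lines and '#' comments are skipped; surrounding quotes on
    values are stripped.
    """
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env
```

The resulting dict would then be passed through to the container, the same way a -e flag would be.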

What do you think?

afiaka87 avatar May 03 '22 18:05 afiaka87

Hi @afiaka87, when I use the -e flag, it doesn't work. Could you explain more about how you specify the GPU rank on multi-GPU systems?

allenhung1025 avatar May 23 '22 08:05 allenhung1025

Hi @afiaka87, when I was trying to run this model on a multi-GPU system, I got this error: Starting Docker image cog-looptest-base and running setup()... docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]. [Screenshot: Screen Shot 2022-05-23 at 4 21 05 PM]

allenhung1025 avatar May 23 '22 08:05 allenhung1025

Experiencing the same issue (docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]), but with a different model.

@allenhung1025 were you able to resolve it?

Shivam010 avatar Nov 12 '22 20:11 Shivam010