Specifying GPU rank on multi-GPU systems
Laion.ai has recently been donated access to a few 8xA100 pods, which I'm able to use. Tinkering with `cog` for inference on these is tricky because the instance is shared: whenever I log in, someone may already be using some of the GPUs. As such, it's useful to be able to specify the device globally, rather than edit every line referencing the device rank in the framework of your choosing.
The framework for me is essentially always PyTorch, which respects the `CUDA_VISIBLE_DEVICES` environment variable, letting you specify the "world" of GPUs available for reference. With `CUDA_VISIBLE_DEVICES=6,7`, physical GPUs 6 and 7 show up to PyTorch as `torch.device("cuda:0")` and `torch.device("cuda:1")`, respectively. This is most useful for multi-GPU training, but it also allows you to override the rank 0 GPU (i.e. `torch.device("cuda")`), which is commonly used throughout inference scripts to move memory around.
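For example, if only physical GPUs 6 and 7 happen to be free, a quick sanity check of the remapping looks like this (nothing here is cog-specific):

```sh
# Restrict PyTorch to physical GPUs 6 and 7; inside the process
# they are renumbered as cuda:0 and cuda:1.
CUDA_VISIBLE_DEVICES=6,7 python -c '
import torch
print(torch.cuda.device_count())      # 2
print(torch.cuda.get_device_name(0))  # name of physical GPU 6
'
```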
In an ideal world, this would be specifiable as a parameter to the prediction script. In practice, hardcoded calls to `.to("cuda")` and the equivalent `.cuda()` are common; people tend to develop on a single GPU, or on a multi-GPU machine where they have access to the rank 0 GPU, after all.
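This is exactly why the environment variable is the right lever: it works without touching the repo. A minimal sketch, assuming a hypothetical `predict.py` that hardcodes its device calls:

```sh
# Inside predict.py, a hardcoded model.cuda() or tensor.to("cuda")
# resolves to cuda:0 -- which is now physical GPU 4, the only device
# the process can see. No code changes required.
CUDA_VISIBLE_DEVICES=4 python predict.py
```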
tl;dr:
Running cog containers in a multi-GPU environment is tricky, assuming you don't want to change the repo's code.
Solutions?
- We could expose the user's environment to the Docker container. This could pose strange security issues?
- We could allow users to manually set environment variables. Perhaps the `cog` command could respect a `cog.env` file if it exists? Alternatively, a `-e` flag might work here (if it isn't already used). For instance: `cog predict -i text="my prompt" -i batch_size=4 -e "CUDA_VISIBLE_DEVICES=4"`.
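In the meantime, a workaround that skips `cog predict` entirely is to run the built image with `docker` directly, since `docker run` already accepts `-e`. A sketch, assuming an image named `cog-yourmodel` (substitute whatever `cog build` produced) and cog's default HTTP server on port 5000:

```sh
# Run the cog-built image, passing the env var ourselves.
docker run -d --gpus all -e CUDA_VISIBLE_DEVICES=4 -p 5000:5000 cog-yourmodel
# Then call cog's prediction endpoint by hand.
curl -X POST http://localhost:5000/predictions \
  -H 'Content-Type: application/json' \
  -d '{"input": {"text": "my prompt", "batch_size": 4}}'
```

Alternatively, `docker run --gpus '"device=4"'` hides the other GPUs at the container level rather than via CUDA.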
What do you think?
Hi @afiaka87,
When I use the `-e` flag, it doesn't work. Could you explain more about how you specify the GPU rank on multi-GPU systems?
Hi @afiaka87,
When I was trying to run this model on a multi-GPU system, I got this error:

```
Starting Docker image cog-looptest-base and running setup()...
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
```
Experiencing the same issue, but with a different model:

```
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
```
@allenhung1025 were you able to resolve it?
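For anyone landing here with that `could not select device driver` error: it's usually Docker-side, not cog-side. Docker can't find the NVIDIA runtime, which typically means the NVIDIA Container Toolkit isn't installed (or Docker wasn't restarted after installing it). A sketch of the usual fix on Ubuntu, assuming NVIDIA's apt repository is already configured:

```sh
# Install the runtime hook that lets Docker honor --gpus, then restart Docker.
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# Sanity check: should print nvidia-smi output from inside a container
# (substitute any CUDA base image tag available to you).
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```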