Add GPU Support for rlaunch multi
There is currently no way to distribute GPUs among fireworks when running small jobs in parallel on one system.
An example: on NERSC, you get exclusive access to 1 Perlmutter node with 4 A100 GPUs. If you were to run 4 fireworks that require 1 GPU each, using rlaunch multi 4, each firework would be responsible for determining which GPU to run on. Most Python code defaults to checking CUDA_VISIBLE_DEVICES and taking either the first GPU or all GPUs, so the fireworks oversubscribe the same devices, leading to poor performance or an error.
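To illustrate the idea, here is a minimal sketch of splitting GPU IDs among sub-jobs and giving each one its own CUDA_VISIBLE_DEVICES value. The helper names partition_gpus and env_for_job are hypothetical and are not part of FireWorks or of this PR's code; the round-robin split is just one possible assignment policy.

```python
import os


def partition_gpus(gpu_ids, num_jobs):
    """Split gpu_ids round-robin into num_jobs groups (hypothetical helper)."""
    groups = [[] for _ in range(num_jobs)]
    for i, gpu in enumerate(gpu_ids):
        groups[i % num_jobs].append(gpu)
    return groups


def env_for_job(gpu_group, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES restricted
    to the GPUs in gpu_group (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_group)
    return env


# 4 GPUs shared by 4 sub-jobs: each sub-job sees exactly one device.
groups = partition_gpus([0, 1, 2, 3], 4)
envs = [env_for_job(g, base_env={}) for g in groups]
```

Each resulting dict could be passed as the env argument when starting a sub-job process. Note that if there are more sub-jobs than GPUs, some groups come out empty, and an empty CUDA_VISIBLE_DEVICES hides all GPUs from that sub-job.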
I don't believe this implementation would work for systems with non-NVIDIA/CUDA GPUs. I believe AMD devices require setting the HIP_VISIBLE_DEVICES variable, but I don't have access to any system with multiple AMD GPUs to test that.
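One way this could be made vendor-aware is to look up the environment variable name per vendor. The sketch below is an assumption, not what this PR implements: the VISIBLE_DEVICE_VARS table and the device_env helper are hypothetical names, and the AMD entry is untested for the reason given above.

```python
# Hypothetical mapping from vendor to the variable that limits device visibility.
VISIBLE_DEVICE_VARS = {
    "nvidia": "CUDA_VISIBLE_DEVICES",
    "amd": "HIP_VISIBLE_DEVICES",  # assumption: untested on real AMD hardware
}


def device_env(vendor, device_ids):
    """Build the single environment entry that restricts a sub-job to device_ids."""
    try:
        var = VISIBLE_DEVICE_VARS[vendor]
    except KeyError:
        raise ValueError(f"unsupported vendor: {vendor!r}") from None
    return {var: ",".join(str(d) for d in device_ids)}


nvidia_env = device_env("nvidia", [0, 1])
amd_env = device_env("amd", [2])
```

Keeping the variable name in a table means supporting another device type would only need a new entry, provided that device type is also controlled through an environment variable.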
This might not be the best way to implement this, but it does raise a question about whether or not there is a need for a more general way to distribute non-CPU devices (GPU and TPU) among sub-jobs.