
Can't utilize multiple GPUs on Colab environment

StefanTodoran opened this issue 1 year ago · 3 comments

System Info

- `Accelerate` version: 0.27.2
- Platform: Linux-6.1.58+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 14.64 GB
- GPU type: Tesla T4
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Deploy a GCE virtual machine with more than one GPU.
  2. Connect Google Colab to that VM.
  3. Create the accelerate config with the correct number of GPUs.
  4. Restart the environment and run the code (a rough sketch of the cell is shown below).
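
For reference, this is roughly the kind of cell I run (training_loop is just a stand-in for my actual training code, and passing num_processes explicitly does not change the outcome):

import torch
from accelerate import Accelerator, notebook_launcher

def training_loop():
    accelerator = Accelerator()
    # Expected: one process per GPU. Actual: a single process on cuda:0.
    print(f"process {accelerator.process_index} of {accelerator.num_processes} on {accelerator.device}")

print(torch.cuda.device_count())  # reports the expected number of GPUs
notebook_launcher(training_loop, args=(), num_processes=4)  # still prints "Launching training on one GPU."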

Expected behavior

It doesn't seem to be possible to get accelerate to use multiple GPUs in a Google Colab environment, even when the VM that Colab is connected to has multiple GPUs. The GPUs are visible, with torch.cuda.device_count() returning the expected number of devices.
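
As a sanity check, the GPUs themselves are perfectly usable from plain PyTorch in the same notebook (run in a separate session from the actual launch, since initializing CUDA in the notebook process can interfere with multi-process spawning), so the limitation appears to be in the launcher logic rather than in the environment:

import torch

# Each visible GPU can be used directly from the notebook.
for i in range(torch.cuda.device_count()):
    x = torch.ones(1, device=f"cuda:{i}")
    print(i, torch.cuda.get_device_name(i), x.device)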

I believe the issue is in launchers.py, where we have the following problematic code:

elif in_colab:
    # No need for a distributed launch otherwise as it's either CPU or one GPU.
    if torch.cuda.is_available():
        print("Launching training on one GPU.")
    else:
        print("Launching training on one CPU.")
    function(*args)

There is very much a need for distributed launch capabilities, even in a Colab environment. I'm not sure why this branch was added in the first place: if the config specifies 4 GPUs, then accelerate should try to use 4 GPUs rather than silently ignore it. There is no indication of why it launches on one GPU, so the user has to dig into the code to figure it out.
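
A rough, untested sketch of the kind of change I have in mind: keep the single-process fallback only when fewer than two GPUs are visible, and otherwise fall through to the normal multi-process launch (the real guard would probably need a device count that avoids initializing CUDA in the parent process):

elif in_colab and torch.cuda.device_count() < 2:
    # Only fall back to a plain in-process call when there is at most one GPU.
    if torch.cuda.is_available():
        print("Launching training on one GPU.")
    else:
        print("Launching training on one CPU.")
    function(*args)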

StefanTodoran · Mar 09 '24 09:03

Colab supporting multiple GPUs is quite new, so thanks for letting us know that this is possible now. The behavior was written that way because Colab did not have this capability before.

Do note that accelerate config and notebook_launcher are not tied together in any way. If you're relying on the config, please just run !accelerate launch instead.
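
For example, from a notebook cell (train.py is just a placeholder for your own script; with no flags it picks up your saved config):

!accelerate launch --num_processes 4 train.py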

Would you like to open a PR enabling this for users?

muellerzr · Mar 11 '24 15:03

Oh, are you saying notebook_launcher ignores the accelerate config file? If so, I feel like this documentation on launching from a notebook could use a change: it's pretty confusing for the documentation to state that a config must exist "before any training can be performed", and then say that notebook_launcher should be used to begin distributed training, with no mention of accelerate launch. The associated example notebook has the same problem.

I'd be happy to work on a PR enabling this for users, but I may need some feedback as I've never contributed to any huggingface open source before. I'll try to get to it later in the week and link to the PR here.

StefanTodoran · Mar 12 '24 22:03

Yes that does indeed need updating.

Looking forward to the PR!

muellerzr · Mar 13 '24 20:03