h2o-llmstudio
[BUG] GPU ids are not checked when using cfg.yaml
🐛 Bug
Starting a new experiment from cfg.yaml fails if the number of GPUs specified in cfg.yaml exceeds the number of GPUs available on the target machine.
Within the UI, this issue does not arise, as one can only select or deselect the GPUs that are actually available.
To Reproduce
- Use e.g. the h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt config, which specifies 4 GPUs.
- Start a new experiment on a machine with fewer than 4 GPUs.
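For reference, the relevant part of such a cfg.yaml looks roughly like this (key names assumed from typical h2o-llmstudio experiment configs; the exact layout may differ):

```
environment:
    gpus:
    - '0'
    - '1'
    - '2'
    - '3'
```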
/home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
warn(msg)
CUDA SETUP: Loading binary /home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
2023-05-08 07:26:57,438 - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
2023-05-08 07:26:57,472 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-08 07:26:57,477 - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
2023-05-08 07:26:57,478 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2023-05-08 07:26:57,478 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,479 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,482 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,487 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,502 - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
2023-05-08 07:26:57,511 - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
2023-05-08 07:26:57,522 - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
2023-05-08 07:26:57,522 - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
2023-05-08 07:26:57,522 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
2023-05-08 07:26:57,522 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
2023-05-08 07:26:57,523 - ERROR: Exception occurred during H2O LLM Studio run:
Traceback (most recent call last):
File "/data/maxjeblick/PyCharmProjects/h2o-llmstudio/train_wave.py", line 106, in <module>
run(cfg=cfg)
File "/data/maxjeblick/PyCharmProjects/h2o-llmstudio/train.py", line 455, in run
torch.cuda.set_device(cfg.environment._rank)
File "/home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Is this something that needs to be fixed? Maybe with a slightly better error message?
It's not a crucial bug, but probably easy to improve UX, i.e. automatically restrict to GPU ids available and also show a message in the UI or logs.
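The suggested fix could be sketched as follows. This is a hypothetical helper, not actual h2o-llmstudio code; `available` stands in for `torch.cuda.device_count()` so the sketch has no torch dependency:

```python
def restrict_gpu_ids(requested_ids, available):
    """Return only the GPU ids that exist on this machine.

    requested_ids: integer GPU ids taken from cfg.yaml.
    available: number of GPUs detected; in practice this would
    come from torch.cuda.device_count().
    (Hypothetical helper, not actual h2o-llmstudio code.)
    """
    valid = [i for i in requested_ids if 0 <= i < available]
    dropped = sorted(set(requested_ids) - set(valid))
    if dropped:
        # Surface the restriction in the UI/logs instead of
        # crashing later with "invalid device ordinal".
        print(
            f"Ignoring unavailable GPU ids {dropped}; "
            f"only {available} GPU(s) detected, using {valid}."
        )
    return valid
```

Calling this before `torch.cuda.set_device(...)` would replace the opaque `RuntimeError: CUDA error: invalid device ordinal` with a clear warning.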
btw I also ran into this the other day and I was confused, so a fix would be good.
Yeah, I planned to tackle it together with this issue. It would also be a good issue for @fatihozturkh2o if you want to have a look.