[BUG] GPU ids are not checked when using cfg.yaml

Open maxjeblick opened this issue 2 years ago • 2 comments

🐛 Bug

Starting a new experiment from cfg.yaml fails if the number of GPUs specified in cfg.yaml exceeds the number of GPUs on the target machine. The issue does not show up in the UI, where only available GPUs can be selected or deselected.

To Reproduce

/home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Loading binary /home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
2023-05-08 07:26:57,438 - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
2023-05-08 07:26:57,472 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-08 07:26:57,477 - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
2023-05-08 07:26:57,478 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2023-05-08 07:26:57,478 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,479 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,482 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,487 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,502 - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
2023-05-08 07:26:57,511 - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
2023-05-08 07:26:57,522 - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
2023-05-08 07:26:57,522 - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
2023-05-08 07:26:57,522 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
2023-05-08 07:26:57,522 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
2023-05-08 07:26:57,523 - ERROR: Exception occurred during H2O LLM Studio run:
Traceback (most recent call last):
  File "/data/maxjeblick/PyCharmProjects/h2o-llmstudio/train_wave.py", line 106, in <module>
    run(cfg=cfg)
  File "/data/maxjeblick/PyCharmProjects/h2o-llmstudio/train.py", line 455, in run
    torch.cuda.set_device(cfg.environment._rank)
  File "/home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

maxjeblick, May 08 '23 07:05

Is that something that needs to be fixed? Maybe with a slightly better error message?

pascal-pfeiffer, May 08 '23 07:05

> Is that something that needs to be fixed?

It's not a crucial bug, but the UX is probably easy to improve, e.g. by automatically restricting the run to the GPU ids that are available and showing a message in the UI or logs.
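
A minimal sketch of what such a check could look like, not the project's actual implementation. The helper name `restrict_gpu_ids` and its parameters are hypothetical; it assumes the requested ids are plain integers and that the caller passes in the number of visible devices (for example from `torch.cuda.device_count()`).

```python
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)


def restrict_gpu_ids(requested: Iterable[int], num_available: int) -> List[int]:
    """Hypothetical helper: keep only GPU ids that exist on this machine.

    `requested` holds the ids read from the config file and
    `num_available` is the number of GPUs visible to the process.
    """
    requested = list(requested)
    # Valid device ordinals are 0 .. num_available - 1.
    valid = [gpu_id for gpu_id in requested if 0 <= gpu_id < num_available]
    dropped = [gpu_id for gpu_id in requested if gpu_id not in valid]

    if dropped:
        # Tell the user what was changed instead of failing later with
        # "CUDA error: invalid device ordinal".
        logger.warning(
            "GPU ids %s are not available (found %d GPUs); using %s instead.",
            dropped, num_available, valid,
        )
    if not valid:
        # Nothing usable is left, so fail early with a readable message.
        raise ValueError(
            f"None of the requested GPU ids {requested} are available "
            f"(found {num_available} GPUs)."
        )
    return valid
```

Called before the distributed processes are spawned, a check like this would turn the traceback above into either a warning in the logs or one explicit error. Whether to silently restrict the ids or to abort is a design choice; the sketch restricts when possible and aborts only when no requested GPU exists.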

maxjeblick, May 08 '23 07:05

btw I also ran into this the other day and I was confused, so a fix would be good.

psinger, Jul 11 '23 13:07

> btw I also ran into this the other day and I was confused, so a fix would be good.

Yeah, I planned to tackle it together with this issue. Also a good issue for you, @fatihozturkh2o, if you want to have a look.

maxjeblick, Jul 11 '23 15:07