h2o-llmstudio
[BUG] GPU ids are not checked when using cfg.yaml
🐛 Bug
Starting a new experiment from cfg.yaml fails if the number of GPUs specified in cfg.yaml exceeds the number of GPUs available on the target machine.
Within the UI, this issue does not arise, as one can only select or deselect the GPUs that are actually available.
To Reproduce
- Use e.g. the h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt config, which specifies 4 GPUs.
- Start a new experiment on a machine with fewer than 4 GPUs.
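For reference, the relevant part of such a cfg.yaml looks roughly like this (key names assumed from typical h2o-llmstudio experiment configs; the exact layout may differ):

```
environment:
    gpus:
    - '0'
    - '1'
    - '2'
    - '3'
```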
/home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
warn(msg)
CUDA SETUP: Loading binary /home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
2023-05-08 07:26:57,438 - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
2023-05-08 07:26:57,472 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-08 07:26:57,477 - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
2023-05-08 07:26:57,478 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2023-05-08 07:26:57,478 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,479 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,482 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,487 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-05-08 07:26:57,502 - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
2023-05-08 07:26:57,511 - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
2023-05-08 07:26:57,522 - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
2023-05-08 07:26:57,522 - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
2023-05-08 07:26:57,522 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
2023-05-08 07:26:57,522 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-05-08 07:26:57,522 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
2023-05-08 07:26:57,523 - ERROR: Exception occurred during H2O LLM Studio run:
Traceback (most recent call last):
File "/data/maxjeblick/PyCharmProjects/h2o-llmstudio/train_wave.py", line 106, in <module>
run(cfg=cfg)
File "/data/maxjeblick/PyCharmProjects/h2o-llmstudio/train.py", line 455, in run
torch.cuda.set_device(cfg.environment._rank)
File "/home/maxjeblick/.local/share/virtualenvs/h2o-llmstudio-07dpqO7E/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Is this something that needs to be fixed? Maybe with a slightly better error message?
It's not a crucial bug, but probably easy to improve UX, i.e. automatically restrict to GPU ids available and also show a message in the UI or logs.
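The suggested fix could be sketched as follows. This is a hypothetical helper, not actual h2o-llmstudio code; `available` stands in for `torch.cuda.device_count()` so the sketch has no torch dependency:

```python
def restrict_gpu_ids(requested_ids, available):
    """Return only the GPU ids that exist on this machine.

    requested_ids: integer GPU ids taken from cfg.yaml.
    available: number of GPUs detected; in practice this would
    come from torch.cuda.device_count().
    (Hypothetical helper, not actual h2o-llmstudio code.)
    """
    valid = [i for i in requested_ids if 0 <= i < available]
    dropped = sorted(set(requested_ids) - set(valid))
    if dropped:
        # Surface the restriction in the UI/logs instead of
        # crashing later with "invalid device ordinal".
        print(
            f"Ignoring unavailable GPU ids {dropped}; "
            f"only {available} GPU(s) detected, using {valid}."
        )
    return valid
```

Calling this before `torch.cuda.set_device(...)` would replace the opaque `RuntimeError: CUDA error: invalid device ordinal` with a clear warning.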
btw I also ran into this the other day and I was confused, so a fix would be good.
Yeah, I planned to tackle it together with this issue. It would also be a good issue for @fatihozturkh2o if you want to have a look.