h2o-llmstudio
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
🐛 Bug
q.app
q.user
q.client
  report_error: True
q.events
q.args
  report_error: True
stacktrace

```
Traceback (most recent call last):
  File "/workspace/./llm_studio/app_utils/handlers.py", line 78, in handle
    await home(q)
  File "/workspace/./llm_studio/app_utils/sections/home.py", line 66, in home
    stats.append(ui.stat(label="Current GPU load", value=f"{get_gpu_usage():.1f}%"))
  File "/workspace/./llm_studio/app_utils/utils.py", line 1949, in get_gpu_usage
    all_gpus = GPUtil.getGPUs()
  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
```
Error: None
Git Version: fatal: not a git repository (or any of the parent directories): .git
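For context on why this particular ValueError appears: GPUtil shells out to nvidia-smi and converts the first CSV field of each output line with int(). When NVML fails inside the container, nvidia-smi prints an error string instead of numbers, and that string reaches int(). A minimal sketch of the failure mode and a defensive alternative (parse_gpu_ids is a hypothetical helper for illustration, not LLM Studio's or GPUtil's actual code):

```python
def parse_gpu_ids(lines):
    """Parse GPU ids from nvidia-smi CSV-style lines.

    Returns [] when the driver answered with an error string
    (e.g. "Failed to initialize NVML: Unknown Error") instead of
    numeric fields, rather than raising ValueError like GPUtil does.
    """
    ids = []
    for line in lines:
        try:
            # first CSV field is the device id in nvidia-smi's CSV output
            ids.append(int(line.split(",")[0]))
        except ValueError:
            # driver/NVML error text is not an integer: treat as "no GPUs"
            return []
    return ids


print(parse_gpu_ids(["0, 35 %", "1, 12 %"]))                        # [0, 1]
print(parse_gpu_ids(["Failed to initialize NVML: Unknown Error"]))  # []
```

Catching the ValueError (or checking for an empty result) at the call site would let the home screen report "no GPUs" instead of crashing the handler.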
To Reproduce
I'm not sure why this is happening; it's hard to reproduce.
LLM Studio version
v1.4.0-dev
This means you have no GPUs available. Can you run nvidia-smi to confirm everything is fine?
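A concrete way to run that check, comparing what the host and the container see (the container name below is a placeholder, not taken from the original report):

```
# On the host: should print the usual GPU table, not an NVML error
nvidia-smi

# Inside the running container (replace <container-name> with yours)
docker exec <container-name> nvidia-smi
```

If the host command works but the in-container one returns "Failed to initialize NVML", the GPUs were lost only from the container's point of view.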
What's interesting is that the environment just suddenly drops. It's like the GPUs just disappear after a few hours of training.
This seems to be an issue with your environment/system then, unfortunately.
@jldroid19 did you figure the issue out?
@psinger I have not.
are you running this in docker?
Yes, I am running it using Docker. It's strange: we can run an experiment with an expected finish of 5 days and it completes. Then we start another experiment and, 3 hours later, the container stops, causing the experiment to fail. With a quick docker restart the app is back up and running, but the training progress is lost.
I stumbled upon this recently, might be related: https://github.com/NVIDIA/nvidia-docker/issues/1469
https://github.com/NVIDIA/nvidia-container-toolkit/issues/465#issuecomment-2066182223
There seems to be a known issue with GPUs suddenly disappearing from inside Docker containers.
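If those threads match this setup (an assumption on my part: the root cause discussed there is Docker's systemd cgroup driver revoking the container's access to the /dev/nvidia* devices after a systemctl daemon-reload), one workaround they mention is switching Docker to the cgroupfs cgroup driver in /etc/docker/daemon.json and restarting the Docker daemon:

```json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```

The same threads also discuss setting `no-cgroups = true` in /etc/nvidia-container-runtime/config.toml and passing the NVIDIA devices to `docker run` explicitly via `--device`; which mitigation applies depends on the host's Docker and nvidia-container-toolkit versions.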
Closing this for now; feel free to re-open if the issue persists, but it looks unrelated to LLM Studio.