
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'

Open jldroid19 opened this issue 1 year ago • 8 comments

🐛 Bug

q.app
q.user
q.client
  report_error: True
q.events
q.args
  report_error: True

stacktrace:

Traceback (most recent call last):
  File "/workspace/./llm_studio/app_utils/handlers.py", line 78, in handle
    await home(q)
  File "/workspace/./llm_studio/app_utils/sections/home.py", line 66, in home
    stats.append(ui.stat(label="Current GPU load", value=f"{get_gpu_usage():.1f}%"))
  File "/workspace/./llm_studio/app_utils/utils.py", line 1949, in get_gpu_usage
    all_gpus = GPUtil.getGPUs()
  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'

Error: None

Git Version: fatal: not a git repository (or any of the parent directories): .git
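For context on the traceback: GPUtil.getGPUs() shells out to nvidia-smi and parses its CSV output, so when NVML cannot be initialized, nvidia-smi returns the error string instead of numbers and the int() call fails with exactly this ValueError. A minimal defensive sketch (safe_get_gpu_usage is a hypothetical helper, not part of LLM Studio) that reports 0% load instead of crashing when the GPUs become unreadable:

```python
# Hypothetical defensive wrapper around GPUtil; not part of LLM Studio.
# GPUtil raises ValueError when nvidia-smi prints an NVML error instead of CSV data.
import GPUtil


def safe_get_gpu_usage() -> float:
    """Average GPU load in percent, or 0.0 if NVML/nvidia-smi is unavailable."""
    try:
        gpus = GPUtil.getGPUs()
    except ValueError:
        # nvidia-smi answered with "Failed to initialize NVML: ..." instead of numbers
        return 0.0
    if not gpus:
        return 0.0
    return sum(gpu.load for gpu in gpus) / len(gpus) * 100.0


print(f"Current GPU load: {safe_get_gpu_usage():.1f}%")
```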

To Reproduce

I'm not sure why this is happening; it's hard to reproduce.

LLM Studio version

v1.4.0-dev

jldroid19 avatar Apr 11 '24 13:04 jldroid19

This means you have no GPUs available. Can you run `nvidia-smi` to confirm everything is fine?
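If `nvidia-smi` on the host looks fine, it may also help to run the same check from inside the container, since the traceback comes from the containerized process. A minimal sketch, assuming the nvidia-ml-py package (which provides the pynvml module) is available in the container; it queries NVML directly, the same library nvidia-smi uses:

```python
# Quick NVML sanity check; fails with the same "Failed to initialize NVML"
# class of error if the container has lost access to the GPU device nodes.
import pynvml

try:
    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    print(f"NVML OK, {count} GPU(s) visible")
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {i}: {util.gpu}% utilization")
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    print(f"NVML error: {err}")
```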

psinger avatar Apr 11 '24 13:04 psinger

[screenshots attached]

What's interesting is that the environment just suddenly drops. It's like the GPUs just disappear after a few hours of training.

jldroid19 avatar Apr 11 '24 13:04 jldroid19

This seems to be an issue with your environment/system then, unfortunately.

psinger avatar Apr 15 '24 06:04 psinger

@jldroid19 did you figure the issue out?

psinger avatar Apr 22 '24 07:04 psinger

@psinger I have not.

jldroid19 avatar Apr 22 '24 11:04 jldroid19

are you running this in docker?

psinger avatar Apr 24 '24 12:04 psinger

> are you running this in docker?

Yes, I am running it using Docker. It's strange because we can run a dataset with an expected finish time of 5 days and it will finish. We then go to start another experiment and 3 hours later the container stops, causing the experiment to fail. With a quick docker restart the app is back up and running, but the training progress that had been made is lost.

jldroid19 avatar Apr 24 '24 14:04 jldroid19

I stumbled upon this recently; it might be related: https://github.com/NVIDIA/nvidia-docker/issues/1469

https://github.com/NVIDIA/nvidia-container-toolkit/issues/465#issuecomment-2066182223

There seems to be an issue with GPUs suddenly disappearing in Docker.
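For reference, the workaround most often cited in those threads is to keep systemd cgroup management from silently revoking the container's access to the GPU device nodes, either by switching Docker to the cgroupfs cgroup driver or by setting no-cgroups = true in /etc/nvidia-container-runtime/config.toml and passing the /dev/nvidia* devices explicitly. A hedged sketch of the daemon.json variant (your existing /etc/docker/daemon.json may contain other settings that need to be preserved; verify against the linked issues before applying):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```

After editing, restart the Docker daemon (e.g. sudo systemctl restart docker) and recreate the container so the new cgroup driver takes effect.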

psinger avatar May 03 '24 11:05 psinger

Closing this for now; feel free to re-open if the issue persists, but this looks unrelated to LLM Studio.

psinger avatar Jul 11 '24 15:07 psinger