Infinite "collect_gpu_info" and gpu not found
Describe the issue: During the Hello NAS tutorial for v3.0rc1, when I launch the experiment as described, a number of issues arise:
- Even though I set `exp.config.trial_gpu_number = 1` and `exp.config.training_service.use_active_gpu = True` (see the config sketch after this list), the logger prints "no gpu found, edit exp.config.trial_gpu_number".
- The computer slows down considerably, probably because it is training on the CPU and not the GPU as indicated. Also, checking the system monitor I can see lots of instances of "collect_gpu_info" (screenshot omitted). I don't know if this is the intended behavior.
- Killing the experiment doesn't stop all these processes.
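For reference, a minimal sketch of how those settings are applied in the tutorial. This assumes the `NasExperiment` API of NNI v3.0; `model_space`, `evaluator`, and `strategy` stand in for the objects built earlier in the tutorial and are not defined here.

```python
from nni.nas.experiment import NasExperiment

# model_space, evaluator, and strategy come from earlier tutorial steps.
exp = NasExperiment(model_space, evaluator, strategy)

exp.config.max_trial_number = 3
exp.config.trial_concurrency = 1

# The two edits this report is about:
exp.config.trial_gpu_number = 1                    # request one GPU per trial
exp.config.training_service.use_active_gpu = True  # use a GPU that already runs other processes

exp.run(port=8081)
```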
Environment:
- NNI version: 3.0rc1
- Training service (local|remote|pai|aml|etc): local
- Client OS: Ubuntu
- Server OS (for remote mode only):
- Python version: 3.10
- PyTorch/TensorFlow version: PyTorch 1.13
- Is conda/virtualenv/venv used?: yes
- Is running in Docker?: no
Log message:
- nnimanager.log:
- dispatcher.log:
- nnictl stdout and stderr:
How to reproduce it?: Hello NAS! tutorial for v3.0rc1
Looks like a bug in the GPU metric collector. @liuzhe-lz could you take a look?
Please try the script directly (`python -m nni.tools.nni_manager_scripts.collect_gpu_info`) and tell us the output.
Sure, this is the output:
{"gpuNumber": 1, "gpus": [{"index": 0, "gpuCoreUtilization": 0.01, "gpuMemoryUtilization": 0.05}], "processes": [{"pid": 2520, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 315121664}, {"pid": 2664, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 160632832}, {"pid": 3158, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 55185408}, {"pid": 6158, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 383524864}, {"pid": 6495, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 4526080}, {"pid": 10007, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 93855744}], "success": true}
What's the content of ~/nni-experiments/<EXP-ID>/log/nnimanager.log?
Seems `collect_gpu_info` is working well.
Sorry, I probably forgot to link the files in the issue. Here are the logs: experiment.log nnimanager.log
If I use the tutorial, what happens is:
[2023-06-02 10:47:01] Config is not provided. Will try to infer.
[2023-06-02 10:47:01] Using execution engine based on training service. Trial concurrency is set to 1.
[2023-06-02 10:47:01] Using simplified model format.
[2023-06-02 10:47:01] Using local training service.
[2023-06-02 10:47:01] WARNING: GPU found but will not be used. Please set experiment.config.trial_gpu_number to the number of GPUs you want to use for each trial.
[2023-06-02 10:47:01] Creating experiment, Experiment ID: 1v85b07z
[2023-06-02 10:47:02] Starting web server...
[2023-06-02 10:47:05] ERROR: Create experiment failed: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbe3cb8a860>: Failed to establish a new connection: [Errno 111] Connection refused'))
The warning "GPU found but will not be used" appears even though trial_gpu_number is set to 1. After this the PC slows down because the process is not killed even after the error (the connection port remains occupied), and a ton of zombie collect_gpu_info processes are left behind, as reported.
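Not an official workaround, but a hedged cleanup sketch for this symptom follows. It assumes `psutil` is installed, that the leftover processes can be matched by the string `collect_gpu_info` in their command line, and that 8081 is the web-server port from the log above:

```python
import socket
import psutil  # third-party: pip install psutil

# Terminate any leftover collect_gpu_info processes.
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "collect_gpu_info" in cmdline:
        print(f"terminating pid {proc.info['pid']}: {cmdline}")
        proc.terminate()

# Check whether the web-server port is still occupied.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    if s.connect_ex(("localhost", 8081)) == 0:
        print("port 8081 is still in use")
    else:
        print("port 8081 is free")
```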
Hi, I have the same issue as you: when I launch NNI v3.0rc1, an infinite number of collect_gpu_info processes appear. I tried it on different PCs, Linux and Windows, and all have the same problem. Is there a solution for it? Thanks.
Not that I know of at the moment. I'm waiting for the next release, hoping for a fix.
Please try the script directly (`python -m nni.tools.nni_manager_scripts.collect_gpu_info`) and tell us the output.
I saw that version 3.0 has been published, but I still encounter this bug. Any help?
Also encountered the same issue on the Hello NAS tutorial for v3.0 (released 21/8/2023, latest as of Sep 14). But if I run on CPU, everything works fine, with a reachable web URL as well as 3 succeeded trials. The logs are on a company computer with no internet access, so I just typed out the ERROR lines I saw in the log:
...
INFO (nni.nas.experiment.experiment) Experiment initialized successfully. Starting exploration strategy...
ERROR (nni.nas.strategy.base) Strategy failed to execute.
ERROR (Thread-5 (listen):nni.runtime.command_channel.websocket.channel) Failed to receive command. Retry in 0s
...
I have this problem too with NNI version 3.0. In my case NNI looks like it is running, as I get no errors:
$ nnictl create --config nni_config.yaml --port 8001
[2024-02-15 13:38:12] Creating experiment, Experiment ID: 20efndaz
[2024-02-15 13:38:12] Starting web server...
[2024-02-15 13:38:13] Setting up...
[2024-02-15 13:38:13] Web portal URLs: http://127.0.0.1:8001 http://172.17.0.4:8001
[2024-02-15 13:38:13] To stop experiment run "nnictl stop 20efndaz" or "nnictl stop --all"
[2024-02-15 13:38:13] Reference: https://nni.readthedocs.io/en/stable/reference/nnictl.html
but all that happens is my server fills up with nni.tools.nni_manager_scripts.collect_gpu_info processes, which I have to kill.
If I run using CPUs it seems to be fine.
I'm using TensorFlow on Debian, but I've also tried in an Ubuntu Docker image and get the same result.
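One quick check that can separate an NNI bug from an environment problem is asking the framework itself whether it sees the GPU — a minimal sketch, assuming TensorFlow as stated above:

```python
import tensorflow as tf

# If this prints 0, trials will run on the CPU no matter what NNI's
# config requests, which would match the slowdown described above.
gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```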