
Infinite "collect_gpu_info" processes and GPU not found

Open chachus opened this issue 1 year ago • 10 comments

Describe the issue: During the Hello NAS tutorial for v3.0rc1, when I launch the experiment as described, several issues arise:

  1. Even though I set exp.config.trial_gpu_number = 1 and exp.config.training_service.use_active_gpu = True (see the sketch just after this list), the logger prints "no gpu found, edit exp.config.trial_gpu_number".
  2. The computer slows down considerably, probably because it is training on the CPU rather than the GPU as requested. Checking the system monitor, I can also see lots of instances of "collect_gpu_info"; here is the screenshot: screenshot. I don't know if this is the intended behavior.
  3. Killing the experiment does not stop all these processes.
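
For reference, the relevant configuration lines from my script are sketched below; this is only a sketch, not the full script. exp is the NasExperiment object created earlier in the Hello NAS tutorial, and the run call with port 8081 follows the tutorial:

# Sketch of the GPU-related settings used in my run of the Hello NAS tutorial
# (v3.0rc1). `exp` is the NasExperiment created earlier in the tutorial.
exp.config.trial_gpu_number = 1                     # request one GPU per trial
exp.config.training_service.use_active_gpu = True   # the GPU also drives the display
exp.run(port=8081)                                  # port as used in the tutorial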

Environment:

  • NNI version: 3.0rc1
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Ubuntu
  • Server OS (for remote mode only):
  • Python version: 3.10
  • PyTorch/TensorFlow version: PyTorch 1.13
  • Is conda/virtualenv/venv used?: yes
  • Is running in Docker?: no

Log message:

  • nnimanager.log:
  • dispatcher.log:
  • nnictl stdout and stderr:

How to reproduce it?: Hello NAS! tutorial for v3.0rc1

chachus avatar May 23 '23 09:05 chachus

Looks like a bug from GPU metric collector. @liuzhe-lz could you take a look?

ultmaster avatar May 26 '23 02:05 ultmaster

Please try the script directly (python -m nni.tools.nni_manager_scripts.collect_gpu_info) and tell us the output.
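
If it helps, here is a rough sketch that captures both the exit code and the raw output of that module (just a convenience wrapper; it assumes you run it in the same Python environment where NNI is installed):

# Rough sketch: run the GPU metric collector directly and show its output.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "nni.tools.nni_manager_scripts.collect_gpu_info"],
    capture_output=True, text=True,
)
print("exit code:", result.returncode)
print("stdout:", result.stdout)
print("stderr:", result.stderr)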

liuzhe-lz avatar May 29 '23 02:05 liuzhe-lz

Sure, this is the output:

{
  "gpuNumber": 1,
  "gpus": [
    {"index": 0, "gpuCoreUtilization": 0.01, "gpuMemoryUtilization": 0.05}
  ],
  "processes": [
    {"pid": 2520, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 315121664},
    {"pid": 2664, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 160632832},
    {"pid": 3158, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 55185408},
    {"pid": 6158, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 383524864},
    {"pid": 6495, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 4526080},
    {"pid": 10007, "gpuIndex": 0, "type": "graphics", "usedGpuMemory": 93855744}
  ],
  "success": true
}

chachus avatar May 29 '23 08:05 chachus

What's the content of ~/nni-experiments/<EXP-ID>/log/nnimanager.log? Seems collect_gpu_info is working well.
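
If you are not sure which experiment directory to look in, this sketch (assuming the default ~/nni-experiments location) prints the tail of the most recently modified manager log:

# Sketch: print the last lines of the newest nnimanager.log under the
# default experiment directory ~/nni-experiments.
from pathlib import Path

logs = sorted(
    Path.home().glob("nni-experiments/*/log/nnimanager.log"),
    key=lambda p: p.stat().st_mtime,
)
if logs:
    latest = logs[-1]
    print(f"--- {latest} ---")
    print("\n".join(latest.read_text(errors="replace").splitlines()[-50:]))
else:
    print("no nnimanager.log found under ~/nni-experiments")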

liuzhe-lz avatar Jun 02 '23 06:06 liuzhe-lz

Sorry, I probably forgot to link the files in the issue. Here are the logs: experiment.log nnimanager.log

If I use the tutorial, what happens is:

[2023-06-02 10:47:01] Config is not provided. Will try to infer. 
[2023-06-02 10:47:01] Using execution engine based on training service. Trial concurrency is set to 1.
[2023-06-02 10:47:01] Using simplified model format. 
[2023-06-02 10:47:01] Using local training service. 
[2023-06-02 10:47:01] WARNING: GPU found but will not be used. Please set experiment.config.trial_gpu_number to the number of GPUs you want to use for each trial.
[2023-06-02 10:47:01] Creating experiment, Experiment ID: 1v85b07z
[2023-06-02 10:47:02] Starting web server...
[2023-06-02 10:47:05] ERROR: Create experiment failed: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbe3cb8a860>: Failed to establish a new connection: [Errno 111] Connection refused'))

The warning "GPU found but will not be used" appears even though trial_gpu_number is set to 1. After this, the PC slows down because the process is not killed even after the error (the connection port is still reported as occupied), and it leaves a ton of zombie collect_gpu_info processes, as reported.
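
For now the only cleanup I found is to kill the stray processes by hand, roughly like this sketch (it uses the third-party psutil package and simply matches "collect_gpu_info" in the command line):

# Sketch: find and terminate leftover collect_gpu_info processes.
# Requires the third-party psutil package (pip install psutil).
import psutil

leftover = []
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "collect_gpu_info" in cmdline:
        leftover.append(proc)

print(f"found {len(leftover)} collect_gpu_info processes")
for proc in leftover:
    proc.terminate()  # escalate to proc.kill() if they do not exit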

chachus avatar Jun 02 '23 08:06 chachus

Hi, I have the same issue as you: when I launch NNI v3.0rc1 I get an endless stream of collect_gpu_info processes. I tried it on different PCs, both Linux and Windows, and all have the same problem. Is there a solution for it? Thanks.

ferqui avatar Jul 22 '23 11:07 ferqui

Not that I know of at the moment. I'm waiting for the next release, hoping for a fix.

chachus avatar Jul 26 '23 13:07 chachus

Please try the script directly (python -m nni.tools.nni_manager_scripts.collect_gpu_info) and tell us the output.

I saw that version 3.0 has been published, but I still encounter this bug. Any help?

chachus avatar Sep 20 '23 14:09 chachus

I also encountered the same issue on the Hello NAS tutorial for v3.0 (21/8/2023, latest as of Sep 14). But if I run on CPU, everything works well, with a reachable web URL and 3 succeeded trials. The log is on a company computer with no internet access, so I typed out the ERROR lines I saw in the log:

...
INFO (nni.nas.experiment.experiment) Experiment initialized successfully. Starting exploration strategy...
ERROR (nni.nas.strategy.base) Strategy failed to execute.
ERROR (Thread-5 (listen):nni.runtime.command_channel.websocket.channel) Failed to receive command. Retry in 0s
...

levisocool avatar Nov 08 '23 08:11 levisocool

I have this problem too with NNI version 3.0. In my case NNI looks like it is running, as I get no errors:

$ nnictl create --config nni_config.yaml --port 8001
[2024-02-15 13:38:12] Creating experiment, Experiment ID: 20efndaz
[2024-02-15 13:38:12] Starting web server...
[2024-02-15 13:38:13] Setting up...
[2024-02-15 13:38:13] Web portal URLs: http://127.0.0.1:8001 http://172.17.0.4:8001
[2024-02-15 13:38:13] To stop experiment run "nnictl stop 20efndaz" or "nnictl stop --all"
[2024-02-15 13:38:13] Reference: https://nni.readthedocs.io/en/stable/reference/nnictl.html

but all that happens is my server fills up with nni.tools.nni_manager_scripts.collect_gpu_info processes which I have to kill.
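
Something like the sketch below can confirm whether the REST server behind the portal is actually answering; the endpoint path comes from the connection error reported earlier in this thread, port 8001 matches my nnictl command, and the requests package is needed:

# Sketch: probe the NNI REST server's check-status endpoint.
import requests

try:
    resp = requests.get("http://127.0.0.1:8001/api/v1/nni/check-status", timeout=5)
    print(resp.status_code, resp.text)
except requests.ConnectionError as err:
    print("REST server not reachable:", err)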

If I run using CPUs it seems to be fine.

I'm using TensorFlow on Debian, but I've also tried in an Ubuntu Docker image and get the same result.

kiramt avatar Feb 15 '24 14:02 kiramt