torch.cuda.is_available() changes to FALSE at the second trial
Describe the issue: Possibly similar to issue #4853. torch.cuda.is_available() is False at the second trial but True at the first trial (checked by print). It is also True when running main.py directly. Could something be wrong in my config.yml? My project works well on an older version of NNI that did not have the useActiveGpu keyword. Could that be where the problem comes from?
Environment:
- NNI version: 2.8.0
- Training service (local|remote|pai|aml|etc): local
- Client OS: Ubuntu 18.04
- Server OS (for remote mode only):
- Python version: 3.8.0
- PyTorch/TensorFlow version: 1.12
- Is conda/virtualenv/venv used?: yes
- Is running in Docker?: no
Configuration:
- Experiment config (remember to remove secrets!):
  experimentName: ***
  searchSpaceFile: search_space.json
  trialCommand: python3 main.py
  trialCodeDirectory: .
  trialGpuNumber: 1
  trialConcurrency: 1
  maxExperimentDuration: 2h
  maxTrialNumber: 100
  tuner:
    name: GridSearch
  trainingService:
    platform: local
    useActiveGpu: True
    maxTrialNumberPerGpu: 1
- Search space:
  {
    "split": {"_type": "choice", "_value": [4, 6]},
    "gen": {"_type": "choice", "_value": [24, 48]},
    "hidden_dim": {"_type": "choice", "_value": [32, 64]}
  }
Log message:
- nnimanager.log:
  [2022-08-04 17:12:50] INFO (main) Start NNI manager
  [2022-08-04 17:12:50] INFO (NNIDataStore) Datastore initialization done
  [2022-08-04 17:12:50] INFO (RestServer) Starting REST server at port 10089, URL prefix: "/"
  [2022-08-04 17:12:51] WARNING (NNITensorboardManager) Tensorboard may not installed, if you want to use tensorboard, please check if tensorboard installed.
  [2022-08-04 17:12:51] INFO (RestServer) REST server started.
  [2022-08-04 17:12:51] INFO (NNIManager) Starting experiment: uenxijm9
  [2022-08-04 17:12:51] INFO (NNIManager) Setup training service...
  [2022-08-04 17:12:51] INFO (LocalTrainingService) Construct local machine training service.
  [2022-08-04 17:12:51] INFO (NNIManager) Setup tuner...
  [2022-08-04 17:12:51] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
  [2022-08-04 17:12:52] INFO (NNIManager) Add event listeners
  [2022-08-04 17:12:52] INFO (LocalTrainingService) Run local machine training service.
  [2022-08-04 17:12:52] WARNING (GPUScheduler) gpu_metrics file does not exist!
  [2022-08-04 17:12:52] INFO (NNIManager) NNIManager received command from dispatcher: ID,
  [2022-08-04 17:12:52] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"split": 4, "gen_lookback": 24, "hidden_dim": 32}, "parameter_index": 0}
  [2022-08-04 17:12:57] INFO (NNIManager) submitTrialJob: form: { sequenceId: 0, hyperParameters: { value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"split": 4, "gen_lookback": 24, "hidden_dim": 32}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
  [2022-08-04 17:13:02] INFO (NNIManager) Trial job TCGyT status changed from WAITING to RUNNING
  [2022-08-04 17:13:16] ERROR (tuner_command_channel.WebSocketChannel) Error: Error: tuner_command_channel: Tuner closed connection
      at WebSocket.<anonymous> (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:41:49)
      at WebSocket.emit (node:events:538:35)
      at WebSocket.emitClose (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
      at Socket.socketOnClose (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
      at Socket.emit (node:events:526:28)
      at TCP.<anonymous> (node:net:687:12)
  [2022-08-04 17:13:16] ERROR (NNIManager) Dispatcher error: tuner_command_channel: Tuner closed connection
  [2022-08-04 17:13:16] ERROR (NNIManager) Error: Dispatcher stream error, tuner may have crashed.
      at EventEmitter.<anonymous> (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/core/nnimanager.js:647:32)
      at EventEmitter.emit (node:events:526:28)
      at WebSocketChannelImpl.handleError (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:107:22)
      at WebSocket.<anonymous> (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:41:37)
      at WebSocket.emit (node:events:538:35)
      at WebSocket.emitClose (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
      at Socket.socketOnClose (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
      at Socket.emit (node:events:526:28)
      at TCP.<anonymous> (node:net:687:12)
  [2022-08-04 17:13:16] INFO (NNIManager) Change NNIManager status from: RUNNING to: ERROR
- dispatcher.log:
  [2022-08-04 17:12:52] INFO (nni.tuner.gridsearch/MainThread) Ignored optimize_mode "minimize"
  [2022-08-04 17:12:52] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
  [2022-08-04 17:12:52] INFO (nni.tuner.gridsearch/Thread-1) Grid initialized, size: (3×6×3) = 54
  [2022-08-04 17:13:14] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
  Traceback (most recent call last):
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 97, in command_queue_worker
      self.process_command(command, data)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 143, in process_command
      command_handlers[command](data)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 136, in handle_report_metric_data
      data['value'] = load(data['value'])
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/common/serializer.py", line 443, in load
      return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/json_tricks/nonp.py", line 236, in loads
      return json_loads(string, object_pairs_hook=hook, **jsonkwargs)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/json/__init__.py", line 370, in loads
      return cls(**kw).decode(s)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/json/decoder.py", line 337, in decode
      obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/json/decoder.py", line 353, in raw_decode
      obj, end = self.scan_once(s, idx)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/json_tricks/decoders.py", line 44, in __call__
      map = hook(map, properties=self.properties)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/json_tricks/utils.py", line 66, in wrapper
      return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/common/serializer.py", line 876, in _json_tricks_any_object_decode
      return _wrapped_cloudpickle_loads(b)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/common/serializer.py", line 882, in _wrapped_cloudpickle_loads
      return cloudpickle.loads(b)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/storage.py", line 218, in _load_from_bytes
      return torch.load(io.BytesIO(b))
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
      return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 930, in _legacy_load
      result = unpickler.load()
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 876, in persistent_load
      wrap_storage=restore_location(obj, location),
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
      result = fn(storage, location)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 152, in _cuda_deserialize
      device = validate_cuda_device(location)
    File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 136, in validate_cuda_device
      raise RuntimeError('Attempting to deserialize object on a CUDA '
  RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
  [2022-08-04 17:13:14] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
  [2022-08-04 17:13:16] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
- nnictl stdout and stderr:
How to reproduce it?: nnictl create --config config.yml -p 10090
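[Editor's note] The traceback above shows the dispatcher (which has no GPU visible) trying to torch.load a CUDA-stored object, which happens when a trial reports a metric that is still a CUDA tensor. A minimal, hypothetical workaround on the trial side is to convert metrics to plain Python numbers before reporting; the helper name to_plain_float is made up for illustration:

```python
# Hypothetical workaround sketch: report plain Python numbers to NNI so the
# dispatcher never has to unpickle CUDA tensors.

def to_plain_float(metric):
    """Convert a metric (e.g. a 0-dim torch tensor or numpy scalar) to a float."""
    try:
        return float(metric)  # float() works on torch/numpy scalars and plain numbers
    except (TypeError, ValueError):
        return metric  # leave non-numeric metrics untouched

# In the trial code, assuming `val_loss` is a tensor living on the GPU:
# import nni
# nni.report_final_result(to_plain_float(val_loss))
```

Equivalently, `val_loss.detach().cpu().item()` achieves the same thing when the metric is known to be a torch tensor.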
Please print os.environ['CUDA_VISIBLE_DEVICES'] to the log and tell us its value.
Thank you in advance for your reply!
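[Editor's note] A minimal snippet for logging this at the top of the trial script might look like:

```python
# Log the GPU visibility the trial process actually sees. Using .get() avoids
# a KeyError when NNI has not set the variable at all.
import os

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
```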
I can see that os.environ['CUDA_VISIBLE_DEVICES'] equals 0 in the stdout of the first trial, both before and after model.eval(). I set the keyword at the start of my main.py. Trial 2 didn't start, so there is no log for it.
I also notice placementConstraint: { type: 'None', gpus: [] } in the log. Is that OK?
BTW, after downgrading NNI to 2.2, it can run continuously with the TPE tuner.
You can use
exp_config.trial_gpu_number = 1
exp_config.training_service.use_active_gpu = True
to set the GPU, and then NNI will automatically choose a GPU.
Is there any code that configures the GPU manually in your main.py?
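[Editor's note] For reference, a sketch of how those two lines fit into the NNI Python experiment API. This is illustrative only: the import is guarded so the sketch stands alone, the attribute names mirror the YAML keys from the config above, and the run() call is left commented out:

```python
# Sketch: launching the experiment via the NNI Python API instead of
# `nnictl create --config config.yml`. Illustration only, not tested here.
try:
    from nni.experiment import Experiment
except ImportError:
    Experiment = None  # nni not installed; treat the block below as pseudocode

search_space = {
    "split": {"_type": "choice", "_value": [4, 6]},
    "gen": {"_type": "choice", "_value": [24, 48]},
    "hidden_dim": {"_type": "choice", "_value": [32, 64]},
}

if Experiment is not None:
    experiment = Experiment('local')
    experiment.config.trial_command = 'python3 main.py'
    experiment.config.trial_code_directory = '.'
    experiment.config.search_space = search_space
    experiment.config.tuner.name = 'GridSearch'
    experiment.config.trial_concurrency = 1
    experiment.config.trial_gpu_number = 1
    experiment.config.training_service.use_active_gpu = True
    # experiment.run(10090)  # blocks until the experiment finishes
```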
Hi @rockcor, could you follow @Louis-J's suggestion and provide some more information? Thanks!
I meet the same error, and the setting @Louis-J mentioned didn't solve the problem. Moreover, I tried setting trial_gpu_number to 9 and trial_concurrency to 1, and the problem still exists.
@AngusHuang17 pip install nni==2.5 may solve the problem.
You can use
exp_config.trial_gpu_number = 1
exp_config.training_service.use_active_gpu = True
to set the GPU, and then NNI will automatically choose a GPU.
Is there any code that configures the GPU manually in your main.py?
Do you mean running with the NNI Python API? Why does the command-line tool fail?
The GPU scheduler is somewhat broken currently. We will solve it in the 3.0 release.
The GPU scheduler is somewhat broken currently. We will solve it in the 3.0 release.
Got it, thanks for your contribution!
This bug still happens in 3.0.