
torch.cuda.is_available() changes to FALSE at the second trial

[Open] rockcor opened this issue 2 years ago • 11 comments

Describe the issue: Possibly similar to issue #4853: torch.cuda.is_available() is False at the second trial but True at the first trial (checked by print). It is also True when running main.py directly. Is something wrong in my config.yml? My project works well on an old NNI version without the useActiveGpu keyword. Could that be where the problem comes from?

Environment:

  • NNI version: 2.8.0
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Ubuntu 18.04
  • Server OS (for remote mode only):
  • Python version: 3.8.0
  • PyTorch/TensorFlow version: 1.12
  • Is conda/virtualenv/venv used?: yes
  • Is running in Docker?: no

Configuration:

  • Experiment config (remember to remove secrets!):

    experimentName: ***
    searchSpaceFile: search_space.json
    trialCommand: python3 main.py
    trialCodeDirectory: .
    trialGpuNumber: 1
    trialConcurrency: 1
    maxExperimentDuration: 2h
    maxTrialNumber: 100
    tuner:
      name: GridSearch
    trainingService:
      platform: local
      useActiveGpu: True
      maxTrialNumberPerGpu: 1
  • Search space:

    {
      "split": {"_type": "choice", "_value": [4, 6]},
      "gen": {"_type": "choice", "_value": [24, 48]},
      "hidden_dim": {"_type": "choice", "_value": [32, 64]}
    }

Log message:

  • nnimanager.log:

    [2022-08-04 17:12:50] INFO (main) Start NNI manager
    [2022-08-04 17:12:50] INFO (NNIDataStore) Datastore initialization done
    [2022-08-04 17:12:50] INFO (RestServer) Starting REST server at port 10089, URL prefix: "/"
    [2022-08-04 17:12:51] WARNING (NNITensorboardManager) Tensorboard may not installed, if you want to use tensorboard, please check if tensorboard installed.
    [2022-08-04 17:12:51] INFO (RestServer) REST server started.
    [2022-08-04 17:12:51] INFO (NNIManager) Starting experiment: uenxijm9
    [2022-08-04 17:12:51] INFO (NNIManager) Setup training service...
    [2022-08-04 17:12:51] INFO (LocalTrainingService) Construct local machine training service.
    [2022-08-04 17:12:51] INFO (NNIManager) Setup tuner...
    [2022-08-04 17:12:51] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
    [2022-08-04 17:12:52] INFO (NNIManager) Add event listeners
    [2022-08-04 17:12:52] INFO (LocalTrainingService) Run local machine training service.
    [2022-08-04 17:12:52] WARNING (GPUScheduler) gpu_metrics file does not exist!
    [2022-08-04 17:12:52] INFO (NNIManager) NNIManager received command from dispatcher: ID,
    [2022-08-04 17:12:52] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"split": 4, "gen_lookback": 24, "hidden_dim": 32}, "parameter_index": 0}
    [2022-08-04 17:12:57] INFO (NNIManager) submitTrialJob: form: { sequenceId: 0, hyperParameters: { value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"split": 4, "gen_lookback": 24, "hidden_dim": 32}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
    [2022-08-04 17:13:02] INFO (NNIManager) Trial job TCGyT status changed from WAITING to RUNNING
    [2022-08-04 17:13:16] ERROR (tuner_command_channel.WebSocketChannel) Error: Error: tuner_command_channel: Tuner closed connection
        at WebSocket.<anonymous> (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:41:49)
        at WebSocket.emit (node:events:538:35)
        at WebSocket.emitClose (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
        at Socket.socketOnClose (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
        at Socket.emit (node:events:526:28)
        at TCP.<anonymous> (node:net:687:12)
    [2022-08-04 17:13:16] ERROR (NNIManager) Dispatcher error: tuner_command_channel: Tuner closed connection
    [2022-08-04 17:13:16] ERROR (NNIManager) Error: Dispatcher stream error, tuner may have crashed.
        at EventEmitter.<anonymous> (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/core/nnimanager.js:647:32)
        at EventEmitter.emit (node:events:526:28)
        at WebSocketChannelImpl.handleError (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:107:22)
        at WebSocket.<anonymous> (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:41:37)
        at WebSocket.emit (node:events:538:35)
        at WebSocket.emitClose (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
        at Socket.socketOnClose (/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
        at Socket.emit (node:events:526:28)
        at TCP.<anonymous> (node:net:687:12)
    [2022-08-04 17:13:16] INFO (NNIManager) Change NNIManager status from: RUNNING to: ERROR

  • dispatcher.log:

    [2022-08-04 17:12:52] INFO (nni.tuner.gridsearch/MainThread) Ignored optimize_mode "minimize"
    [2022-08-04 17:12:52] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
    [2022-08-04 17:12:52] INFO (nni.tuner.gridsearch/Thread-1) Grid initialized, size: (3×6×3) = 54
    [2022-08-04 17:13:14] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
    Traceback (most recent call last):
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 97, in command_queue_worker
        self.process_command(command, data)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 143, in process_command
        command_handlers[command](data)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 136, in handle_report_metric_data
        data['value'] = load(data['value'])
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/common/serializer.py", line 443, in load
        return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/json_tricks/nonp.py", line 236, in loads
        return json_loads(string, object_pairs_hook=hook, **jsonkwargs)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/json/__init__.py", line 370, in loads
        return cls(**kw).decode(s)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/json/decoder.py", line 353, in raw_decode
        obj, end = self.scan_once(s, idx)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/json_tricks/decoders.py", line 44, in __call__
        map = hook(map, properties=self.properties)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/json_tricks/utils.py", line 66, in wrapper
        return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/common/serializer.py", line 876, in _json_tricks_any_object_decode
        return _wrapped_cloudpickle_loads(b)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/nni/common/serializer.py", line 882, in _wrapped_cloudpickle_loads
        return cloudpickle.loads(b)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/storage.py", line 218, in _load_from_bytes
        return torch.load(io.BytesIO(b))
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
        return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 930, in _legacy_load
        result = unpickler.load()
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 876, in persistent_load
        wrap_storage=restore_location(obj, location),
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
        result = fn(storage, location)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 152, in _cuda_deserialize
        device = validate_cuda_device(location)
      File "/public/gsb/anaconda3/envs/pt1.12/lib/python3.8/site-packages/torch/serialization.py", line 136, in validate_cuda_device
        raise RuntimeError('Attempting to deserialize object on a CUDA '
    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
    [2022-08-04 17:13:14] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
    [2022-08-04 17:13:16] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

  • nnictl stdout and stderr:

How to reproduce it?: nnictl create --config config.yml -p 10090

rockcor avatar Aug 04 '22 09:08 rockcor

Please print os.environ['CUDA_VISIBLE_DEVICES'] to the log and tell us its value.
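For example, a minimal sketch of what to add at the top of the trial code (the exact logging style doesn't matter; both values just go to the trial's stdout):

import os
import torch

# Show which GPU(s) NNI's scheduler handed to this trial process,
# and whether PyTorch can actually see a CUDA device.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('torch.cuda.is_available() =', torch.cuda.is_available())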

liuzhe-lz avatar Aug 05 '22 08:08 liuzhe-lz

Thank you for the reply! I can see os.environ['CUDA_VISIBLE_DEVICES'] equals 0 in the stdout of the first trial, both before and after model.eval(). I added the print at the start of my main.py. Trial 2 didn't start, so there is no log for it.

rockcor avatar Aug 06 '22 15:08 rockcor

I notice placementConstraint: { type: 'None', gpus: [] } in the log; is that OK? BTW, after downgrading NNI to 2.2, it can run continuously with the TPE tuner.

rockcor avatar Aug 06 '22 15:08 rockcor

You can use

exp_config.trial_gpu_number = 1
exp_config.training_service.use_active_gpu = True

to set the GPU; NNI will then choose a GPU automatically.
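For reference, a minimal end-to-end sketch of the same settings through the Python API (this assumes NNI 2.x's nni.experiment.Experiment, as in the official quickstart; the trial command and search space below are placeholders, not necessarily the exact ones from this issue):

from nni.experiment import Experiment

experiment = Experiment('local')
experiment.config.trial_command = 'python3 main.py'    # placeholder trial command
experiment.config.trial_code_directory = '.'
experiment.config.search_space = {                     # placeholder search space
    'split': {'_type': 'choice', '_value': [4, 6]},
}
experiment.config.tuner.name = 'GridSearch'
experiment.config.trial_concurrency = 1
experiment.config.trial_gpu_number = 1                 # one GPU per trial
experiment.config.training_service.use_active_gpu = True  # allow GPUs that already run other processes
experiment.run(10090)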

Is there any code in your main.py that configures the GPU manually?

Louis-J avatar Oct 10 '22 08:10 Louis-J

Hi @rockcor, could you follow @Louis-J's suggestion and provide some more information? Thanks!

Lijiaoa avatar Oct 12 '22 01:10 Lijiaoa

I met the same error, and I tried the settings @Louis-J mentioned, but they didn't solve the problem.

I even tried setting trial_gpu_number to 9 with trial_concurrency as 1; the problem still exists.

XuHwang avatar Oct 31 '22 17:10 XuHwang

@AngusHuang17 pip install nni==2.5 may solve the problem.

rockcor avatar Nov 11 '22 04:11 rockcor

> You can use
>
> exp_config.trial_gpu_number = 1
> exp_config.training_service.use_active_gpu = True
>
> to set the GPU; NNI will then choose a GPU automatically.
>
> Is there any code in your main.py that configures the GPU manually?

Do you mean running it with the NNI Python API? Why does the command-line tool fail?

rockcor avatar Nov 11 '22 04:11 rockcor

The GPU scheduler is currently somewhat broken. We will fix it in the 3.0 release. In the meantime, a possible workaround is sketched below.
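Judging from the dispatcher traceback above (torch.load fails inside NNI's metric deserializer, which runs in a process with no GPU visibility), reporting plain Python numbers instead of CUDA tensors may avoid the crash. A minimal sketch, assuming the trial currently reports tensor metrics (loss and best_metric are hypothetical names standing in for whatever the trial computes):

import nni

# A CUDA tensor inside a reported metric gets cloudpickled and must be
# deserialized by the dispatcher, which cannot see a CUDA device.
# Converting to a plain float on the trial side sidesteps that.
nni.report_intermediate_result(loss.item())
nni.report_final_result(float(best_metric))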

liuzhe-lz avatar Nov 11 '22 08:11 liuzhe-lz

> The GPU scheduler is currently somewhat broken. We will fix it in the 3.0 release.

Got it. Thanks for your contribution.

rockcor avatar Nov 11 '22 08:11 rockcor

This bug still happens in 3.0.

AnsonAiTRAY avatar Dec 07 '23 16:12 AnsonAiTRAY