Dispatcher stream error, tuner may have crashed (Error on first time)

Open koklimabc opened this issue 4 years ago • 4 comments

Environment: Ubuntu 20.04

  • NNI version: 2.1
  • NNI mode (local|remote|pai):
  • Client OS: Debian
  • Server OS (for remote mode only): N/A
  • Python version: 3.7
  • PyTorch/TensorFlow version: 1.8.0 & 2.4.1
  • Is conda/virtualenv/venv used?: N/A
  • Is running in Docker?: N/A

Log message:

  • nnimanager.log:

[2021-03-25 12:07:53] INFO [ 'Datastore initialization done' ]
[2021-03-25 12:07:53] INFO [ 'RestServer start' ]
[2021-03-25 12:07:53] INFO [ 'Construct local machine training service.' ]
[2021-03-25 12:07:53] INFO [ 'RestServer base port is 8080' ]
[2021-03-25 12:07:53] INFO [ 'Rest server listening on: http://0.0.0.0:8080' ]
[2021-03-25 12:07:53] INFO [ 'NNIManager setClusterMetadata, key: trial_config, value: {"command":"python3 mnist-keras.py","codeDir":"/home/pi/Downloads/nni_sample/nni/examples/trials/mnist-keras/.","gpuNum":0}' ]
[2021-03-25 12:07:53] INFO [ 'required GPU number is 0' ]
[2021-03-25 12:07:53] INFO [ 'Starting experiment: 236L9qwk' ]
[2021-03-25 12:07:53] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[2021-03-25 12:07:53] INFO [ 'Add event listeners' ]
[2021-03-25 12:07:53] ERROR [ 'Dispatcher error: This socket has been ended by the other party' ]
[2021-03-25 12:07:53] ERROR [ 'Error: Dispatcher stream error, tuner may have crashed.\n at EventEmitter.dispatcher.onError (/usr/local/lib/python3.7/dist-packages/nni_node/core/nnimanager.js:550:32)\n at EventEmitter.emit (events.js:198:13)\n at Socket.IpcInterface.outgoingStream.on (/usr/local/lib/python3.7/dist-packages/nni_node/core/ipcInterface.js:42:72)\n at Socket.emit (events.js:198:13)\n at Socket.writeAfterFIN [as write] (net.js:399:8)\n at IpcInterface.sendCommand (/usr/local/lib/python3.7/dist-packages/nni_node/core/ipcInterface.js:49:38)\n at NNIManager.sendInitTunerCommands (/usr/local/lib/python3.7/dist-packages/nni_node/core/nnimanager.js:558:25)\n at NNIManager.run (/usr/local/lib/python3.7/dist-packages/nni_node/core/nnimanager.js:523:14)\n at NNIManager.startExperiment (/usr/local/lib/python3.7/dist-packages/nni_node/core/nnimanager.js:135:14)' ]
[2021-03-25 12:07:53] INFO [ 'Change NNIManager status from: RUNNING to: ERROR' ]
[2021-03-25 12:07:53] WARNING [ 'Commands jammed in buffer!' ]
[2021-03-25 12:07:53] INFO [ 'Run local machine training service.' ]

  • dispatcher.log:
  • Neural Network Intelligence
  • nnictl stdout and stderr:

cd /home/User/Downloads/nni_sample/nni/examples/trials/mnist-keras
nnictl create --config config.yml

Problem: How can I fix this issue? I'm still learning the basics of NNI. Thank you.

koklimabc avatar Mar 25 '21 04:03 koklimabc

Do you mean dispatcher.log contains <!doctype html><title>... stuff? Could you upload dispatcher.log as attachment?

liuzhe-lz avatar Mar 29 '21 02:03 liuzhe-lz

Hello @koklimabc, could you upgrade NNI to the latest version and try again? If the issue persists, please upload dispatcher.log as an attachment. Thank you!

kvartet avatar Jun 10 '21 13:06 kvartet

I'm using the latest version of NNI, but encountered a similar error.

[2021-11-29 09:43:54] INFO (NNIDataStore) Datastore initialization done
[2021-11-29 09:43:54] INFO (RestServer) RestServer start
[2021-11-29 09:43:54] WARNING (NNITensorboardManager) Tensorboard may not installed, if you want to use tensorboard, please check if tensorboard installed.
[2021-11-29 09:43:54] INFO (RestServer) RestServer base port is 8080
[2021-11-29 09:43:54] INFO (main) Rest server listening on: http://0.0.0.0:8080
[2021-11-29 09:43:55] INFO (NNIManager) Starting experiment: 3mZT9tIn
[2021-11-29 09:43:55] INFO (NNIManager) Setup training service...
[2021-11-29 09:43:55] INFO (LocalTrainingService) Construct local machine training service.
[2021-11-29 09:43:55] INFO (NNIManager) Setup tuner...
[2021-11-29 09:43:55] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2021-11-29 09:43:55] INFO (NNIManager) Add event listeners
[2021-11-29 09:43:55] INFO (LocalTrainingService) Run local machine training service.
[2021-11-29 09:43:55] ERROR (NNIManager) Dispatcher error: read ECONNRESET
[2021-11-29 09:43:55] ERROR (NNIManager) Error: Dispatcher stream error, tuner may have crashed.
    at EventEmitter.<anonymous> (/home/biopharm/llf/anaconda3/envs/pytorch/lib/python3.8/site-packages/nni_node/core/nnimanager.js:650:32)
    at EventEmitter.emit (node:events:394:28)
    at Socket.<anonymous> (/home/biopharm/llf/anaconda3/envs/pytorch/lib/python3.8/site-packages/nni_node/core/ipcInterface.js:70:72)
    at Socket.emit (node:events:394:28)
    at emitErrorNT (node:internal/streams/destroy:193:8)
    at emitErrorCloseNT (node:internal/streams/destroy:158:3)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)
[2021-11-29 09:43:55] INFO (NNIManager) Change NNIManager status from: RUNNING to: ERROR

Here is the dispatcher.log file. dispatcher.log

@kvartet @liuzhe-lz

Chenwf1025 avatar Nov 29 '21 01:11 Chenwf1025

@Chenwf1025 @koklimabc - are you still facing this issue with the latest release of NNI?

scarlett2018 avatar Aug 15 '22 08:08 scarlett2018