SQLITE_IOERR when setting up rest server

Open ANFANGERMI opened this issue 2 years ago • 6 comments

Describe the issue: My program throws an SQLITE_IOERR error when setting up the REST server. (Screenshot attached.)

Environment:

  • NNI version:2.6
  • Training service (local|remote|pai|aml|etc):local
  • Client OS:ubuntu 16.04
  • Server OS (for remote mode only):
  • Python version:3.8.0
  • PyTorch/TensorFlow version:
  • Is conda/virtualenv/venv used?:
  • Is running in Docker?:

Configuration:

  • Experiment config (remember to remove secrets!):
  • trial_concurrency = 1
  • tuner = 'Gridsearch'
  • Search space:
  • "search_space":{ "num_leaves": {"_type": "choice","_value": [20, 31]}, "learning_rate": {"_type": "choice","_value": [0.01, 0.05, 0.1, 0.2]}, "max_depth": {"_type": "choice","_value": [7, 10]} },

Log message:

  • nnimanager.log:
    [2022-06-23 02:11:31] ERROR (NNIManager) Dispatcher error: read ECONNRESET
    [2022-06-23 02:11:31] ERROR (NNIManager) Error: Dispatcher stream error, tuner may have crashed.
        at EventEmitter.<anonymous> (/home/chenj/.local/lib/python3.8/site-packages/nni_node/core/nnimanager.js:651:32)
        at EventEmitter.emit (node:events:394:28)
        at Socket.<anonymous> (/home/chenj/.local/lib/python3.8/site-packages/nni_node/core/ipcInterface.js:70:72)
        at Socket.emit (node:events:394:28)
        at emitErrorNT (node:internal/streams/destroy:193:8)
        at emitErrorCloseNT (node:internal/streams/destroy:158:3)
        at processTicksAndRejections (node:internal/process/task_queues:83:21)
    [2022-06-23 02:11:31] INFO (NNIManager) Change NNIManager status from: STOPPING to: ERROR
  • dispatcher.log:
  • nnictl stdout and stderr:

How to reproduce it?:

ANFANGERMI avatar Jun 23 '22 02:06 ANFANGERMI

Please check that you have sufficient disk space and write permission for the ~/nni-experiments/ directory.
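A quick way to run both checks from Python is sketched below. This is only an illustration, not part of NNI: the helper name check_writable_dir and the 100 MiB threshold are assumptions chosen for the example.

```python
import shutil
import tempfile
from pathlib import Path


def check_writable_dir(path, min_free_bytes=100 * 1024 * 1024):
    """Return (ok, message) for a directory's free space and writability.

    Hypothetical helper for illustration; the 100 MiB default is arbitrary.
    """
    p = Path(path).expanduser()
    if not p.is_dir():
        return False, f"{p} does not exist or is not a directory"

    # Free space on the filesystem that holds the directory.
    free = shutil.disk_usage(p).free
    if free < min_free_bytes:
        return False, f"only {free} bytes free under {p}"

    # Creating (and automatically deleting) a temp file proves write access.
    try:
        with tempfile.NamedTemporaryFile(dir=p):
            pass
    except OSError as exc:
        return False, f"cannot write to {p}: {exc}"

    return True, f"{p} is writable, {free} bytes free"
```

Calling check_writable_dir("~/nni-experiments") returns a flag and a short message. If the experiment directory was moved elsewhere, pass that path instead.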

liuzhe-lz avatar Jun 24 '22 08:06 liuzhe-lz

Yes, I checked my disk space and the directory permissions, but I still have this problem.

ANFANGERMI avatar Jun 25 '22 08:06 ANFANGERMI

[2022-06-23 02:11:31] ERROR (NNIManager) Dispatcher error: read ECONNRESET
[2022-06-23 02:11:31] ERROR (NNIManager) Error: Dispatcher stream error, tuner may have crashed.

It seems something is wrong in the tuner. What's the content of dispatcher.log?

liuzhe-lz avatar Jun 26 '22 21:06 liuzhe-lz

log.zip: I have attached nnimanager.log and dispatcher.log to this comment. The files are too large to paste, so I put them in a zip file.

ANFANGERMI avatar Jun 27 '22 02:06 ANFANGERMI

The experiment was once finished and now cannot be resumed? You can try manually viewing the database with sqlite3 /data/chenj/nni_vol20D_demo4/period1/db/nni.sqlite. The path looks like a remote storage server to me. If that's the case, a network fluctuation may have corrupted the database file.
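If the sqlite3 command-line tool is not installed, roughly the same check can be done with Python's built-in sqlite3 module. This is a minimal sketch: the function name check_sqlite_file is made up for the example, and it assumes the path contains no characters that need escaping in a file: URI.

```python
import sqlite3


def check_sqlite_file(db_path):
    """Run SQLite's built-in integrity check and return its result rows.

    Hypothetical helper for illustration. A healthy database yields ["ok"].
    A missing or unreadable file raises sqlite3.OperationalError, and a file
    that is not a valid database raises sqlite3.DatabaseError.
    """
    # Open read-only through a URI so the check never creates or alters the file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

For the path mentioned above, the call would be check_sqlite_file("/data/chenj/nni_vol20D_demo4/period1/db/nni.sqlite"). Anything other than a single "ok" row, or an exception, suggests the file is damaged or cannot be read from that storage.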

liuzhe-lz avatar Jun 27 '22 03:06 liuzhe-lz

Actually, I modified some of the source code to meet my requirements, so each time an experiment starts, the program checks whether /experiment_id/db/nni.sqlite exists and, if so, deletes it (experiment_id is a fixed string after my modification). Does this have any impact on NNI? Yes, I run my program on a remote server and use SSH tunneling to view the WebUI on my local machine.

ANFANGERMI avatar Jun 27 '22 05:06 ANFANGERMI