nni
SQLITE_IOERR when setting up rest server
Describe the issue:
My program throws an SQLITE_IOERR error when setting up the REST server.
Environment:
- NNI version:2.6
- Training service (local|remote|pai|aml|etc):local
- Client OS:ubuntu 16.04
- Server OS (for remote mode only):
- Python version:3.8.0
- PyTorch/TensorFlow version:
- Is conda/virtualenv/venv used?:
- Is running in Docker?:
Configuration:
- Experiment config (remember to remove secrets!):
- trial_concurrency = 1
- tuner = 'Gridsearch'
- Search space:
- "search_space":{ "num_leaves": {"_type": "choice","_value": [20, 31]}, "learning_rate": {"_type": "choice","_value": [0.01, 0.05, 0.1, 0.2]}, "max_depth": {"_type": "choice","_value": [7, 10]} },
Log message:
- nnimanager.log:
[2022-06-23 02:11:31] ERROR (NNIManager) Dispatcher error: read ECONNRESET
[2022-06-23 02:11:31] ERROR (NNIManager) Error: Dispatcher stream error, tuner may have crashed.
at EventEmitter. (/home/chenj/.local/lib/python3.8/site-packages/nni_node/core/nnimanager.js:651:32)
at EventEmitter.emit (node:events:394:28)
at Socket. (/home/chenj/.local/lib/python3.8/site-packages/nni_node/core/ipcInterface.js:70:72)
at Socket.emit (node:events:394:28)
at emitErrorNT (node:internal/streams/destroy:193:8)
at emitErrorCloseNT (node:internal/streams/destroy:158:3)
at processTicksAndRejections (node:internal/process/task_queues:83:21)
[2022-06-23 02:11:31] INFO (NNIManager) Change NNIManager status from: STOPPING to: ERROR
- dispatcher.log:
- nnictl stdout and stderr:
How to reproduce it?:
Please check you have sufficient disk space and have permission to write in ~/nni-experiments/ directory.
Yes, I checked my disk space and the directory permissions, but I still have this problem.
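For anyone who wants to script the disk space and permission check mentioned above, here is a minimal sketch. It is not part of NNI; the function name `check_experiment_dir` and the 100 MiB threshold are assumptions chosen for illustration, and the directory is whatever experiment directory you actually use.

```python
import os
import shutil
import tempfile
from pathlib import Path


def check_experiment_dir(path, min_free_bytes=100 * 1024 * 1024):
    """Report free space and writability for a directory.

    `min_free_bytes` is an arbitrary illustrative threshold, not an NNI setting.
    """
    directory = Path(path).expanduser()
    if not directory.is_dir():
        return {"exists": False, "free_bytes": 0,
                "enough_space": False, "writable": False}

    free_bytes = shutil.disk_usage(directory).free

    # Permission bits can be misleading (read-only mounts, quotas),
    # so also try to create and remove a real temporary file.
    writable = os.access(directory, os.W_OK | os.X_OK)
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError:
        writable = False

    return {"exists": True, "free_bytes": free_bytes,
            "enough_space": free_bytes >= min_free_bytes,
            "writable": writable}


if __name__ == "__main__":
    print(check_experiment_dir("~/nni-experiments"))
```

If this reports enough space and a writable directory, the directory itself is probably not the problem and the database file is the next thing to look at.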
[2022-06-23 02:11:31] ERROR (NNIManager) Dispatcher error: read ECONNRESET
[2022-06-23 02:11:31] ERROR (NNIManager) Error: Dispatcher stream error, tuner may have crashed.
It seems something is wrong in the tuner. What's the content of dispatcher.log?
log.zip
I've attached nnimanager.log and dispatcher.log to this comment. The files are too large to paste, so I put them in a zip file.
Did the experiment finish once, and now it cannot be resumed?
You can try manually inspecting the database with sqlite3 /data/chenj/nni_vol20D_demo4/period1/db/nni.sqlite.
The path looks like a remote storage server to me. If that's the case, maybe a network fluctuation has corrupted the database file.
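If you would rather check the file from Python than from the sqlite3 shell, a rough sketch follows. It opens the database read-only and runs SQLite's built-in `PRAGMA integrity_check`; the helper name `check_sqlite_integrity` is made up for this example and is not an NNI function.

```python
import sqlite3
import sys
from pathlib import Path


def check_sqlite_integrity(db_path):
    """Return the rows from PRAGMA integrity_check, or an error message.

    A healthy database yields ["ok"].
    """
    # mode=ro opens the file read-only, so the check cannot modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error as exc:
        return [f"error: {exc}"]
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        # Raised, for example, when the file is not a valid SQLite database.
        return [f"error: {exc}"]
    finally:
        conn.close()


if __name__ == "__main__":
    # Pass the path to nni.sqlite as the first argument.
    print(check_sqlite_integrity(sys.argv[1]))
```

Anything other than a single "ok" row suggests the file is damaged or unreadable, which would fit the corruption theory above.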
Actually, I modified some source code to meet my requirements, so each time an experiment starts, the program checks whether /experiment_id/db/nni.sqlite exists and, if so, deletes it (experiment_id is a particular string after my modification). Does this have any impact on nni?
Yes, I run my program on a remote server and use SSH tunneling to view the WebUI on my local machine.
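On the delete-at-startup step described above: SQLite may keep sidecar files next to the database (`-journal`, `-wal`, `-shm`), and its documentation warns against removing a database file while a process still has it open or leaving a stale journal behind. Whether that is what happens here is only a guess. A minimal sketch of a cleanup that removes the sidecars together with the database, assuming no NNI process is using the file at that moment, could look like this (the helper `remove_sqlite_db` is hypothetical, not NNI code):

```python
from pathlib import Path

# Sidecar files SQLite may create beside the main database file.
SIDECAR_SUFFIXES = ("-journal", "-wal", "-shm")


def remove_sqlite_db(db_path):
    """Delete a SQLite database and its sidecar files.

    Returns the list of paths that were actually removed.
    Assumes no process currently has the database open.
    """
    db = Path(db_path)
    candidates = [db] + [db.with_name(db.name + s) for s in SIDECAR_SUFFIXES]
    removed = []
    for candidate in candidates:
        try:
            candidate.unlink()
            removed.append(str(candidate))
        except FileNotFoundError:
            # Nothing to clean up for this file.
            pass
    return removed
```

Calling it before the experiment is launched, rather than from inside the running manager, avoids pulling the file out from under an open connection.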