nni
All trials under a Chinese-named folder failed when running the MNIST example (PyTorch)
Describe the issue: I ran the MNIST PyTorch example, but all trials failed.
Environment:
- NNI version: 2.80
- Training service (local|remote|pai|aml|etc): local
- Client OS: Windows 11
- Server OS (for remote mode only): None
- Python version: 3.9.7
- PyTorch version: torch 1.10.2 (using the Anaconda built-in Python interpreter)
- Is conda/virtualenv/venv used?: conda env (base)
- Is running in Docker?: None
Trial error log: there is no trial log in the trials directory. `{"error":"File not found: C:\Users\唐勇强\nni-experiments\jnl15sro\trials\XJ3S3\trial.log"}`
Log message:
-
nnimanager.log: `[2022-08-25 23:39:41] INFO (main) Start NNI manager [2022-08-25 23:39:41] INFO (NNIDataStore) Datastore initialization done [2022-08-25 23:39:41] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/" [2022-08-25 23:39:41] INFO (RestServer) REST server started. [2022-08-25 23:39:42] INFO (NNIManager) Starting experiment: jnl15sro [2022-08-25 23:39:42] INFO (NNIManager) Setup training service... [2022-08-25 23:39:42] INFO (LocalTrainingService) Construct local machine training service. [2022-08-25 23:39:42] INFO (NNIManager) Setup tuner... [2022-08-25 23:39:42] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING [2022-08-25 23:39:43] INFO (NNIManager) Add event listeners [2022-08-25 23:39:43] INFO (LocalTrainingService) Run local machine training service. [2022-08-25 23:39:43] INFO (NNIManager) NNIManager received command from dispatcher: ID, [2022-08-25 23:39:43] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.9625271719636488}, "parameter_index": 0} [2022-08-25 23:39:48] INFO (NNIManager) submitTrialJob: form: { sequenceId: 0, hyperParameters: { value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.9625271719636488}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:39:58] INFO (NNIManager) Trial job XJ3S3 status changed from WAITING to FAILED [2022-08-25 23:39:58] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 512, "lr": 0.0001, "momentum": 0.17049052927161668}, "parameter_index": 0} [2022-08-25 23:40:03] INFO (NNIManager) submitTrialJob: form: { sequenceId: 1, hyperParameters: { value: '{"parameter_id": 
1, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 512, "lr": 0.0001, "momentum": 0.17049052927161668}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:40:08] INFO (NNIManager) Trial job YZUnM status changed from WAITING to FAILED [2022-08-25 23:40:08] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 1024, "lr": 0.0001, "momentum": 0.4641360565997277}, "parameter_index": 0} [2022-08-25 23:40:13] INFO (NNIManager) submitTrialJob: form: { sequenceId: 2, hyperParameters: { value: '{"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 1024, "lr": 0.0001, "momentum": 0.4641360565997277}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:40:19] INFO (NNIManager) Trial job VZJ4V status changed from WAITING to FAILED [2022-08-25 23:40:19] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 3, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 256, "lr": 0.1, "momentum": 0.09260907090425263}, "parameter_index": 0} [2022-08-25 23:40:24] INFO (NNIManager) submitTrialJob: form: { sequenceId: 3, hyperParameters: { value: '{"parameter_id": 3, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 256, "lr": 0.1, "momentum": 0.09260907090425263}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:40:29] INFO (NNIManager) Trial job IwZ3y status changed from WAITING to FAILED [2022-08-25 23:40:29] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 4, "parameter_source": "algorithm", "parameters": {"batch_size": 64, "hidden_size": 256, "lr": 0.01, "momentum": 0.6022108230948574}, "parameter_index": 0} [2022-08-25 
23:40:34] INFO (NNIManager) submitTrialJob: form: { sequenceId: 4, hyperParameters: { value: '{"parameter_id": 4, "parameter_source": "algorithm", "parameters": {"batch_size": 64, "hidden_size": 256, "lr": 0.01, "momentum": 0.6022108230948574}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:40:39] INFO (NNIManager) Trial job ffqVg status changed from WAITING to FAILED [2022-08-25 23:40:39] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 5, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 128, "lr": 0.0001, "momentum": 0.9069749674371951}, "parameter_index": 0} [2022-08-25 23:40:44] INFO (NNIManager) submitTrialJob: form: { sequenceId: 5, hyperParameters: { value: '{"parameter_id": 5, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 128, "lr": 0.0001, "momentum": 0.9069749674371951}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:40:49] INFO (NNIManager) Trial job YQ54k status changed from WAITING to FAILED [2022-08-25 23:40:49] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 6, "parameter_source": "algorithm", "parameters": {"batch_size": 64, "hidden_size": 512, "lr": 0.01, "momentum": 0.4989949571130713}, "parameter_index": 0} [2022-08-25 23:40:54] INFO (NNIManager) submitTrialJob: form: { sequenceId: 6, hyperParameters: { value: '{"parameter_id": 6, "parameter_source": "algorithm", "parameters": {"batch_size": 64, "hidden_size": 512, "lr": 0.01, "momentum": 0.4989949571130713}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:40:59] INFO (NNIManager) Trial job O7UIq status changed from WAITING to FAILED [2022-08-25 23:41:00] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 7, "parameter_source": "algorithm", "parameters": {"batch_size": 
32, "hidden_size": 1024, "lr": 0.01, "momentum": 0.11755641505627679}, "parameter_index": 0} [2022-08-25 23:41:05] INFO (NNIManager) submitTrialJob: form: { sequenceId: 7, hyperParameters: { value: '{"parameter_id": 7, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 1024, "lr": 0.01, "momentum": 0.11755641505627679}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:41:10] INFO (NNIManager) Trial job S975K status changed from WAITING to FAILED [2022-08-25 23:41:10] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 8, "parameter_source": "algorithm", "parameters": {"batch_size": 16, "hidden_size": 256, "lr": 0.0001, "momentum": 0.23273431921548116}, "parameter_index": 0} [2022-08-25 23:41:15] INFO (NNIManager) submitTrialJob: form: { sequenceId: 8, hyperParameters: { value: '{"parameter_id": 8, "parameter_source": "algorithm", "parameters": {"batch_size": 16, "hidden_size": 256, "lr": 0.0001, "momentum": 0.23273431921548116}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:41:16] ERROR (NNIRestHandler) Error: File not found: C:\Users\唐勇强\nni-experiments\jnl15sro\trials\XJ3S3\stderr at LocalTrainingService.getTrialFile (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\training_service\local\localTrainingService.js:146:19) at NNIManager.getTrialFile (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\core\nnimanager.js:333:37) at D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\rest_server\restHandler.js:284:29 at Layer.handle [as handle_request] (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\layer.js:95:5) at next (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\route.js:137:13) at Route.dispatch (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\route.js:112:3) at 
Layer.handle [as handle_request] (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\layer.js:95:5) at D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:281:22 at param (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:360:14) at param (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:371:14) [2022-08-25 23:41:20] INFO (NNIManager) Trial job i1Qt1 status changed from WAITING to FAILED [2022-08-25 23:41:20] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 9, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 256, "lr": 0.001, "momentum": 0.9310719853939587}, "parameter_index": 0} [2022-08-25 23:41:25] INFO (NNIManager) submitTrialJob: form: { sequenceId: 9, hyperParameters: { value: '{"parameter_id": 9, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 256, "lr": 0.001, "momentum": 0.9310719853939587}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } } [2022-08-25 23:41:30] INFO (NNIManager) Trial job GKOE7 status changed from WAITING to FAILED [2022-08-25 23:41:30] INFO (NNIManager) Change NNIManager status from: RUNNING to: NO_MORE_TRIAL [2022-08-25 23:41:30] INFO (NNIManager) Change NNIManager status from: NO_MORE_TRIAL to: DONE [2022-08-25 23:41:30] INFO (NNIManager) Experiment done. 
[2022-08-26 00:00:27] ERROR (NNIRestHandler) Error: File not found: C:\Users\唐勇强\nni-experiments\jnl15sro\trials\XJ3S3\trial.log at LocalTrainingService.getTrialFile (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\training_service\local\localTrainingService.js:146:19) at NNIManager.getTrialFile (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\core\nnimanager.js:333:37) at D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\rest_server\restHandler.js:284:29 at Layer.handle [as handle_request] (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\layer.js:95:5) at next (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\route.js:137:13) at Route.dispatch (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\route.js:112:3) at Layer.handle [as handle_request] (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\layer.js:95:5) at D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:281:22 at param (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:360:14) at param (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:371:14) [2022-08-26 00:01:47] ERROR (NNIRestHandler) Error: File not found: C:\Users\唐勇强\nni-experiments\jnl15sro\trials\XJ3S3\stderr at LocalTrainingService.getTrialFile (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\training_service\local\localTrainingService.js:146:19) at NNIManager.getTrialFile (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\core\nnimanager.js:333:37) at D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\rest_server\restHandler.js:284:29 at Layer.handle [as handle_request] (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\layer.js:95:5) at next 
(D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\route.js:137:13) at Route.dispatch (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\route.js:112:3) at Layer.handle [as handle_request] (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\layer.js:95:5) at D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:281:22 at param (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:360:14) at param (D:\PythonProgram\Anaconda3\lib\site-packages\nni_node\node_modules\express\lib\router\index.js:371:14)
- dispatcher.log:
[2022-08-25 23:39:42] INFO (numexpr.utils/MainThread) Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
[2022-08-25 23:39:42] INFO (numexpr.utils/MainThread) NumExpr defaulting to 8 threads.
[2022-08-25 23:39:43] INFO (nni.tuner.tpe/MainThread) Using random seed 1939897128
[2022-08-25 23:39:43] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
- nnictl_stderr: Experiment jnl15sro start: 2022-08-25 23:39:41.343385
- nnictl_stdout: Experiment jnl15sro start: 2022-08-25 23:39:41.343385
run:
$env:PATH='D:\PythonProgram\Anaconda3;D:\PythonProgram\Anaconda3\Library\mingw-w64\bin;D:\PythonProgram\Anaconda3\Library\usr\bin;D:\PythonProgram\Anaconda3\Library\bin;D:\PythonProgram\Anaconda3\Scripts;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0;C:\Windows\System32\OpenSSH;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;D:\MATLAB\R2021b\runtime\win64;D:\MATLAB\R2021b\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2021.1.0;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files (x86)\Microsoft SQL Server\90\Tools\binn;D:\AllSoftwares\Softwares\SVN\bin;C:\Users\唐勇强\AppData\Local\Microsoft\WindowsApps;;D:\PythonProgram\PyCharm Community Edition 2021.3.2\bin;'
$env:NNI_PLATFORM="local"
$env:NNI_EXP_ID="jnl15sro"
$env:NNI_SYS_DIR="C:\Users\唐勇强\nni-experiments\jnl15sro\trials\YZUnM"
$env:NNI_TRIAL_JOB_ID="YZUnM"
$env:NNI_OUTPUT_DIR="C:\Users\唐勇强\nni-experiments\jnl15sro\trials\YZUnM"
$env:NNI_TRIAL_SEQ_ID="1"
$env:NNI_CODE_DIR="D:\深度学习\mnist"
$PSDefaultParameterValues = @{'Out-File:Encoding' = 'utf8'}
cd $env:NNI_CODE_DIR
cmd.exe /c 'python mnist.py' 1>C:\Users\唐勇强\nni-experiments\jnl15sro\trials\YZUnM\stdout 2>C:\Users\唐勇强\nni-experiments\jnl15sro\trials\YZUnM\stderr
$NOW_DATE = [int64](([datetime]::UtcNow)-(get-date "1/1/1970")).TotalSeconds
$NOW_DATE = "$NOW_DATE" + (Get-Date -Format fff).ToString()
Write $LASTEXITCODE " " $NOW_DATE | Out-File "C:\Users\唐勇强\nni-experiments\jnl15sro\trials\YZUnM.nni\state" -NoNewline -encoding utf8
powershell
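For anyone triaging a similar nnimanager.log, a small script can list which trials ended in FAILED. This is only a sketch: the regular expression is inferred from the "Trial job ... status changed from ... to ..." lines quoted above, and `failed_trials` is a hypothetical helper name, not part of NNI.

```python
import re

# Pattern inferred from the log lines quoted in this issue (an assumption,
# not a documented NNI log format).
STATUS_RE = re.compile(r"Trial job (\w+) status changed from (\w+) to (\w+)")


def failed_trials(log_text):
    """Return the IDs of trials whose final reported status is FAILED."""
    last_status = {}
    for job_id, _old, new in STATUS_RE.findall(log_text):
        last_status[job_id] = new  # later entries overwrite earlier ones
    return [job_id for job_id, status in last_status.items() if status == "FAILED"]


if __name__ == "__main__":
    # Two entries copied from the log above; works on flattened text as well.
    sample = (
        "[2022-08-25 23:39:58] INFO (NNIManager) Trial job XJ3S3 status changed from WAITING to FAILED "
        "[2022-08-25 23:40:08] INFO (NNIManager) Trial job YZUnM status changed from WAITING to FAILED"
    )
    print(failed_trials(sample))
```

A trial that jumps straight from WAITING to FAILED, as every trial does here, never started its command, which points at the launch environment rather than the training code.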
I also had all trials fail in the HPO TensorFlow experiment. When I changed experiment.config.trial_command = 'python model.py'
to experiment.config.trial_command = 'python3 model.py'
in the main.py
file, everything worked. Maybe you could try that. Also, I think a directory with an English name would be better!
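One way to sidestep the python vs. python3 ambiguity is to build the trial command from the interpreter that is currently running. The sketch below is only an illustration: `trial_command_for` is a hypothetical helper, and whether this helps depends on why the trials fail in the first place.

```python
import sys


def trial_command_for(script, interpreter=None):
    """Build a trial command string for the given script.

    If no interpreter is given, use the one running this process
    (sys.executable), so the trial uses the same Python environment.
    """
    exe = interpreter or sys.executable
    if " " in exe:
        exe = f'"{exe}"'  # quote paths with spaces, e.g. under "Program Files"
    return f"{exe} {script}"


if __name__ == "__main__":
    # The resulting string would then be assigned in main.py, e.g.
    # experiment.config.trial_command = trial_command_for("model.py")
    print(trial_command_for("model.py", "python3"))
```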
I changed experiment.config.trial_command = 'python mnist.py'
to experiment.config.trial_command = 'python3 model.py'
in mnist.py
, but the experiment still failed.
Also, I don't understand what you mean by "I think an english directory is better!".
Looks like the trials failed for different reasons. "English directory"
means that the log directory's path is in English rather than Chinese.
After I changed experimentWorkingDirectory: "C:\\Users\\唐勇强\\nni-experiments"
to experimentWorkingDirectory: "D:\\mnist-pytorch\\nni-experiments"
, the trials ran successfully. Thank you very much for your help!
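For affected versions, a quick pre-flight check along these lines could flag a working directory that contains non-ASCII characters before the experiment starts. It is a minimal sketch: `has_non_ascii` and `choose_working_directory` are hypothetical helper names, and the paths are the two from this thread.

```python
def has_non_ascii(path):
    """Return True if the path contains any non-ASCII character."""
    return not str(path).isascii()  # str.isascii() needs Python 3.7+


def choose_working_directory(preferred, fallback):
    """Use the preferred directory unless its path contains non-ASCII characters."""
    return fallback if has_non_ascii(preferred) else preferred


if __name__ == "__main__":
    default_dir = r"C:\Users\唐勇强\nni-experiments"    # default under the user profile
    ascii_dir = r"D:\mnist-pytorch\nni-experiments"    # workaround path from this thread
    # The result would be used as the experimentWorkingDirectory value.
    print(choose_working_directory(default_dir, ascii_dir))
```

The check only looks at the string, so it does not verify that the fallback directory exists or is writable.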
Fixed in v2.10