I cannot make it work on GPU for training
Description of the issue
I cannot run any experiment on GPU.
I have tried with a Tesla P4, a P100, and a GTX 1060; I can only make it work on CPU.
I have tried many configurations, setting useActiveGpu to true or false, trialGpuNumber to 1, and gpuIndices to '0'. However, it never completes a single architecture training.
I have tried both outside and inside a Docker container.
Configuration
- Experiment config:
nni/examples/trials/mnist-pytorch/config.yml
Outside a Docker container
Environment
- NNI version: 3.0
- Training service: local
- Client OS: Debian 10
- Python version: 3.10.13
- PyTorch/TensorFlow version: 2.3.0+cu121
- Is conda/virtualenv/venv used?: yes
Log message
nnimanager.log
[2024-05-03 10:54:56] WARNING (pythonScript) Python command [nni.tools.nni_manager_scripts.collect_gpu_info] has stderr: Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/nni/tools/nni_manager_scripts/collect_gpu_info.py", line 174, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/nni/tools/nni_manager_scripts/collect_gpu_info.py", line 34, in main
print(json.dumps(data), flush=True)
File "/opt/conda/lib/python3.10/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/opt/conda/lib/python3.10/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/opt/conda/lib/python3.10/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/opt/conda/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bytes is not JSON serializable
[2024-05-03 10:54:56] INFO (ShutdownManager) Initiate shutdown: training service initialize failed
[2024-05-03 10:54:56] ERROR (GpuInfoCollector) Failed to collect GPU info, collector output:
[2024-05-03 10:54:56] ERROR (TrainingServiceCompat) Training srevice initialize failed: Error: TaskScheduler: Failed to collect GPU info
at TaskScheduler.init (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/task_scheduler/scheduler.js:16:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async TaskSchedulerClient.start (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/task_scheduler_client.js:20:13)
at async Promise.all (index 0)
at async TrialKeeper.start (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/keeper.js:48:9)
at async LocalTrainingServiceV3.start (/opt/conda/lib/python3.10/site-packages/nni_node/training_service/local_v3/local.js:28:9)
at async V3asV1.start (/opt/conda/lib/python3.10/site-packages/nni_node/training_service/v3/compat.js:235:29
Here, the GPU info cannot be retrieved.
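The traceback points at the root cause: collect_gpu_info.py apparently put a bytes value (presumably undecoded subprocess output) into the dict it passes to json.dumps. A minimal reproduction of that failure, with a decode as the obvious local workaround (the b'550.54.15' value is illustrative, not taken from the collector):

```python
import json

# Illustrative: the collector apparently stored raw bytes (undecoded
# subprocess output) in the dict it serializes.
data = {"driverVersion": b"550.54.15"}

try:
    json.dumps(data)
    msg = None
except TypeError as e:
    msg = str(e)
print(msg)  # Object of type bytes is not JSON serializable

# Decoding the bytes first makes the payload serializable again.
data["driverVersion"] = data["driverVersion"].decode()
print(json.dumps(data))  # {"driverVersion": "550.54.15"}
```

This is only a sketch of the failure mode; the actual fix would belong inside collect_gpu_info.py, wherever the bytes value originates.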
experiment.log
[2024-05-03 13:52:31] INFO (nni.experiment) Starting web server...
[2024-05-03 13:52:32] INFO (nni.experiment) Setting up...
[2024-05-03 13:52:33] INFO (nni.experiment) Web portal URLs: http://127.0.0.1:8081 http://10.164.0.8:8081 http://172.17.0.1:8081
[2024-05-03 13:53:03] INFO (nni.experiment) Stopping experiment, please wait...
[2024-05-03 13:53:03] INFO (nni.experiment) Saving experiment checkpoint...
[2024-05-03 13:53:03] INFO (nni.experiment) Stopping NNI manager, if any...
[2024-05-03 13:53:23] ERROR (nni.experiment) HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 281, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/opt/conda/envs/nni/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/util/util.py", line 39, in reraise
raise value
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 539, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 370, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/experiment.py", line 171, in _stop_nni_manager
rest.delete(self.port, '/experiment', self.url_prefix)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/rest.py", line 52, in delete
request('delete', port, api, prefix=prefix)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/adapters.py", line 532, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
[2024-05-03 13:53:23] WARNING (nni.experiment) Cannot gracefully stop experiment, killing NNI process...
There is a timeout because the data cannot be retrieved.
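The timeout itself can be reproduced in isolation: if the server end accepts the TCP connection but never sends a response (as the wedged NNI manager appears to do here), the client's read blocks until its timeout fires. A minimal stdlib sketch (the port is picked by the OS; the path is illustrative):

```python
import http.client
import socket

# Sketch: a server socket that accepts TCP connections (via the listen
# backlog) but never sends a response, like the wedged NNI manager.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

timed_out = False
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=0.5)
try:
    conn.request("DELETE", "/experiment")
    conn.getresponse()       # blocks reading the status line...
except socket.timeout:       # ...until the read timeout fires
    timed_out = True
finally:
    conn.close()
    server.close()

print("Read timed out:", timed_out)
```

So the ReadTimeout in experiment.log is a symptom, not the fault: the manager process is stuck after the GPU collector failed.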
Inside a Docker container
Dockerfile
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
ARG NNI_RELEASE
LABEL maintainer='Microsoft NNI Team<[email protected]>'
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get -y update
RUN apt-get -y install \
automake \
build-essential \
cmake \
curl \
git \
openssh-server \
python3 \
python3-dev \
python3-pip \
sudo \
unzip \
wget \
zip
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*
RUN ln -s python3 /usr/bin/python
RUN python3 -m pip --no-cache-dir install pip==22.0.3 setuptools==60.9.1 wheel==0.37.1
RUN python3 -m pip --no-cache-dir install \
lightgbm==3.3.2 \
numpy==1.22.2 \
pandas==1.4.1 \
scikit-learn==1.0.2 \
scipy==1.8.0
RUN python3 -m pip --no-cache-dir install \
torch==1.10.2+cu113 \
torchvision==0.11.3+cu113 \
torchaudio==0.10.2+cu113 \
-f https://download.pytorch.org/whl/cu113/torch_stable.html
RUN python3 -m pip --no-cache-dir install pytorch-lightning==1.6.1
RUN python3 -m pip --no-cache-dir install tensorflow==2.9.1
RUN python3 -m pip --no-cache-dir install azureml==0.2.7 azureml-sdk==1.38.0
# COPY dist/nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl .
# RUN python3 -m pip install nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl
# RUN rm nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl
ENV PATH=/root/.local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/sbin
WORKDIR /root
RUN pip install nni
RUN git clone https://github.com/microsoft/nni.git
RUN apt-get -y update
RUN apt-get -y install nano
Log message
nnimanager.log
root@1b02414e6d3e:~/nni-experiments/_latest/log# cat nnimanager.log
[2024-05-03 14:46:11] DEBUG (WsChannelServer.tuner) Start listening tuner/:channel
[2024-05-03 14:46:11] INFO (main) Start NNI manager
[2024-05-03 14:46:11] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-05-03 14:46:11] INFO (RestServer) REST server started.
[2024-05-03 14:46:11] DEBUG (SqlDB) Database directory: /root/nni-experiments/o21hdgqs/db
[2024-05-03 14:46:11] INFO (NNIDataStore) Datastore initialization done
[2024-05-03 14:46:11] DEBUG (main) start() returned.
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) POST: /experiment: body: {
experimentType: 'hpo',
searchSpaceFile: '/root/nni/examples/trials/mnist-pytorch/search_space.json',
searchSpace: {
batch_size: { _type: 'choice', _value: [Array] },
hidden_size: { _type: 'choice', _value: [Array] },
lr: { _type: 'choice', _value: [Array] },
momentum: { _type: 'uniform', _value: [Array] }
},
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialConcurrency: 1,
trialGpuNumber: 1,
useAnnotation: false,
debug: false,
logLevel: 'info',
experimentWorkingDirectory: '/root/nni-experiments',
tuner: { name: 'TPE', classArgs: { optimize_mode: 'maximize' } },
trainingService: {
platform: 'local',
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialGpuNumber: 1,
debug: false,
useActiveGpu: true,
maxTrialNumberPerGpu: 1,
reuseMode: false
}
}
[2024-05-03 14:46:12] INFO (NNIManager) Starting experiment: o21hdgqs
[2024-05-03 14:46:12] INFO (NNIManager) Setup training service...
[2024-05-03 14:46:12] DEBUG (LocalV3.local) Training sevice config: {
platform: 'local',
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialGpuNumber: 1,
debug: false,
useActiveGpu: true,
maxTrialNumberPerGpu: 1,
reuseMode: false
}
[2024-05-03 14:46:12] INFO (NNIManager) Setup tuner...
[2024-05-03 14:46:12] DEBUG (NNIManager) dispatcher command: /usr/bin/python3,-m,nni,--exp_params,eyJleHBlcmltZW50VHlwZSI6ImhwbyIsInNlYXJjaFNwYWNlRmlsZSI6Ii9yb290L25uaS9leGFtcGxlcy90cmlhbHMvbW5pc3QtcHl0b3JjaC9zZWFyY2hfc3BhY2UuanNvbiIsInRyaWFsQ29tbWFuZCI6InB5dGhvbjMgbW5pc3QucHkiLCJ0cmlhbENvZGVEaXJlY3RvcnkiOiIvcm9vdC9ubmkvZXhhbXBsZXMvdHJpYWxzL21uaXN0LXB5dG9yY2giLCJ0cmlhbENvbmN1cnJlbmN5IjoxLCJ0cmlhbEdwdU51bWJlciI6MSwidXNlQW5ub3RhdGlvbiI6ZmFsc2UsImRlYnVnIjpmYWxzZSwibG9nTGV2ZWwiOiJpbmZvIiwiZXhwZXJpbWVudFdvcmtpbmdEaXJlY3RvcnkiOiIvcm9vdC9ubmktZXhwZXJpbWVudHMiLCJ0dW5lciI6eyJuYW1lIjoiVFBFIiwiY2xhc3NBcmdzIjp7Im9wdGltaXplX21vZGUiOiJtYXhpbWl6ZSJ9fSwidHJhaW5pbmdTZXJ2aWNlIjp7InBsYXRmb3JtIjoibG9jYWwiLCJ0cmlhbENvbW1hbmQiOiJweXRob24zIG1uaXN0LnB5IiwidHJpYWxDb2RlRGlyZWN0b3J5IjoiL3Jvb3Qvbm5pL2V4YW1wbGVzL3RyaWFscy9tbmlzdC1weXRvcmNoIiwidHJpYWxHcHVOdW1iZXIiOjEsImRlYnVnIjpmYWxzZSwidXNlQWN0aXZlR3B1Ijp0cnVlLCJtYXhUcmlhbE51bWJlclBlckdwdSI6MSwicmV1c2VNb2RlIjpmYWxzZX19
[2024-05-03 14:46:12] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-05-03 14:46:12] DEBUG (tuner_command_channel) Waiting connection...
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:46:13] DEBUG (WsChannelServer.tuner) Incoming connection __default__
[2024-05-03 14:46:13] DEBUG (WsChannel.__default__) Epoch 0 start
[2024-05-03 14:46:13] INFO (NNIManager) Add event listeners
[2024-05-03 14:46:13] DEBUG (NNIManager) Send tuner command: INITIALIZE: [object Object]
[2024-05-03 14:46:13] INFO (LocalV3.local) Start
[2024-05-03 14:46:13] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2024-05-03 14:46:13] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 128, "lr": 0.001, "momentum": 0.47523697672790355}, "parameter_index": 0}
[2024-05-03 14:46:14] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 0,
hyperParameters: {
value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 128, "lr": 0.001, "momentum": 0.47523697672790355}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2024-05-03 14:46:15] INFO (GpuInfoCollector) Forced update: {
gpuNumber: 1,
driverVersion: '550.54.15',
cudaVersion: 12040,
gpus: [
{
index: 0,
model: 'Tesla T4',
cudaCores: 2560,
gpuMemory: 16106127360,
freeGpuMemory: 15642263552,
gpuCoreUtilization: 0,
gpuMemoryUtilization: 0
}
],
processes: [],
success: true
}
[2024-05-03 14:46:17] INFO (LocalV3.local) Register directory trial_code = /root/nni/examples/trials/mnist-pytorch
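For what it's worth, the collector's figures inside the container look healthy: the memory numbers from the "Forced update" entry above check out, and the Tesla T4 is essentially idle (a quick arithmetic sketch using the logged values):

```python
# Values copied from the "Forced update" log entry above.
gpu_memory = 16106127360        # total GPU memory, bytes
free_gpu_memory = 15642263552   # free GPU memory, bytes

total_gib = gpu_memory / 2**30
free_fraction = free_gpu_memory / gpu_memory

print(total_gib)                 # 15.0 GiB (a 16 GB Tesla T4)
print(round(free_fraction, 3))   # 0.971 -> the GPU is essentially idle
```

So inside the container the GPU is detected and free, yet the trial still never starts.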
experiment.log
root@1b02414e6d3e:~/nni-experiments/_latest/log# cat experiment.log
[2024-05-03 14:46:11] INFO (nni.experiment) Creating experiment, Experiment ID: o21hdgqs
[2024-05-03 14:46:11] INFO (nni.experiment) Starting web server...
[2024-05-03 14:46:12] INFO (nni.experiment) Setting up...
[2024-05-03 14:46:12] INFO (nni.experiment) Web portal URLs: http://127.0.0.1:8080 http://172.17.0.2:8080
[2024-05-03 14:46:42] INFO (nni.experiment) Stopping experiment, please wait...
[2024-05-03 14:46:42] INFO (nni.experiment) Saving experiment checkpoint...
[2024-05-03 14:46:42] INFO (nni.experiment) Stopping NNI manager, if any...
When I'm using CPU only:
I get everything I expect: the WebUI, the experiment trials, and so on...
root@6dcd2267cf44:~# nnictl create --config nni/examples/trials/mnist-pytorch/config.yml --foreground --debug
[2024-05-03 14:37:54] Creating experiment, Experiment ID: tcq192jf
[2024-05-03 14:37:54] Starting web server...
[2024-05-03 14:37:55] DEBUG (WsChannelServer.tuner) Start listening tuner/:channel
[2024-05-03 14:37:55] INFO (main) Start NNI manager
[2024-05-03 14:37:55] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-05-03 14:37:55] INFO (RestServer) REST server started.
[2024-05-03 14:37:55] DEBUG (SqlDB) Database directory: /root/nni-experiments/tcq192jf/db
[2024-05-03 14:37:55] INFO (NNIDataStore) Datastore initialization done
[2024-05-03 14:37:55] DEBUG (main) start() returned.
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:37:55] Setting up...
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) POST: /experiment: body: {
experimentType: 'hpo',
searchSpaceFile: '/root/nni/examples/trials/mnist-pytorch/search_space.json',
searchSpace: {
batch_size: { _type: 'choice', _value: [Array] },
hidden_size: { _type: 'choice', _value: [Array] },
lr: { _type: 'choice', _value: [Array] },
momentum: { _type: 'uniform', _value: [Array] }
},
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialConcurrency: 1,
trialGpuNumber: 0,
useAnnotation: false,
debug: false,
logLevel: 'info',
experimentWorkingDirectory: '/root/nni-experiments',
tuner: { name: 'TPE', classArgs: { optimize_mode: 'maximize' } },
trainingService: {
platform: 'local',
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialGpuNumber: 0,
debug: false,
maxTrialNumberPerGpu: 1,
reuseMode: false
}
}
[2024-05-03 14:37:55] INFO (NNIManager) Starting experiment: tcq192jf
[2024-05-03 14:37:55] INFO (NNIManager) Setup training service...
[2024-05-03 14:37:55] DEBUG (LocalV3.local) Training sevice config: {
platform: 'local',
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialGpuNumber: 0,
debug: false,
maxTrialNumberPerGpu: 1,
reuseMode: false
}
[2024-05-03 14:37:55] INFO (NNIManager) Setup tuner...
[2024-05-03 14:37:55] DEBUG (NNIManager) dispatcher command: /usr/bin/python3,-m,nni,--exp_params,eyJleHBlcmltZW50VHlwZSI6ImhwbyIsInNlYXJjaFNwYWNlRmlsZSI6Ii9yb290L25uaS9leGFtcGxlcy90cmlhbHMvbW5pc3QtcHl0b3JjaC9zZWFyY2hfc3BhY2UuanNvbiIsInRyaWFsQ29tbWFuZCI6InB5dGhvbjMgbW5pc3QucHkiLCJ0cmlhbENvZGVEaXJlY3RvcnkiOiIvcm9vdC9ubmkvZXhhbXBsZXMvdHJpYWxzL21uaXN0LXB5dG9yY2giLCJ0cmlhbENvbmN1cnJlbmN5IjoxLCJ0cmlhbEdwdU51bWJlciI6MCwidXNlQW5ub3RhdGlvbiI6ZmFsc2UsImRlYnVnIjpmYWxzZSwibG9nTGV2ZWwiOiJpbmZvIiwiZXhwZXJpbWVudFdvcmtpbmdEaXJlY3RvcnkiOiIvcm9vdC9ubmktZXhwZXJpbWVudHMiLCJ0dW5lciI6eyJuYW1lIjoiVFBFIiwiY2xhc3NBcmdzIjp7Im9wdGltaXplX21vZGUiOiJtYXhpbWl6ZSJ9fSwidHJhaW5pbmdTZXJ2aWNlIjp7InBsYXRmb3JtIjoibG9jYWwiLCJ0cmlhbENvbW1hbmQiOiJweXRob24zIG1uaXN0LnB5IiwidHJpYWxDb2RlRGlyZWN0b3J5IjoiL3Jvb3Qvbm5pL2V4YW1wbGVzL3RyaWFscy9tbmlzdC1weXRvcmNoIiwidHJpYWxHcHVOdW1iZXIiOjAsImRlYnVnIjpmYWxzZSwibWF4VHJpYWxOdW1iZXJQZXJHcHUiOjEsInJldXNlTW9kZSI6ZmFsc2V9fQ==
[2024-05-03 14:37:55] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-05-03 14:37:55] DEBUG (tuner_command_channel) Waiting connection...
[2024-05-03 14:37:55] Web portal URLs: http://127.0.0.1:8080 http://172.17.0.2:8080
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:37:57] DEBUG (WsChannelServer.tuner) Incoming connection __default__
[2024-05-03 14:37:57] DEBUG (WsChannel.__default__) Epoch 0 start
[2024-05-03 14:37:57] INFO (NNIManager) Add event listeners
[2024-05-03 14:37:57] DEBUG (NNIManager) Send tuner command: INITIALIZE: [object Object]
[2024-05-03 14:37:57] INFO (LocalV3.local) Start
[2024-05-03 14:37:57] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2024-05-03 14:37:57] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}
[2024-05-03 14:37:57] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 0,
hyperParameters: {
value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2024-05-03 14:37:58] INFO (LocalV3.local) Register directory trial_code = /root/nni/examples/trials/mnist-pytorch
[2024-05-03 14:37:58] INFO (LocalV3.local) Created trial wcvTY
[2024-05-03 14:38:00] INFO (LocalV3.local) Trial parameter: wcvTY {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}
[2024-05-03 14:38:05] DEBUG (NNIRestHandler) GET: /check-status: body: {}
...
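Incidentally, the long --exp_params blobs in the two dispatcher commands above are just base64-encoded JSON, so they can be decoded to confirm which settings actually reached the manager. A sketch with a shortened, hypothetical payload standing in for the real blob:

```python
import base64
import json

# Hypothetical miniature of the real --exp_params payload (the real blob
# carries the full experiment config).
payload = {"trialGpuNumber": 1, "trainingService": {"useActiveGpu": True}}
blob = base64.b64encode(json.dumps(payload).encode()).decode()

decoded = json.loads(base64.b64decode(blob))
print(decoded["trialGpuNumber"])                   # 1
print(decoded["trainingService"]["useActiveGpu"])  # True
```

Decoding the actual blobs confirms the GPU run did carry trialGpuNumber: 1 and useActiveGpu: true, while the CPU run carried trialGpuNumber: 0.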
How to reproduce it?
If using a Docker container:
docker build -t "nas-experiment" .
nvidia-docker run -it -p 8081:8081 nas-experiment
Then in both cases:
- Both outside and inside a Docker container, I modify nni/examples/trials/mnist-pytorch/config.yml to run the trials on GPU.
- Then I run the following command so I can watch the logs directly:
nnictl create --config nni/examples/trials/mnist-pytorch/config.yml --port 8081 --debug --foreground
As a result, the WebUI doesn't start: it times out trying to retrieve data, since the experiment won't load on GPU.
Notes
- I am readily available to answer questions and to receive help on this subject, as I am currently working on NAS.
- I am going to look into Archai and how it differs from NNI, until I can use the GPU for training here.
- I am using GCP instances for this work.
I made it work using a devcontainer with NNI version 2.7:
Dockerfile
FROM msranni/nni:v2.7
RUN pip install matplotlib tensorflow_datasets dill
Still having problems with version 3.0
I have a somewhat similar issue. My config:
authorName: default
experimentName: hyperparam searching
trialConcurrency: 1
trainingServicePlatform: local
useAnnotation: false
searchSpacePath: searching_space.json
tuner:
  builtinTunerName: Random
  classArgs:
    optimize_mode: minimize
trial:
  command: python train.py
  codeDir: .
When I run it this way, the code runs on the CPU. But when I run it with the following config:
authorName: default
experimentName: hyperparam searching
trialConcurrency: 1
trainingServicePlatform: local
useAnnotation: false
searchSpacePath: searching_space.json
tuner:
  builtinTunerName: Random
  classArgs:
    optimize_mode: minimize
trial:
  command: python train.py
  codeDir: .
  gpuNum: 1
localConfig:
  useActiveGpu: false
it creates 800+ Python files and the web portal link no longer opens. It either crashes my PC (because of all those files) or the portal shows 0 running trials. Why?
I am having the same problem as Rajesh90123.