I cannot make it work on GPU for training
Description of the issue
I cannot run any experiment on GPU.
I have tried with a Tesla P4, a P100, and a GTX 1060; I can only make it work on CPU.
I have tried many configurations, setting useActiveGpu to true or false, trialGpuNumber to 1, and gpuIndices to '0'. However, it never completes a single architecture training.
I have tried both outside and inside a Docker container.
Configuration
- Experiment config:
nni/examples/trials/mnist-pytorch/config.yml
Outside a Docker container
Environment
- NNI version: 3.0
- Training service: local
- Client OS: Debian 10
- Python version: 3.10.13
- PyTorch/TensorFlow version: 2.3.0+cu121
- Is conda/virtualenv/venv used?: yes
Log message
nnimanager.log
[2024-05-03 10:54:56] WARNING (pythonScript) Python command [nni.tools.nni_manager_scripts.collect_gpu_info] has stderr: Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/nni/tools/nni_manager_scripts/collect_gpu_info.py", line 174, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/nni/tools/nni_manager_scripts/collect_gpu_info.py", line 34, in main
print(json.dumps(data), flush=True)
File "/opt/conda/lib/python3.10/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/opt/conda/lib/python3.10/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/opt/conda/lib/python3.10/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/opt/conda/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bytes is not JSON serializable
[2024-05-03 10:54:56] INFO (ShutdownManager) Initiate shutdown: training service initialize failed
[2024-05-03 10:54:56] ERROR (GpuInfoCollector) Failed to collect GPU info, collector output:
[2024-05-03 10:54:56] ERROR (TrainingServiceCompat) Training srevice initialize failed: Error: TaskScheduler: Failed to collect GPU info
at TaskScheduler.init (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/task_scheduler/scheduler.js:16:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async TaskSchedulerClient.start (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/task_scheduler_client.js:20:13)
at async Promise.all (index 0)
at async TrialKeeper.start (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/keeper.js:48:9)
at async LocalTrainingServiceV3.start (/opt/conda/lib/python3.10/site-packages/nni_node/training_service/local_v3/local.js:28:9)
at async V3asV1.start (/opt/conda/lib/python3.10/site-packages/nni_node/training_service/v3/compat.js:235:29
Here, the GPU info cannot be retrieved.
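The traceback points at the root cause: collect_gpu_info.py apparently put a bytes value (presumably undecoded subprocess output) into the dict it passes to json.dumps. A minimal reproduction of that failure, with a decode as the obvious local workaround (the b'550.54.15' value is illustrative, not taken from the collector):

```python
import json

# Illustrative: the collector apparently stored raw bytes (undecoded
# subprocess output) in the dict it serializes.
data = {"driverVersion": b"550.54.15"}

try:
    json.dumps(data)
    msg = None
except TypeError as e:
    msg = str(e)
print(msg)  # Object of type bytes is not JSON serializable

# Decoding the bytes first makes the payload serializable again.
data["driverVersion"] = data["driverVersion"].decode()
print(json.dumps(data))  # {"driverVersion": "550.54.15"}
```

This is only a sketch of the failure mode; the actual fix would belong inside collect_gpu_info.py, wherever the bytes value originates.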
experiment.log
[2024-05-03 13:52:31] INFO (nni.experiment) Starting web server...
[2024-05-03 13:52:32] INFO (nni.experiment) Setting up...
[2024-05-03 13:52:33] INFO (nni.experiment) Web portal URLs: http://127.0.0.1:8081 http://10.164.0.8:8081 http://172.17.0.1:8081
[2024-05-03 13:53:03] INFO (nni.experiment) Stopping experiment, please wait...
[2024-05-03 13:53:03] INFO (nni.experiment) Saving experiment checkpoint...
[2024-05-03 13:53:03] INFO (nni.experiment) Stopping NNI manager, if any...
[2024-05-03 13:53:23] ERROR (nni.experiment) HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 281, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/opt/conda/envs/nni/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/util/util.py", line 39, in reraise
raise value
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 539, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 370, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/experiment.py", line 171, in _stop_nni_manager
rest.delete(self.port, '/experiment', self.url_prefix)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/rest.py", line 52, in delete
request('delete', port, api, prefix=prefix)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/adapters.py", line 532, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
[2024-05-03 13:53:23] WARNING (nni.experiment) Cannot gracefully stop experiment, killing NNI process...
There is a timeout because the data cannot be retrieved.
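The timeout itself can be reproduced in isolation: if the server end accepts the TCP connection but never sends a response (as the wedged NNI manager appears to do here), the client's read blocks until its timeout fires. A minimal stdlib sketch (the port is picked by the OS; the path is illustrative):

```python
import http.client
import socket

# Sketch: a server socket that accepts TCP connections (via the listen
# backlog) but never sends a response, like the wedged NNI manager.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

timed_out = False
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=0.5)
try:
    conn.request("DELETE", "/experiment")
    conn.getresponse()       # blocks reading the status line...
except socket.timeout:       # ...until the read timeout fires
    timed_out = True
finally:
    conn.close()
    server.close()

print("Read timed out:", timed_out)
```

So the ReadTimeout in experiment.log is a symptom, not the fault: the manager process is stuck after the GPU collector failed.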
Inside a Docker container
Dockerfile
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
ARG NNI_RELEASE
LABEL maintainer='Microsoft NNI Team<[email protected]>'
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get -y update
RUN apt-get -y install \
automake \
build-essential \
cmake \
curl \
git \
openssh-server \
python3 \
python3-dev \
python3-pip \
sudo \
unzip \
wget \
zip
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*
RUN ln -s python3 /usr/bin/python
RUN python3 -m pip --no-cache-dir install pip==22.0.3 setuptools==60.9.1 wheel==0.37.1
RUN python3 -m pip --no-cache-dir install \
lightgbm==3.3.2 \
numpy==1.22.2 \
pandas==1.4.1 \
scikit-learn==1.0.2 \
scipy==1.8.0
RUN python3 -m pip --no-cache-dir install \
torch==1.10.2+cu113 \
torchvision==0.11.3+cu113 \
torchaudio==0.10.2+cu113 \
-f https://download.pytorch.org/whl/cu113/torch_stable.html
RUN python3 -m pip --no-cache-dir install pytorch-lightning==1.6.1
RUN python3 -m pip --no-cache-dir install tensorflow==2.9.1
RUN python3 -m pip --no-cache-dir install azureml==0.2.7 azureml-sdk==1.38.0
# COPY dist/nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl .
# RUN python3 -m pip install nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl
# RUN rm nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl
ENV PATH=/root/.local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/sbin
WORKDIR /root
RUN pip install nni
RUN git clone https://github.com/microsoft/nni.git
RUN apt-get -y update
RUN apt-get -y install nano
Log message
nnimanager.log
root@1b02414e6d3e:~/nni-experiments/_latest/log# cat nnimanager.log
[2024-05-03 14:46:11] DEBUG (WsChannelServer.tuner) Start listening tuner/:channel
[2024-05-03 14:46:11] INFO (main) Start NNI manager
[2024-05-03 14:46:11] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-05-03 14:46:11] INFO (RestServer) REST server started.
[2024-05-03 14:46:11] DEBUG (SqlDB) Database directory: /root/nni-experiments/o21hdgqs/db
[2024-05-03 14:46:11] INFO (NNIDataStore) Datastore initialization done
[2024-05-03 14:46:11] DEBUG (main) start() returned.
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) POST: /experiment: body: {
experimentType: 'hpo',
searchSpaceFile: '/root/nni/examples/trials/mnist-pytorch/search_space.json',
searchSpace: {
batch_size: { _type: 'choice', _value: [Array] },
hidden_size: { _type: 'choice', _value: [Array] },
lr: { _type: 'choice', _value: [Array] },
momentum: { _type: 'uniform', _value: [Array] }
},
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialConcurrency: 1,
trialGpuNumber: 1,
useAnnotation: false,
debug: false,
logLevel: 'info',
experimentWorkingDirectory: '/root/nni-experiments',
tuner: { name: 'TPE', classArgs: { optimize_mode: 'maximize' } },
trainingService: {
platform: 'local',
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialGpuNumber: 1,
debug: false,
useActiveGpu: true,
maxTrialNumberPerGpu: 1,
reuseMode: false
}
}
[2024-05-03 14:46:12] INFO (NNIManager) Starting experiment: o21hdgqs
[2024-05-03 14:46:12] INFO (NNIManager) Setup training service...
[2024-05-03 14:46:12] DEBUG (LocalV3.local) Training sevice config: {
platform: 'local',
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialGpuNumber: 1,
debug: false,
useActiveGpu: true,
maxTrialNumberPerGpu: 1,
reuseMode: false
}
[2024-05-03 14:46:12] INFO (NNIManager) Setup tuner...
[2024-05-03 14:46:12] DEBUG (NNIManager) dispatcher command: /usr/bin/python3,-m,nni,--exp_params,eyJleHBlcmltZW50VHlwZSI6ImhwbyIsInNlYXJjaFNwYWNlRmlsZSI6Ii9yb290L25uaS9leGFtcGxlcy90cmlhbHMvbW5pc3QtcHl0b3JjaC9zZWFyY2hfc3BhY2UuanNvbiIsInRyaWFsQ29tbWFuZCI6InB5dGhvbjMgbW5pc3QucHkiLCJ0cmlhbENvZGVEaXJlY3RvcnkiOiIvcm9vdC9ubmkvZXhhbXBsZXMvdHJpYWxzL21uaXN0LXB5dG9yY2giLCJ0cmlhbENvbmN1cnJlbmN5IjoxLCJ0cmlhbEdwdU51bWJlciI6MSwidXNlQW5ub3RhdGlvbiI6ZmFsc2UsImRlYnVnIjpmYWxzZSwibG9nTGV2ZWwiOiJpbmZvIiwiZXhwZXJpbWVudFdvcmtpbmdEaXJlY3RvcnkiOiIvcm9vdC9ubmktZXhwZXJpbWVudHMiLCJ0dW5lciI6eyJuYW1lIjoiVFBFIiwiY2xhc3NBcmdzIjp7Im9wdGltaXplX21vZGUiOiJtYXhpbWl6ZSJ9fSwidHJhaW5pbmdTZXJ2aWNlIjp7InBsYXRmb3JtIjoibG9jYWwiLCJ0cmlhbENvbW1hbmQiOiJweXRob24zIG1uaXN0LnB5IiwidHJpYWxDb2RlRGlyZWN0b3J5IjoiL3Jvb3Qvbm5pL2V4YW1wbGVzL3RyaWFscy9tbmlzdC1weXRvcmNoIiwidHJpYWxHcHVOdW1iZXIiOjEsImRlYnVnIjpmYWxzZSwidXNlQWN0aXZlR3B1Ijp0cnVlLCJtYXhUcmlhbE51bWJlclBlckdwdSI6MSwicmV1c2VNb2RlIjpmYWxzZX19
[2024-05-03 14:46:12] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-05-03 14:46:12] DEBUG (tuner_command_channel) Waiting connection...
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:46:13] DEBUG (WsChannelServer.tuner) Incoming connection __default__
[2024-05-03 14:46:13] DEBUG (WsChannel.__default__) Epoch 0 start
[2024-05-03 14:46:13] INFO (NNIManager) Add event listeners
[2024-05-03 14:46:13] DEBUG (NNIManager) Send tuner command: INITIALIZE: [object Object]
[2024-05-03 14:46:13] INFO (LocalV3.local) Start
[2024-05-03 14:46:13] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2024-05-03 14:46:13] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 128, "lr": 0.001, "momentum": 0.47523697672790355}, "parameter_index": 0}
[2024-05-03 14:46:14] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 0,
hyperParameters: {
value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 128, "lr": 0.001, "momentum": 0.47523697672790355}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2024-05-03 14:46:15] INFO (GpuInfoCollector) Forced update: {
gpuNumber: 1,
driverVersion: '550.54.15',
cudaVersion: 12040,
gpus: [
{
index: 0,
model: 'Tesla T4',
cudaCores: 2560,
gpuMemory: 16106127360,
freeGpuMemory: 15642263552,
gpuCoreUtilization: 0,
gpuMemoryUtilization: 0
}
],
processes: [],
success: true
}
[2024-05-03 14:46:17] INFO (LocalV3.local) Register directory trial_code = /root/nni/examples/trials/mnist-pytorch
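For what it's worth, the collector's figures inside the container look healthy: the memory numbers from the "Forced update" entry above check out, and the Tesla T4 is essentially idle (a quick arithmetic sketch using the logged values):

```python
# Values copied from the "Forced update" log entry above.
gpu_memory = 16106127360        # total GPU memory, bytes
free_gpu_memory = 15642263552   # free GPU memory, bytes

total_gib = gpu_memory / 2**30
free_fraction = free_gpu_memory / gpu_memory

print(total_gib)                 # 15.0 GiB (a 16 GB Tesla T4)
print(round(free_fraction, 3))   # 0.971 -> the GPU is essentially idle
```

So inside the container the GPU is detected and free, yet the trial still never starts.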
experiment.log
root@1b02414e6d3e:~/nni-experiments/_latest/log# cat experiment.log
[2024-05-03 14:46:11] INFO (nni.experiment) Creating experiment, Experiment ID: o21hdgqs
[2024-05-03 14:46:11] INFO (nni.experiment) Starting web server...
[2024-05-03 14:46:12] INFO (nni.experiment) Setting up...
[2024-05-03 14:46:12] INFO (nni.experiment) Web portal URLs: http://127.0.0.1:8080 http://172.17.0.2:8080
[2024-05-03 14:46:42] INFO (nni.experiment) Stopping experiment, please wait...
[2024-05-03 14:46:42] INFO (nni.experiment) Saving experiment checkpoint...
[2024-05-03 14:46:42] INFO (nni.experiment) Stopping NNI manager, if any...
When I'm using CPU only:
I get everything I expect: the WebUI, the experiment trials, and so on...
root@6dcd2267cf44:~# nnictl create --config nni/examples/trials/mnist-pytorch/config.yml --foreground --debug
[2024-05-03 14:37:54] Creating experiment, Experiment ID: tcq192jf
[2024-05-03 14:37:54] Starting web server...
[2024-05-03 14:37:55] DEBUG (WsChannelServer.tuner) Start listening tuner/:channel
[2024-05-03 14:37:55] INFO (main) Start NNI manager
[2024-05-03 14:37:55] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-05-03 14:37:55] INFO (RestServer) REST server started.
[2024-05-03 14:37:55] DEBUG (SqlDB) Database directory: /root/nni-experiments/tcq192jf/db
[2024-05-03 14:37:55] INFO (NNIDataStore) Datastore initialization done
[2024-05-03 14:37:55] DEBUG (main) start() returned.
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:37:55] Setting up...
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) POST: /experiment: body: {
experimentType: 'hpo',
searchSpaceFile: '/root/nni/examples/trials/mnist-pytorch/search_space.json',
searchSpace: {
batch_size: { _type: 'choice', _value: [Array] },
hidden_size: { _type: 'choice', _value: [Array] },
lr: { _type: 'choice', _value: [Array] },
momentum: { _type: 'uniform', _value: [Array] }
},
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialConcurrency: 1,
trialGpuNumber: 0,
useAnnotation: false,
debug: false,
logLevel: 'info',
experimentWorkingDirectory: '/root/nni-experiments',
tuner: { name: 'TPE', classArgs: { optimize_mode: 'maximize' } },
trainingService: {
platform: 'local',
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialGpuNumber: 0,
debug: false,
maxTrialNumberPerGpu: 1,
reuseMode: false
}
}
[2024-05-03 14:37:55] INFO (NNIManager) Starting experiment: tcq192jf
[2024-05-03 14:37:55] INFO (NNIManager) Setup training service...
[2024-05-03 14:37:55] DEBUG (LocalV3.local) Training sevice config: {
platform: 'local',
trialCommand: 'python3 mnist.py',
trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
trialGpuNumber: 0,
debug: false,
maxTrialNumberPerGpu: 1,
reuseMode: false
}
[2024-05-03 14:37:55] INFO (NNIManager) Setup tuner...
[2024-05-03 14:37:55] DEBUG (NNIManager) dispatcher command: /usr/bin/python3,-m,nni,--exp_params,eyJleHBlcmltZW50VHlwZSI6ImhwbyIsInNlYXJjaFNwYWNlRmlsZSI6Ii9yb290L25uaS9leGFtcGxlcy90cmlhbHMvbW5pc3QtcHl0b3JjaC9zZWFyY2hfc3BhY2UuanNvbiIsInRyaWFsQ29tbWFuZCI6InB5dGhvbjMgbW5pc3QucHkiLCJ0cmlhbENvZGVEaXJlY3RvcnkiOiIvcm9vdC9ubmkvZXhhbXBsZXMvdHJpYWxzL21uaXN0LXB5dG9yY2giLCJ0cmlhbENvbmN1cnJlbmN5IjoxLCJ0cmlhbEdwdU51bWJlciI6MCwidXNlQW5ub3RhdGlvbiI6ZmFsc2UsImRlYnVnIjpmYWxzZSwibG9nTGV2ZWwiOiJpbmZvIiwiZXhwZXJpbWVudFdvcmtpbmdEaXJlY3RvcnkiOiIvcm9vdC9ubmktZXhwZXJpbWVudHMiLCJ0dW5lciI6eyJuYW1lIjoiVFBFIiwiY2xhc3NBcmdzIjp7Im9wdGltaXplX21vZGUiOiJtYXhpbWl6ZSJ9fSwidHJhaW5pbmdTZXJ2aWNlIjp7InBsYXRmb3JtIjoibG9jYWwiLCJ0cmlhbENvbW1hbmQiOiJweXRob24zIG1uaXN0LnB5IiwidHJpYWxDb2RlRGlyZWN0b3J5IjoiL3Jvb3Qvbm5pL2V4YW1wbGVzL3RyaWFscy9tbmlzdC1weXRvcmNoIiwidHJpYWxHcHVOdW1iZXIiOjAsImRlYnVnIjpmYWxzZSwibWF4VHJpYWxOdW1iZXJQZXJHcHUiOjEsInJldXNlTW9kZSI6ZmFsc2V9fQ==
[2024-05-03 14:37:55] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-05-03 14:37:55] DEBUG (tuner_command_channel) Waiting connection...
[2024-05-03 14:37:55] Web portal URLs: http://127.0.0.1:8080 http://172.17.0.2:8080
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:37:57] DEBUG (WsChannelServer.tuner) Incoming connection __default__
[2024-05-03 14:37:57] DEBUG (WsChannel.__default__) Epoch 0 start
[2024-05-03 14:37:57] INFO (NNIManager) Add event listeners
[2024-05-03 14:37:57] DEBUG (NNIManager) Send tuner command: INITIALIZE: [object Object]
[2024-05-03 14:37:57] INFO (LocalV3.local) Start
[2024-05-03 14:37:57] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2024-05-03 14:37:57] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}
[2024-05-03 14:37:57] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 0,
hyperParameters: {
value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2024-05-03 14:37:58] INFO (LocalV3.local) Register directory trial_code = /root/nni/examples/trials/mnist-pytorch
[2024-05-03 14:37:58] INFO (LocalV3.local) Created trial wcvTY
[2024-05-03 14:38:00] INFO (LocalV3.local) Trial parameter: wcvTY {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}
[2024-05-03 14:38:05] DEBUG (NNIRestHandler) GET: /check-status: body: {}
...
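Incidentally, the long --exp_params blobs in the two dispatcher commands above are just base64-encoded JSON, so they can be decoded to confirm which settings actually reached the manager. A sketch with a shortened, hypothetical payload standing in for the real blob:

```python
import base64
import json

# Hypothetical miniature of the real --exp_params payload (the real blob
# carries the full experiment config).
payload = {"trialGpuNumber": 1, "trainingService": {"useActiveGpu": True}}
blob = base64.b64encode(json.dumps(payload).encode()).decode()

decoded = json.loads(base64.b64decode(blob))
print(decoded["trialGpuNumber"])                   # 1
print(decoded["trainingService"]["useActiveGpu"])  # True
```

Decoding the actual blobs confirms the GPU run did carry trialGpuNumber: 1 and useActiveGpu: true, while the CPU run carried trialGpuNumber: 0.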
How to reproduce it?
If using a Docker container:
docker build -t "nas-experiment" .
nvidia-docker run -it -p 8081:8081 nas-experiment
Then in both cases:
- Both outside and inside a Docker container, I modify nni/examples/trials/mnist-pytorch/config.yml to run the trials on GPU.
- Then I run the following command so I can watch the logs directly:
nnictl create --config nni/examples/trials/mnist-pytorch/config.yml --port 8081 --debug --foreground
As a result, the WebUI doesn't start: it times out trying to retrieve data, since the experiment won't load on GPU.
Notes
- I am readily available to answer questions and to receive help on this subject, as I am currently working on NAS.
- I am going to look into Archai and how it differs from NNI, until I can use the GPU for training here.
- I am using GCP instances for this work.
I made it work using a devcontainer with NNI version 2.7:
Dockerfile
FROM msranni/nni:v2.7
RUN pip install matplotlib tensorflow_datasets dill
Still having problems with version 3.0
I have a somewhat similar issue. My config:
authorName: default
experimentName: hyperparam searching
trialConcurrency: 1
trainingServicePlatform: local
useAnnotation: false
searchSpacePath: searching_space.json
tuner:
  builtinTunerName: Random
  classArgs:
    optimize_mode: minimize
trial:
  command: python train.py
  codeDir: .
When I run it this way, the code runs on the CPU. But when I run it with the following config:
authorName: default
experimentName: hyperparam searching
trialConcurrency: 1
trainingServicePlatform: local
useAnnotation: false
searchSpacePath: searching_space.json
tuner:
  builtinTunerName: Random
  classArgs:
    optimize_mode: minimize
trial:
  command: python train.py
  codeDir: .
  gpuNum: 1
localConfig:
  useActiveGpu: false
it creates 800+ Python files and the web portal link no longer opens. It either crashes my PC (because of all those files) or the portal shows 0 running trials. Why?
I am having the same problem as Rajesh90123.