Failed to build the Docker image for the mlc command on Ubuntu 22.04
I failed to run the following mlc command on Ubuntu 22.04:

mlcr run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=5000 \
   --all_models=yes
The build failed to resolve 'developer.download.nvidia.com', as shown below, even though I can reach developer.download.nvidia.com from this host manually via Firefox (a connectivity check is sketched after the log).
212.3 The following NEW packages will be installed:
212.3 libcublas-12-3 libcublas-dev-12-3
242.8 0 upgraded, 2 newly installed, 0 to remove and 87 not upgraded.
242.8 Need to get 514 MB of archives.
242.8 After this operation, 1577 MB of additional disk space will be used.
242.8 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-12-3 12.3.4.1-1
262.8 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-dev-12-3 12.3.4.1-1
282.9 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-12-3 12.3.4.1-1
282.9 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-dev-12-3 12.3.4.1-1
284.9 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-12-3 12.3.4.1-1
284.9 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-dev-12-3 12.3.4.1-1
308.9 Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-12-3 12.3.4.1-1
308.9 Could not connect to developer.download.nvidia.com:443 (23.223.211.90), connection timed out Could not connect to developer.download.nvidia.com:443 (23.223.211.42), connection timed out
328.9 Err:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-dev-12-3 12.3.4.1-1
328.9 Temporary failure resolving 'developer.download.nvidia.com'
328.9 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/./libcublas-12-3_12.3.4.1-1_amd64.deb Could not connect to developer.download.nvidia.com:443 (23.223.211.90), connection timed out Could not connect to developer.download.nvidia.com:443 (23.223.211.42), connection timed out
328.9 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/./libcublas-dev-12-3_12.3.4.1-1_amd64.deb Temporary failure resolving 'developer.download.nvidia.com'
328.9 E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
------
4 warnings found (use docker --debug to expand):
- FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 6)
- FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 14)
- FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 53)
- FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 66)
Dockerfile.multi:32
--------------------
31 | COPY docker/common/install_tensorrt.sh install_tensorrt.sh
32 | >>> RUN bash ./install_tensorrt.sh \
33 | >>> --TRT_VER=${TRT_VER} \
34 | >>> --CUDA_VER=${CUDA_VER} \
35 | >>> --CUDNN_VER=${CUDNN_VER} \
36 | >>> --NCCL_VER=${NCCL_VER} \
37 | >>> --CUBLAS_VER=${CUBLAS_VER} && \
38 | >>> rm install_tensorrt.sh
39 |
--------------------
ERROR: failed to solve: process "/bin/bash -c bash ./install_tensorrt.sh --TRT_VER=${TRT_VER} --CUDA_VER=${CUDA_VER} --CUDNN_VER=${CUDNN_VER} --NCCL_VER=${NCCL_VER} --CUBLAS_VER=${CUBLAS_VER} && rm install_tensorrt.sh" did not complete successfully: exit code: 100
exit status 1
make: *** [Makefile:55: devel_build] Error 1
make: Leaving directory '/home/bob2/MLC/repos/local/cache/get-git-repo_d790359e/repo/docker'
Traceback (most recent call last):
File "/home/bob2/mlc/bin/mlcr", line 8, in <module>
sys.exit(mlcr())
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1715, in mlcr
main()
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1797, in main
res = method(run_args)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1529, in run
return self.call_script_module_function("run", run_args)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1509, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
r = self._run(i)
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1768, in _run
r = customize_code.preprocess(ii)
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 284, in preprocess
r = mlc.access(ii)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 92, in access
result = method(options)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1526, in docker
return self.call_script_module_function("docker", run_args)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1511, in call_script_module_function
result = automation_instance.docker(run_args) # Pass args to the run method
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 4684, in docker
return docker_run(self, i)
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/docker.py", line 308, in docker_run
r = self_module._run_deps(
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3695, in _run_deps
r = self.action_object.access(ii)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 92, in access
result = method(options)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1529, in run
return self.call_script_module_function("run", run_args)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1519, in call_script_module_function
raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.main.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = get-ml-model-gptj, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please file an issue at https://github.com/mlcommons/mlperf-automations/issues along with the full MLC command being run and the relevant
or full console log.
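For reference, here is a quick way to check whether name resolution works from inside a container rather than from the host. This is a minimal sketch; the ubuntu:22.04 image and the daemon.json change are my assumptions, not part of the MLC flow:

# Check DNS resolution from inside a fresh container:
docker run --rm ubuntu:22.04 getent hosts developer.download.nvidia.com
# If resolution fails only inside containers, pointing the Docker daemon at a
# public resolver sometimes helps (note: this overwrites any existing daemon.json):
echo '{ "dns": ["8.8.8.8", "1.1.1.1"] }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker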
Can you please try the same command with --docker_cache=no?
OK, I will try it and share the result later. Thanks.
Some error prompts appear, as shown below, but the build is still progressing:
...... ...... ......
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script ammo-wf-exec is installed in '/home/bob2/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script evaluate-cli is installed in '/home/bob2/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cuda 23.10.0 requires pynvml<11.5,>=11.0.0, but you have pynvml 12.0.0 which is incompatible.
Successfully installed accelerate-0.25.0 bandit-1.7.7 build-1.2.2.post1 cfgv-3.4.0 colored-2.3.0 coloredlogs-15.0.1 coverage-7.6.12 datasets-3.3.2 diffusers-0.15.0 dill-0.3.8 distlib-0.3.9 evaluate-0.4.3 flatbuffers-25.2.10 graphviz-0.20.3 huggingface-hub-0.29.1 humanfriendly-10.0 identify-2.6.8 janus-2.0.0 lark-1.2.2 multiprocess-0.70.16 mypy-1.15.0 mypy_extensions-1.0.0 nltk-3.9.1 nodeenv-1.9.1 nvidia-ammo-0.7.4 nvidia-ml-py-12.570.86 onnx-graphsurgeon-0.5.5 onnxruntime-1.16.3 optimum-1.24.0 parameterized-0.9.0 pbr-6.1.1 pre-commit-4.1.0 py-1.11.0 pyarrow-19.0.1 pybind11-stubgen-2.5.3 pynvml-12.0.0 pyproject_hooks-1.2.0 pytest-cov-6.0.0 pytest-forked-1.6.0 requests-2.32.3 rouge_score-0.1.2 safetensors-0.5.2 sentencepiece-0.2.0 stevedore-5.4.1 tokenizers-0.15.2 tqdm-4.67.1 transformers-4.36.1 virtualenv-20.29.2 xxhash-3.5.0
[notice] A new release of pip is available: 23.3.1 -> 25.0.1
[notice] To update, run: python3 -m pip install --upgrade pip
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
...... ...... ......
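The PATH warnings above are harmless for the build itself; if the installed scripts are needed later, the directory can be added to PATH (a generic fix, not specific to MLC):

# Make pip-installed user scripts such as evaluate-cli visible on PATH:
export PATH="$HOME/.local/bin:$PATH"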
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int4b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs2Int4b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int8b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int8b.cu.o
[ 98%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int4b.cu.o
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
^
1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_fp16.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12932: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_fp16.cu.o] Error 2
gmake[3]: *** Waiting for unfinished jobs....
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
^
1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_bf16.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12917: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_bf16.cu.o] Error 2
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
^
1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_int32.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12962: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_int32.cu.o] Error 2
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
^
1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_fp32.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12947: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_fp32.cu.o] Error 2
[ 98%] Built target layers_src
[ 98%] Built target common_src
[ 98%] Built target runtime_src
The build quit after the above errors occurred. The command I ran was:
mlcr run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=5000 \
   --all_models=yes \
   --docker_cache=no
Oh, which GPU are you running on?
Oh, I made the same mistake again: I mixed GPUs of different models in the same machine.
Thank you @arjunsuresh, I will remove one and try again later.
@arjunsuresh Unfortunately, the exact same error as above happened again and the Docker build failed.
Please see the log here: