Failed to build the Docker image for the mlc command on Ubuntu 22.04
I failed to run the following mlc command on Ubuntu 22.04:

mlcr run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=5000 \
   --all_models=yes
The build failed to resolve 'developer.download.nvidia.com', as shown below, even though I can reach developer.download.nvidia.com from this host manually via Firefox (a connectivity check is sketched after the log).
212.3 The following NEW packages will be installed:
212.3 libcublas-12-3 libcublas-dev-12-3
242.8 0 upgraded, 2 newly installed, 0 to remove and 87 not upgraded.
242.8 Need to get 514 MB of archives.
242.8 After this operation, 1577 MB of additional disk space will be used.
242.8 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-12-3 12.3.4.1-1
262.8 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-dev-12-3 12.3.4.1-1
282.9 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-12-3 12.3.4.1-1
282.9 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-dev-12-3 12.3.4.1-1
284.9 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-12-3 12.3.4.1-1
284.9 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-dev-12-3 12.3.4.1-1
308.9 Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-12-3 12.3.4.1-1
308.9 Could not connect to developer.download.nvidia.com:443 (23.223.211.90), connection timed out Could not connect to developer.download.nvidia.com:443 (23.223.211.42), connection timed out
328.9 Err:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 libcublas-dev-12-3 12.3.4.1-1
328.9 Temporary failure resolving 'developer.download.nvidia.com'
328.9 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/./libcublas-12-3_12.3.4.1-1_amd64.deb Could not connect to developer.download.nvidia.com:443 (23.223.211.90), connection timed out Could not connect to developer.download.nvidia.com:443 (23.223.211.42), connection timed out
328.9 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/./libcublas-dev-12-3_12.3.4.1-1_amd64.deb Temporary failure resolving 'developer.download.nvidia.com'
328.9 E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
------
4 warnings found (use docker --debug to expand):
- FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 6)
- FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 14)
- FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 53)
- FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 66)
Dockerfile.multi:32
--------------------
31 | COPY docker/common/install_tensorrt.sh install_tensorrt.sh
32 | >>> RUN bash ./install_tensorrt.sh \
33 | >>> --TRT_VER=${TRT_VER} \
34 | >>> --CUDA_VER=${CUDA_VER} \
35 | >>> --CUDNN_VER=${CUDNN_VER} \
36 | >>> --NCCL_VER=${NCCL_VER} \
37 | >>> --CUBLAS_VER=${CUBLAS_VER} && \
38 | >>> rm install_tensorrt.sh
39 |
--------------------
ERROR: failed to solve: process "/bin/bash -c bash ./install_tensorrt.sh --TRT_VER=${TRT_VER} --CUDA_VER=${CUDA_VER} --CUDNN_VER=${CUDNN_VER} --NCCL_VER=${NCCL_VER} --CUBLAS_VER=${CUBLAS_VER} && rm install_tensorrt.sh" did not complete successfully: exit code: 100
exit status 1
make: *** [Makefile:55: devel_build] Error 1
make: Leaving directory '/home/bob2/MLC/repos/local/cache/get-git-repo_d790359e/repo/docker'
Traceback (most recent call last):
File "/home/bob2/mlc/bin/mlcr", line 8, in <module>
sys.exit(mlcr())
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1715, in mlcr
main()
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1797, in main
res = method(run_args)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1529, in run
return self.call_script_module_function("run", run_args)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1509, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
r = self._run(i)
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1768, in _run
r = customize_code.preprocess(ii)
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 284, in preprocess
r = mlc.access(ii)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 92, in access
result = method(options)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1526, in docker
return self.call_script_module_function("docker", run_args)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1511, in call_script_module_function
result = automation_instance.docker(run_args) # Pass args to the run method
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 4684, in docker
return docker_run(self, i)
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/docker.py", line 308, in docker_run
r = self_module._run_deps(
File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3695, in _run_deps
r = self.action_object.access(ii)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 92, in access
result = method(options)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1529, in run
return self.call_script_module_function("run", run_args)
File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1519, in call_script_module_function
raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.main.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = get-ml-model-gptj, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please file an issue at https://github.com/mlcommons/mlperf-automations/issues along with the full MLC command being run and the relevant
or full console log.
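For reference, here is a quick way to check whether name resolution works from inside a container rather than from the host. This is a minimal sketch; the ubuntu:22.04 image and the daemon.json change are my assumptions, not part of the MLC flow:

# Check DNS resolution from inside a fresh container:
docker run --rm ubuntu:22.04 getent hosts developer.download.nvidia.com
# If resolution fails only inside containers, pointing the Docker daemon at a
# public resolver sometimes helps (note: this overwrites any existing daemon.json):
echo '{ "dns": ["8.8.8.8", "1.1.1.1"] }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker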
Can you please try the same command with --docker_cache=no?
OK, I will try it and share the result later. Thanks.
Some error prompts appear, as shown below, but the build is still progressing:
...... ...... ......
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script ammo-wf-exec is installed in '/home/bob2/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script evaluate-cli is installed in '/home/bob2/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cuda 23.10.0 requires pynvml<11.5,>=11.0.0, but you have pynvml 12.0.0 which is incompatible.
Successfully installed accelerate-0.25.0 bandit-1.7.7 build-1.2.2.post1 cfgv-3.4.0 colored-2.3.0 coloredlogs-15.0.1 coverage-7.6.12 datasets-3.3.2 diffusers-0.15.0 dill-0.3.8 distlib-0.3.9 evaluate-0.4.3 flatbuffers-25.2.10 graphviz-0.20.3 huggingface-hub-0.29.1 humanfriendly-10.0 identify-2.6.8 janus-2.0.0 lark-1.2.2 multiprocess-0.70.16 mypy-1.15.0 mypy_extensions-1.0.0 nltk-3.9.1 nodeenv-1.9.1 nvidia-ammo-0.7.4 nvidia-ml-py-12.570.86 onnx-graphsurgeon-0.5.5 onnxruntime-1.16.3 optimum-1.24.0 parameterized-0.9.0 pbr-6.1.1 pre-commit-4.1.0 py-1.11.0 pyarrow-19.0.1 pybind11-stubgen-2.5.3 pynvml-12.0.0 pyproject_hooks-1.2.0 pytest-cov-6.0.0 pytest-forked-1.6.0 requests-2.32.3 rouge_score-0.1.2 safetensors-0.5.2 sentencepiece-0.2.0 stevedore-5.4.1 tokenizers-0.15.2 tqdm-4.67.1 transformers-4.36.1 virtualenv-20.29.2 xxhash-3.5.0
[notice] A new release of pip is available: 23.3.1 -> 25.0.1
[notice] To update, run: python3 -m pip install --upgrade pip
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
...... ...... ......
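The PATH warnings above are harmless for the build itself; if the installed scripts are needed later, the directory can be added to PATH (a generic fix, not specific to MLC):

# Make pip-installed user scripts such as evaluate-cli visible on PATH:
export PATH="$HOME/.local/bin:$PATH"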
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int4b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs2Int4b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int8b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int8b.cu.o
[ 98%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int4b.cu.o
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
^
1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_fp16.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12932: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_fp16.cu.o] Error 2
gmake[3]: *** Waiting for unfinished jobs....
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
^
1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_bf16.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12917: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_bf16.cu.o] Error 2
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
^
1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_int32.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12962: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_int32.cu.o] Error 2
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
^
1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_fp32.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12947: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_fp32.cu.o] Error 2
[ 98%] Built target layers_src
[ 98%] Built target common_src
[ 98%] Built target runtime_src
The build quit after the above errors occurred. The command I ran was:
mlcr run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=5000 \
   --all_models=yes \
   --docker_cache=no
Oh, which GPU are you running on?
Oh, I made the same mistake again: I mixed GPUs of different models in the same machine.
Thank you @arjunsuresh, I will remove one and try again later.
@arjunsuresh Unfortunately, the exact same error as above happened again and the Docker build failed.
Please see the log here: