Retinanet failed to launch on MLPerf Inference v5.0
I wanted to test my video card and ran into an error while building the tests. Previously (on older versions) this benchmark ran successfully.
The documentation I followed to run the benchmark: https://docs.mlcommons.org/inference/benchmarks/object_detection/retinanet/
The command:
mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
--model=retinanet \
--implementation=nvidia \
--framework=tensorrt \
--category=edge \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--docker --quiet \
--test_query_count=500
GPU: 1 x H100
OS: Ubuntu 24.04.2
[2025-05-29 06:54:41,007 retinanet_graphsurgeon.py:264 INFO] Adding NMS layer nmsopt to the graph...
/home/mlcuser/.local/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /home/cmuser/CM/repos/local/cache/ac4a8632ea8a437d/pytorch/aten/src/ATen/native/TensorShape.cpp:3516.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Traceback (most recent call last):
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/actionhandler/base.py", line 78, in run
success = self.handle()
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/actionhandler/calibrate.py", line 62, in handle
b.calibrate()
File "/home/mlcuser/.local/lib/python3.8/site-packages/nvmitten/nvidia/builder.py", line 536, in calibrate
self.mitten_builder.run(self.legacy_scratch, None)
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/retinanet/tensorrt/Retinanet.py", line 379, in run
network = self.create_network(self.builder, subnetwork_name=subnet_name)
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/retinanet/tensorrt/Retinanet.py", line 223, in create_network
success = parser.parse(onnx._serialize(model))
AttributeError: module 'onnx' has no attribute '_serialize'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/main.py", line 231, in <module>
main(main_args, DETECTED_SYSTEM)
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/main.py", line 144, in main
dispatch_action(main_args, config_dict, workload_setting)
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/main.py", line 202, in dispatch_action
handler.run()
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/actionhandler/base.py", line 82, in run
self.handle_failure()
File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/actionhandler/calibrate.py", line 68, in handle_failure
raise RuntimeError("Calibration failed!")
RuntimeError: Calibration failed!
make: *** [Makefile:123: calibrate] Error 1
Traceback (most recent call last):
File "/home/mlcuser/.local/bin/mlcr", line 8, in <module>
sys.exit(mlcr())
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/main.py", line 86, in mlcr
main()
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/main.py", line 273, in main
res = method(run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
r = self._run(i)
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1857, in _run
r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3318, in _call_run_deps
r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
r = self.action_object.access(ii)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
result = method(options)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
r = self._run(i)
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1642, in _run
r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3318, in _call_run_deps
r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
r = self.action_object.access(ii)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
result = method(options)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
r = self._run(i)
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1642, in _run
r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3318, in _call_run_deps
r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
r = self.action_object.access(ii)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
result = method(options)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 243, in call_script_module_function
raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.script_action.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = app-mlperf-inference-nvidia, return code = 256)
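For reference, the first traceback ends in "AttributeError: module 'onnx' has no attribute '_serialize'", which points to an onnx version mismatch inside the container: the private helper onnx._serialize that NVIDIA's Retinanet.py calls is not present in the installed onnx package. A minimal compatibility shim, purely as a sketch and assuming the parser only needs the serialized protobuf bytes, could be patched in before the build step:

import onnx

# Sketch of a workaround (not an official fix): if the installed onnx no longer
# exposes the private _serialize helper, fall back to the public protobuf API.
if not hasattr(onnx, "_serialize"):
    def _serialize(proto):
        # Accept bytes that are already serialized, or a protobuf message
        # such as ModelProto, which provides SerializeToString().
        if isinstance(proto, bytes):
            return proto
        return proto.SerializeToString()
    onnx._serialize = _serialize

Pinning onnx to the version the NVIDIA harness was built against, or pulling the updated automation recipes as suggested below, is likely the cleaner route.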
@arjunsuresh to help^
@arjunsuresh please help
Hi @Agalakdak, can you please do mlc pull repo and retry the command? If you are already inside the docker container, you can run it there directly.
Hi @arjunsuresh! I ran "mlc pull repo" after the error message and it executed without problems. Then I entered the command
mlcr run-mlperf,inference,_full,_r5.0-dev \
--model=retinanet \
--implementation=nvidia \
--framework=tensorrt \
--category=edge \
--scenario=Offline \
--execution_mode=valid \
--device=cuda \
--quiet
I received another error message:
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
r = self._run(i)
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1783, in _run
r = customize_code.preprocess(ii)
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 285, in preprocess
r = mlc.access(ii)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
result = method(options)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
r = self._run(i)
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1857, in _run
r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3318, in _call_run_deps
r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
r = self.action_object.access(ii)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
result = method(options)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
r = self._run(i)
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1901, in _run
r = self._run_deps(post_deps, clean_env_keys_post_deps, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
r = self.action_object.access(ii)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
result = method(options)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
r = self._run(i)
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1901, in _run
r = self._run_deps(post_deps, clean_env_keys_post_deps, env, state, const, const_state, add_deps_recursive, recursion_spaces,
File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
r = self.action_object.access(ii)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
result = method(options)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
return self.call_script_module_function("run", run_args)
File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 243, in call_script_module_function
raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.script_action.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = benchmark-program, return code = 512)
Hi @Agalakdak, the workflow runs fine for us when tested. It could be a segmentation fault, since there is no additional information in the error.
Could you please share mlc logs before the error arises?
"Previously (on older versions) this benchmark was running successfully."
Could you confirm whether it is the same machine (1xH100)?
Hi @anandhu-eng! Sorry for the late reply; I couldn't respond for a while for various reasons.
"Could you confirm whether it is the same machine (1xH100)?"
- Yes it is
In short, I used the same commands and performed the same actions as in the previous post. I have reproduced the situation and am attaching the full log in the file full_log.txt.
P.S. I will try to reproduce this situation on other hardware (GPU P5000)
@Agalakdak, was the Docker image generated on the same system with the H100? We have seen similar issues when the GPU is swapped; in that case, rebuilding the Docker container made it work.
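If a stale image is suspected, one way to force a clean rebuild (the container and image IDs below are placeholders; take the real ones from docker ps -a and docker images on your host) is roughly:

# remove the stale MLPerf container and image, then re-run mlcr with --docker
docker ps -a
docker stop <old_container_id> && docker rm <old_container_id>
docker images
docker rmi <old_image_id>

Re-running the mlcr command with --docker afterwards should rebuild the image against the GPU currently in the system.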
@arjunsuresh "was the docker generated on the same system with H100"
- Yes, the docker is generated on the same system. But I think I understand what's going on.
The system currently boots via PXE from a pre-configured image that already includes the drivers. Perhaps my image was originally configured for a different GPU. I'll try to install everything from scratch and report back in a few hours.
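A quick diagnostic before reinstalling (just a sketch; it assumes torch is available in the environment being checked) is to confirm that the driver in the image actually sees this GPU, both on the host and inside the container:

nvidia-smi
python3 -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CUDA not available')"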
@arjunsuresh @anandhu-eng
"Could you confirm whether it is the same machine (1xH100)?"
- Sorry for my previous answer; I just realized what you meant. This is my first time testing these benchmarks on an H100 GPU.
I continued my experiments, but they did not lead to a good result. There were errors with CUDA, then something else. I am attaching the last log below.
Hardware: the same
OS: Ubuntu 24.04
GPU: H100
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe Off | 00000000:98:00.0 Off | 0 |
| N/A 23C P0 45W / 350W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+