
Retinanet failed to launch on MLPerf Inference v5.0

[Open] Agalakdak opened this issue 7 months ago · 9 comments

I wanted to test my video card and unexpectedly ran into an error while building the tests. Previously (on older versions) this benchmark ran successfully.

The site I used to run the benchmark: https://docs.mlcommons.org/inference/benchmarks/object_detection/retinanet/

The command:

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=retinanet \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=500

GPU: 1 x H100
OS: Ubuntu 24.04.2

[2025-05-29 06:54:41,007 retinanet_graphsurgeon.py:264 INFO] Adding NMS layer nmsopt to the graph...
/home/mlcuser/.local/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /home/cmuser/CM/repos/local/cache/ac4a8632ea8a437d/pytorch/aten/src/ATen/native/TensorShape.cpp:3516.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/actionhandler/base.py", line 78, in run
    success = self.handle()
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/actionhandler/calibrate.py", line 62, in handle
    b.calibrate()
  File "/home/mlcuser/.local/lib/python3.8/site-packages/nvmitten/nvidia/builder.py", line 536, in calibrate
    self.mitten_builder.run(self.legacy_scratch, None)
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/retinanet/tensorrt/Retinanet.py", line 379, in run
    network = self.create_network(self.builder, subnetwork_name=subnet_name)
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/retinanet/tensorrt/Retinanet.py", line 223, in create_network
    success = parser.parse(onnx._serialize(model))
AttributeError: module 'onnx' has no attribute '_serialize'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/main.py", line 231, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/main.py", line 144, in main
    dispatch_action(main_args, config_dict, workload_setting)
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/main.py", line 202, in dispatch_action
    handler.run()
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/actionhandler/base.py", line 82, in run
    self.handle_failure()
  File "/home/mlcuser/MLC/repos/local/cache/get-git-repo_mlperf-inferenc_3505ed3d/repo/closed/NVIDIA/code/actionhandler/calibrate.py", line 68, in handle_failure
    raise RuntimeError("Calibration failed!")
RuntimeError: Calibration failed!
make: *** [Makefile:123: calibrate] Error 1
Traceback (most recent call last):
  File "/home/mlcuser/.local/bin/mlcr", line 8, in <module>
    sys.exit(mlcr())
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/main.py", line 86, in mlcr
    main()
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/main.py", line 273, in main
    res = method(run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
    r = self._run(i)
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1857, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3318, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
    r = self.action_object.access(ii)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
    result = method(options)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
    r = self._run(i)
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1642, in _run
    r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3318, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
    r = self.action_object.access(ii)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
    result = method(options)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
    r = self._run(i)
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1642, in _run
    r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3318, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
    r = self.action_object.access(ii)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
    result = method(options)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 243, in call_script_module_function
    raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.script_action.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = app-mlperf-inference-nvidia, return code = 256)
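
The root cause in the log above is the `AttributeError`: `onnx._serialize` was a private helper that newer onnx releases no longer expose, so the harness breaks when the container picks up a too-new onnx. A minimal compatibility sketch (the function name `serialize_onnx_model` is my own, not the harness's code) that falls back to the public protobuf API:

```python
def serialize_onnx_model(onnx_module, model):
    """Serialize an ONNX ModelProto across onnx versions.

    Older onnx releases shipped a private helper `_serialize`; newer
    ones removed it. Every ModelProto is a protobuf message, so
    SerializeToString() works regardless of the onnx version installed.
    """
    private_serialize = getattr(onnx_module, "_serialize", None)
    if private_serialize is not None:
        # Old onnx: keep the original code path.
        return private_serialize(model)
    # New onnx: use the public protobuf API instead.
    return model.SerializeToString()
```

In practice the simpler fix is usually to pin onnx to the version the NVIDIA harness expects rather than patching Retinanet.py.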

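As an aside, the torch.meshgrid UserWarning near the top of the log is only a deprecation notice, not the failure: newer PyTorch asks callers to pass an explicit indexing= argument. A pure-Python sketch (my own toy function, not torch code) of what the two indexing modes mean:

```python
def meshgrid(xs, ys, indexing="ij"):
    """Toy illustration of torch.meshgrid's two indexing modes.

    'ij' (matrix) indexing: outputs have shape (len(xs), len(ys)).
    'xy' (Cartesian) indexing: outputs have shape (len(ys), len(xs)),
    i.e. the transpose of 'ij'. torch historically defaulted to 'ij',
    hence the warning asking callers to state the mode explicitly.
    """
    X = [[x for _ in ys] for x in xs]
    Y = [[y for y in ys] for _ in xs]
    if indexing == "xy":
        # Cartesian mode is the transpose of matrix mode.
        X = [list(row) for row in zip(*X)]
        Y = [list(row) for row in zip(*Y)]
    return X, Y
```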
Agalakdak · May 29 '25 14:05

@arjunsuresh to help^

nvzhihanj · May 30 '25 20:05

@arjunsuresh please help

Agalakdak · Jun 02 '25 04:06

Hi @Agalakdak, can you please do "mlc pull repo" and retry the command? If you are already inside the Docker container, you can run it from inside the container.

arjunsuresh · Jun 02 '25 13:06

Hi @arjunsuresh! I tried running "mlc pull repo" after the error message, and it executed without problems. Then I entered the command

mlcr run-mlperf,inference,_full,_r5.0-dev \
   --model=retinanet \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

I received another error message:

  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
    r = self._run(i)
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1783, in _run
    r = customize_code.preprocess(ii)
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 285, in preprocess
    r = mlc.access(ii)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
    result = method(options)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
    r = self._run(i)
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1857, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3318, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
    r = self.action_object.access(ii)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
    result = method(options)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
    r = self._run(i)
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1901, in _run
    r = self._run_deps(post_deps, clean_env_keys_post_deps, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
    r = self.action_object.access(ii)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
    result = method(options)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 231, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 226, in run
    r = self._run(i)
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1901, in _run
    r = self._run_deps(post_deps, clean_env_keys_post_deps, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3491, in _run_deps
    r = self.action_object.access(ii)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/action.py", line 57, in access
    result = method(options)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 307, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/.local/lib/python3.8/site-packages/mlc/script_action.py", line 243, in call_script_module_function
    raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.script_action.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = benchmark-program, return code = 512)

Agalakdak · Jun 03 '25 10:06

Hi @Agalakdak, the workflow ran fine for us when tested. It is likely a segmentation fault, since the error carries no additional information.

Could you please share the MLC logs from just before the error arises?

"Previously (on older versions) this benchmark was running successfully."

Could you confirm whether it is the same machine (1xH100)?

anandhu-eng · Jun 03 '25 14:06

Hi @anandhu-eng! Sorry for the late reply; I was unable to respond for a while for various reasons.

"Could you confirm whether it is the same machine (1xH100)?"

  • Yes it is

In short, I used the same commands and performed the same actions as in the previous post. I have reproduced the situation and am attaching the full log: full_log.txt

P.S. I will try to reproduce this situation on other hardware (GPU P5000)

Agalakdak · Jun 05 '25 08:06

@Agalakdak, was the Docker image generated on the same system with the H100? We have seen similar issues when the GPU is swapped; in that case, rebuilding the Docker container made it work.

arjunsuresh · Jun 05 '25 18:06

@arjunsuresh "was the docker generated on the same system with H100"

  • Yes, the Docker image is generated on the same system. But I think I understand what's going on.

The system currently boots via PXE from a pre-configured image that already includes drivers. Perhaps my image was originally configured with a different GPU. I'll try to install everything from scratch and report back in a few hours.

Agalakdak · Jun 06 '25 06:06

@arjunsuresh @anandhu-eng

"Could you confirm whether it is the same machine (1xH100)?"

Sorry for my previous answer; I just realized what you meant. This is my first time testing these benchmarks on an H100 GPU.

I continued my experiments, and they did not lead to a good result. There were errors with CUDA, then something else. I attach the latest log below.

full_log_2.txt

Hardware: the same
OS: Ubuntu 24.04
GPU: H100

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               Off |   00000000:98:00.0 Off |                    0 |
| N/A   23C    P0             45W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Agalakdak · Jun 06 '25 12:06