Add error message when GPU is not available

Open szalpal opened this issue 1 year ago • 3 comments

Category:

Other (e.g. Documentation, Tests, Configuration)

Description:

Currently, when a DALI pipeline is created in Triton but the user forgets to pass the --gpus flag to the run command, they get an obscure error message:

dlopen libcuda.so failed!. Please install GPU dirverTraceback (most recent call last):
  File "<string>", line 8, in <module>
  File "/opt/tritonserver/backends/dali/conda/envs/dalienv/lib/python3.10/site-packages/nvidia/dali/_utils/autoserialize.py", line 77, in invoke_autoserialize
    dali_pipeline().serialize(filename=filename)
  File "/opt/tritonserver/backends/dali/conda/envs/dalienv/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1261, in serialize
    self._init_pipeline_backend()
  File "/opt/tritonserver/backends/dali/conda/envs/dalienv/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 725, in _init_pipeline_backend
    self._pipe = b.Pipeline(self._max_batch_size,
RuntimeError: [/opt/dali/dali/core/device_guard.cc:31] Assert on "cuInitChecked()" failed: Failed to load libcuda.so. Check your library paths and if the driver is installed correctly.
Stacktrace (31 entries):
[frame 0]: /opt/tritonserver/backends/dali/conda/envs/dalienv/lib/python3.10/site-packages/nvidia/dali/libdali_core.so(+0x233fb) [0x7fed69bb13fb]
[frame 1]: /opt/tritonserver/backends/dali/conda/envs/dalienv/lib/python3.10/site-packages/nvidia/dali/libdali_core.so(dali::DeviceGuard::DeviceGuard(int)+0x1a8) [0x7fed69bd4548]
[frame 2]: /opt/tritonserver/backends/dali/conda/envs/dalienv/lib/python3.10/site-packages/nvidia/dali/libdali.so(dali::Pipeline::Init(int, int, int, long, bool, bool, bool, unsigned long, bool, int, int, dali::QueueSizes)+0x50) [0x7fed6f81a620]
[frame 3]: /opt/tritonserver/backends/dali/conda/envs/dalienv/lib/python3.10/site-packages/nvidia/dali/backend_impl.cpython-310-x86_64-linux-gnu.so(dali::Pipeline::Pipeline(int, int, int, long, bool, int, bool, unsigned long, bool, int, int)+0x363) [0x7fed64c05a53]

This PR introduces a more descriptive error message.
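For illustration only, here is a minimal Python sketch of the kind of early check that could produce a clearer message. It is not the change made in this PR: `GpuUnavailableError`, `ensure_cuda_driver`, and the `loader` parameter are hypothetical names introduced for this example, and the wording of the message is an assumption. The sketch only relies on `ctypes.CDLL` and the CUDA driver API function `cuInit`, which returns 0 on success.

```python
import ctypes


class GpuUnavailableError(RuntimeError):
    """Hypothetical error type for this sketch; not part of DALI."""


def ensure_cuda_driver(lib_name="libcuda.so.1", loader=ctypes.CDLL):
    """Raise a descriptive error if the CUDA driver cannot be loaded or initialized.

    `loader` is injectable so the check can be exercised without a GPU.
    """
    try:
        libcuda = loader(lib_name)
    except OSError as exc:
        # Same root cause as the "Failed to load libcuda.so" assert above,
        # but reported with a hint about the most likely mistake.
        raise GpuUnavailableError(
            f"Could not load {lib_name}: no GPU driver is visible to this process. "
            "If you are running in a container, check that the --gpus flag "
            "was passed to the run command."
        ) from exc

    status = libcuda.cuInit(0)  # CUDA driver API; 0 means CUDA_SUCCESS
    if status != 0:
        raise GpuUnavailableError(
            f"The CUDA driver was loaded but cuInit failed with error code {status}. "
            "Check that a GPU is available and the driver is installed correctly."
        )
```

Calling `ensure_cuda_driver()` before building the pipeline would surface the hint at the top of the output instead of inside a long native stack trace.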

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • [ ] Existing tests apply
  • [ ] New tests added
    • [ ] Python tests
    • [ ] GTests
    • [ ] Benchmark
    • [ ] Other
  • [ ] N/A

Checklist

Documentation

  • [ ] Existing documentation applies
  • [ ] Documentation updated
    • [ ] Docstring
    • [ ] Doxygen
    • [ ] RST
    • [ ] Jupyter
    • [ ] Other
  • [ ] N/A

DALI team only

Requirements

  • [ ] Implements new requirements
  • [ ] Affects existing requirements
  • [ ] N/A

REQ IDs: N/A

JIRA TASK: N/A

Feb 19 '24 11:02 szalpal

!build

Feb 19 '24 12:02 szalpal

CI MESSAGE: [12924987]: BUILD STARTED

Feb 19 '24 12:02 dali-automaton

CI MESSAGE: [12924987]: BUILD FAILED

Feb 19 '24 15:02 dali-automaton

!build

Feb 22 '24 12:02 szalpal

!build

Feb 22 '24 12:02 szalpal

CI MESSAGE: [13002156]: BUILD STARTED

Feb 22 '24 13:02 dali-automaton

CI MESSAGE: [13002210]: BUILD STARTED

Feb 22 '24 13:02 dali-automaton

CI MESSAGE: [13002156]: BUILD PASSED

Feb 22 '24 15:02 dali-automaton

CI MESSAGE: [13002210]: BUILD FAILED

Feb 22 '24 15:02 dali-automaton

CI MESSAGE: [13002210]: BUILD PASSED

Feb 23 '24 11:02 dali-automaton