[BUG]: Randomly recurring test_cufile.py::test_get_stats_l3 Segmentation faults

Open rwgk opened this issue 2 months ago • 1 comments

Is this a duplicate?

  • [x] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Runtime Error

Component

cuda.bindings

Describe the bug

In routine testing on a bare-metal (NOT WSL) Ubuntu 24.04 linux-64 workstation, I'm seeing randomly recurring segmentation faults in test_cufile.py::test_get_stats_l3, e.g.:

smc120-0004.ipp2a2.colossus.nvidia.com:/wrk/logs $ grep -a Segmentation *
qa_bindings_linux_2025-12-04+171342_tests_log.txt:Fatal Python error: Segmentation fault
qa_bindings_linux_2025-12-04+171342_tests_log.txt:../ctk-next/qa/13.1.0/qa_bindings_linux_tests.sh: line 60:  5457 Segmentation fault      (core dumped) python -m pytest -ra -s -vv tests/
qa_bindings_linux_2025-12-05+214218_tests_log.txt:Fatal Python error: Segmentation fault
qa_bindings_linux_2025-12-05+214218_tests_log.txt:../ctk-next/qa/13.1.0/qa_bindings_linux_tests.sh: line 61: 48514 Segmentation fault      (core dumped) CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1 python -m pytest -ra -s -vv tests/
qa_bindings_linux_2025-12-06+224850_tests_log.txt:Fatal Python error: Segmentation fault
qa_bindings_linux_2025-12-06+224850_tests_log.txt:../ctk-next/qa/13.1.0/qa_bindings_linux_tests.sh: line 60: 84340 Segmentation fault      (core dumped) python -m pytest -ra -s -vv tests/

I'm attaching one of the log files. Please see there for details.

qa_bindings_linux_2025-12-06+224850_tests_log.txt

How to Reproduce

See commands in attached log file. Essentially:

cd cuda_bindings/
pip install ...
pytest -ra -s -v tests/
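
Because the crash is intermittent, a single run says little. Below is a minimal sketch of a repeat-run harness, not the script used for the logs in this issue: it runs a command several times and classifies each exit status. The helper names `describe_exit` and `run_trials` are hypothetical. It relies on the fact that `subprocess` reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Map a subprocess return code to a short label.

    On POSIX, a negative return code -N means the child was killed by signal N
    (e.g. SIGSEGV for a segmentation fault, SIGFPE for a floating point exception).
    """
    if returncode < 0:
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return "passed" if returncode == 0 else f"exit code {returncode}"


def run_trials(n, cmd):
    """Run `cmd` n times, discarding its output, and return one label per run."""
    labels = []
    for _ in range(n):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        labels.append(describe_exit(proc.returncode))
    return labels
```

Run from `cuda_bindings/`, a call such as `run_trials(100, [sys.executable, "-m", "pytest", "-ra", "-s", "-vv", "tests/"])` followed by `collections.Counter(...)` on the result would give a rough flake count. The trial count of 100 is an arbitrary choice for illustration.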

rwgk avatar Dec 07 '25 18:12 rwgk

@sourabgupta3

rwgk avatar Dec 07 '25 18:12 rwgk

@rwgk could you check if https://github.com/NVIDIA/cuda-python/pull/1468 would fix it?

leofang avatar Jan 23 '26 14:01 leofang

I deleted the comment I posted a few minutes ago, I'll have to try again :-(

Sorry, I forgot to check for the silent downgrading before, and it bit me again:

smc120-0009.ipp2a2.colossus.nvidia.com:/home/scratch.rgrossekunst_sw/logs_mirror/smc120-0009.ipp2a2.colossus/logs/test_cufile_multi_v13.1_13de2c20 $ grep 'Successfully installed' ../cuda-python_qa_bindings_linux_2026-01-23+150622_build_log.txt
Successfully installed pip-25.3
  Successfully installed packaging-26.0 setuptools-80.10.1 setuptools_scm-9.2.2 wheel-0.46.3
Successfully installed cuda-pathfinder-1.3.4.dev109+g13de2c20b iniconfig-2.3.0 packaging-26.0 pluggy-1.6.0 pygments-2.19.2 pytest-9.0.2
  Successfully installed cython-3.2.4 packaging-26.0 pyclibrary-0.3.0 pyparsing-3.3.2 setuptools-80.10.1 setuptools_scm-9.2.2
Successfully installed cuda-bindings-13.1.2.dev95+g13de2c20b cython-3.2.4 numpy-2.4.1 py-cpuinfo-9.0.0 pyglet-2.1.12 pytest-benchmark-5.2.3 setuptools-80.10.1
  Successfully installed Cython-3.2.4 packaging-26.0 setuptools-80.10.1 setuptools-scm-9.2.2
  Successfully installed cuda-bindings-13.1.1 cuda-pathfinder-1.3.3
Successfully installed cuda-core-0.5.1.dev62+g13de2c20b pytest-randomly-4.0.1

rwgk avatar Jan 24 '26 01:01 rwgk

Oh! The logs and summary I posted before were actually correct. I didn't realize that's a side-effect of the build isolation. TIL

(I'll repost the logs asap)

  Explanation

  The "Successfully installed cuda-bindings-13.1.1" message is from a temporary build environment, not your main virtual environment.
  1. First installation (line 2185): cuda-bindings-13.1.2.dev95+g13de2c20b is installed in TestVenv.
  2. Building cuda-core (lines 2244-2258): When installing cuda-core in editable mode, pip creates a temporary build environment
     (/tmp/rgrossekunst-tmp/pip-build-env-yu8kbrk0/overlay/) to install backend dependencies needed to build the package.
  3. Backend dependency installation (lines 2247-2257): In that build environment, pip installs cuda-bindings==13.* (from cuda-core's pyproject.toml), which resolves to 13.1.1
     from PyPI. The "Successfully installed" message refers to this temporary environment.
  4. Final state: After the build, the main TestVenv still has cuda-bindings-13.1.2.dev95+g13de2c20b installed, which is why pip list shows that version.

  This is pip's build isolation: backend dependencies are installed in a temporary environment for building, and those messages can be misleading because they refer to the build
  environment, not your main environment. The main environment is unaffected by those installations.
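
Since the "Successfully installed" lines can come from the temporary build environment, a more direct check is to ask the interpreter of the main environment what it actually has installed. Below is a minimal sketch using the standard-library `importlib.metadata`. The helper name `installed_version` is hypothetical, and the snippet has to be run with the Python of the environment under test (TestVenv above).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the version of `dist_name` in the current environment, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they appear in the build log above.
for name in ("cuda-bindings", "cuda-core", "cuda-pathfinder"):
    print(f"{name}: {installed_version(name)}")
```

If the build worked as described, this should report the dev version of cuda-bindings rather than 13.1.1, matching what pip list shows.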

rwgk avatar Jan 24 '26 01:01 rwgk

@sourabgupta3 for awareness — Note: the below is for CTK 13.1.1 (cuda_13.1.1_590.48.01_linux.run)

Reposting after convincing myself that the build worked as expected:


> could you check if https://github.com/NVIDIA/cuda-python/pull/1468 would fix it?

It seems to be better, but there is still >10% flakiness: 21 of the 200 trial logs summarized below contain a crash.

Full logs and additional files with many details are here (internal access only):

/home/scratch.rgrossekunst_sw/logs_mirror/smc120-0009.ipp2a2.colossus/logs/test_cufile_multi_v13.1_13de2c20

The matching full build log:

/home/scratch.rgrossekunst_sw/logs_mirror/smc120-0009.ipp2a2.colossus/logs/cuda-python_qa_bindings_linux_2026-01-23+150622_build_log.txt

Here is a high-level summary based on the full log files:

================================================================================
QA Test Logs Analysis Summary
================================================================================

Total files analyzed: 200
Files with no flakes (all passed): 179
Files with crashes: 21

================================================================================
Error Details
================================================================================

Files with crashes (21):
  - trial17_norm_log_2026-01-23+153614.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial1_ptds_log_2026-01-23+152315.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial20_norm_log_2026-01-23+153849.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial23_norm_log_2026-01-23+154116.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial25_ptds_log_2026-01-23+154315.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial2_ptds_log_2026-01-23+152404.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial32_ptds_log_2026-01-23+154932.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial45_ptds_log_2026-01-23+160048.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial47_ptds_log_2026-01-23+160227.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_buf_register_already_registered
      Error messages:
        invalid directIO size (KB) specified: 0 min: 64 max: 16384
        invalid directIO size (KB) specified: 0 min: 1 max: 256
        invalid poll threshold size (KB) specified: 0 min: 4 max: 18014398509481980
        invalid io timeout specified, (ms) 0 min: 1 max: 1000
        invalid directIO size (KB) specified: 0 min: 1 max: 256
      Crash indicator: Fatal Python error: Floating point exception

  - trial4_norm_log_2026-01-23+152516.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_cufile_read_write_host_memory
      Error messages:
        invalid directIO size (KB) specified: 0 min: 64 max: 16384
        invalid directIO size (KB) specified: 0 min: 1 max: 256
        invalid poll threshold size (KB) specified: 0 min: 4 max: 18014398509481980
        invalid io timeout specified, (ms) 0 min: 1 max: 1000
        invalid directIO size (KB) specified: 0 min: 1 max: 256
      Crash indicator: Fatal Python error: Floating point exception

  - trial50_ptds_log_2026-01-23+160440.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial54_norm_log_2026-01-23+160740.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial61_norm_log_2026-01-23+161345.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial67_norm_log_2026-01-23+161848.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial68_ptds_log_2026-01-23+161951.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial73_norm_log_2026-01-23+162328.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial7_norm_log_2026-01-23+152733.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial82_ptds_log_2026-01-23+163143.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial91_norm_log_2026-01-23+163852.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

  - trial92_ptds_log_2026-01-23+164004.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_stats_start_stop
      Error messages:
        invalid directIO size (KB) specified: 0 min: 64 max: 16384
        invalid directIO size (KB) specified: 0 min: 1 max: 256
        invalid poll threshold size (KB) specified: 0 min: 4 max: 18014398509481980
        invalid io timeout specified, (ms) 0 min: 1 max: 1000
        invalid directIO size (KB) specified: 0 min: 1 max: 256
      Crash indicator: Fatal Python error: Floating point exception

  - trial95_ptds_log_2026-01-23+164220.txt
    Number of crashes: 1
    Crash at line 1:
      rootdir: /wrk/forked/cuda-python/cuda_bindings
      Test session start: ============================= test session starts ==============================
      Likely failing test: tests/test_cufile.py::test_get_stats_l3
      Crash indicator: Fatal Python error: Segmentation fault

================================================================================
Overall Statistics
================================================================================

Total tests passed (across all files): 5006

================================================================================
ERROR Summary
================================================================================

     1  ERROR tests/test_cufile.py::test_batch_io_cancel - cuda.bindings.cufile.cuFil...
     1  ERROR tests/test_cufile.py::test_batch_io_large_operations - cuda.bindings.cu...
     1  ERROR tests/test_cufile.py::test_buf_register_multiple_buffers - cuda.binding...
     1  ERROR tests/test_cufile.py::test_get_parameter_min_max_value - cuda.bindings....
     1  ERROR tests/test_cufile.py::test_get_stats_l3 - cuda.bindings.cufile.cuFileEr...
     1  ERROR tests/test_cufile.py::test_handle_register - cuda.bindings.cufile.cuFil...

================================================================================
Error Type Summary
================================================================================

Crashes: 21 files
  - trial17_norm_log_2026-01-23+153614.txt
  - trial1_ptds_log_2026-01-23+152315.txt
  - trial20_norm_log_2026-01-23+153849.txt
  - trial23_norm_log_2026-01-23+154116.txt
  - trial25_ptds_log_2026-01-23+154315.txt
  - trial2_ptds_log_2026-01-23+152404.txt
  - trial32_ptds_log_2026-01-23+154932.txt
  - trial45_ptds_log_2026-01-23+160048.txt
  - trial47_ptds_log_2026-01-23+160227.txt
  - trial4_norm_log_2026-01-23+152516.txt
  - trial50_ptds_log_2026-01-23+160440.txt
  - trial54_norm_log_2026-01-23+160740.txt
  - trial61_norm_log_2026-01-23+161345.txt
  - trial67_norm_log_2026-01-23+161848.txt
  - trial68_ptds_log_2026-01-23+161951.txt
  - trial73_norm_log_2026-01-23+162328.txt
  - trial7_norm_log_2026-01-23+152733.txt
  - trial82_ptds_log_2026-01-23+163143.txt
  - trial91_norm_log_2026-01-23+163852.txt
  - trial92_ptds_log_2026-01-23+164004.txt
  - trial95_ptds_log_2026-01-23+164220.txt

================================================================================
Counts of "Likely failing test"
================================================================================

    18  tests/test_cufile.py::test_get_stats_l3
     1  tests/test_cufile.py::test_buf_register_already_registered
     1  tests/test_cufile.py::test_cufile_read_write_host_memory
     1  tests/test_cufile.py::test_stats_start_stop
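
The numbers in the summary above can be cross-checked with a short sketch that uses only the file names and counts reported there. Reading the `norm` / `ptds` part of each file name as the run mode is an assumption based on the naming, with `ptds` presumably the CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1 runs.

```python
import re
from collections import Counter

TOTAL_LOGS = 200  # "Total files analyzed" in the summary

# Copied from the "Crashes: 21 files" list in the summary.
CRASHED_LOGS = """
trial17_norm_log_2026-01-23+153614.txt trial1_ptds_log_2026-01-23+152315.txt
trial20_norm_log_2026-01-23+153849.txt trial23_norm_log_2026-01-23+154116.txt
trial25_ptds_log_2026-01-23+154315.txt trial2_ptds_log_2026-01-23+152404.txt
trial32_ptds_log_2026-01-23+154932.txt trial45_ptds_log_2026-01-23+160048.txt
trial47_ptds_log_2026-01-23+160227.txt trial4_norm_log_2026-01-23+152516.txt
trial50_ptds_log_2026-01-23+160440.txt trial54_norm_log_2026-01-23+160740.txt
trial61_norm_log_2026-01-23+161345.txt trial67_norm_log_2026-01-23+161848.txt
trial68_ptds_log_2026-01-23+161951.txt trial73_norm_log_2026-01-23+162328.txt
trial7_norm_log_2026-01-23+152733.txt trial82_ptds_log_2026-01-23+163143.txt
trial91_norm_log_2026-01-23+163852.txt trial92_ptds_log_2026-01-23+164004.txt
trial95_ptds_log_2026-01-23+164220.txt
""".split()

# Copied from 'Counts of "Likely failing test"' in the summary.
LIKELY_FAILING = Counter({
    "tests/test_cufile.py::test_get_stats_l3": 18,
    "tests/test_cufile.py::test_buf_register_already_registered": 1,
    "tests/test_cufile.py::test_cufile_read_write_host_memory": 1,
    "tests/test_cufile.py::test_stats_start_stop": 1,
})


def mode_of(log_name):
    """Extract the run mode ("norm" or "ptds") from a trial log file name."""
    match = re.match(r"trial\d+_(norm|ptds)_log_", log_name)
    return match.group(1) if match else "unknown"


crash_rate = len(CRASHED_LOGS) / TOTAL_LOGS
by_mode = Counter(mode_of(name) for name in CRASHED_LOGS)

print(f"crash rate: {crash_rate:.1%}")
print(f"crashes by mode: {dict(by_mode)}")
print(f"crashes attributed to tests: {sum(LIKELY_FAILING.values())}")
```

The crashed logs split almost evenly between the two modes, so on this data the run mode does not look like the deciding factor.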

rwgk avatar Jan 24 '26 04:01 rwgk