
cufile tests don't pass on ext4 filesystems in CI

Open cpcloud opened this issue 2 months ago • 4 comments

After https://github.com/NVIDIA/cuda-python/pull/1271, we have to xfail a number of cufile tests in CI that run on an ext4 filesystem even though that is nominally supported.

It would be good to figure out what's going on here and why these don't work in CI even though they are running on an ext4 filesystem.

cpcloud avatar Dec 03 '25 19:12 cpcloud

cc @sourabgupta3

cpcloud avatar Dec 03 '25 19:12 cpcloud

I was curious and looked at the CI run on main after #1271 was merged:

https://github.com/NVIDIA/cuda-python/actions/runs/19907928749

After unpacking the log archive locally:

$ grep 'Z XFAIL' Test*.txt | wc -l
169

Showing only the first few:

$ grep 'Z XFAIL' Test*.txt | head -n 20
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0964244Z XFAIL tests/test_cufile.py::test_handle_register - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0966208Z XFAIL tests/test_cufile.py::test_cufile_read_write - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0967481Z XFAIL tests/test_cufile.py::test_cufile_read_write_host_memory - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0969007Z XFAIL tests/test_cufile.py::test_cufile_read_write_large - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0970180Z XFAIL tests/test_cufile.py::test_cufile_write_async - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0971317Z XFAIL tests/test_cufile.py::test_cufile_read_async - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0972477Z XFAIL tests/test_cufile.py::test_cufile_async_read_write - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0973611Z XFAIL tests/test_cufile.py::test_batch_io_basic - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0975287Z XFAIL tests/test_cufile.py::test_batch_io_cancel - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0976371Z XFAIL tests/test_cufile.py::test_batch_io_large_operations - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0977385Z XFAIL tests/test_cufile.py::test_get_stats_l1 - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0978304Z XFAIL tests/test_cufile.py::test_get_stats_l2 - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0979223Z XFAIL tests/test_cufile.py::test_get_stats_l3 - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0949667Z XFAIL tests/test_cufile.py::test_handle_register - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0950742Z XFAIL tests/test_cufile.py::test_cufile_read_write - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0952276Z XFAIL tests/test_cufile.py::test_cufile_read_write_host_memory - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0953452Z XFAIL tests/test_cufile.py::test_cufile_read_write_large - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0955089Z XFAIL tests/test_cufile.py::test_cufile_write_async - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0956151Z XFAIL tests/test_cufile.py::test_cufile_read_async - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0957243Z XFAIL tests/test_cufile.py::test_cufile_async_read_write - handle_register call fails in CI for unknown reasons

Note that I've seen all of those tests pass interactively on a couple of colossus machines (linux-64, linux-aarch64).
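For anyone who wants a per-test breakdown rather than just a total, the grep above can be approximated in Python. This is only a sketch: it assumes the unpacked logs match `Test*.txt` in the current directory and that XFAIL lines have the `<timestamp>Z XFAIL <test id> - <reason>` shape shown above. The names `XFAIL_RE` and `tally_xfails` are made up for illustration.

```python
import glob
import re
from collections import Counter

# Matches the tail of a log line such as:
#   2025-12-03T20:40:43.0964244Z XFAIL tests/test_cufile.py::test_handle_register - <reason>
XFAIL_RE = re.compile(r"Z XFAIL (\S+) - (.*)$")


def tally_xfails(lines):
    """Count XFAIL lines per test id and per reason string."""
    by_test = Counter()
    by_reason = Counter()
    for line in lines:
        match = XFAIL_RE.search(line.rstrip("\n"))
        if match:
            by_test[match.group(1)] += 1
            by_reason[match.group(2)] += 1
    return by_test, by_reason


if __name__ == "__main__":
    all_lines = []
    # Assumed layout: one text log per CI job, unpacked into the current directory.
    for name in sorted(glob.glob("Test*.txt")):
        with open(name, encoding="utf-8", errors="replace") as handle:
            all_lines.extend(handle)
    by_test, by_reason = tally_xfails(all_lines)
    for test_id, count in by_test.most_common():
        print(f"{count:4d}  {test_id}")
```

Grouping by reason as well makes it easy to confirm whether every XFAIL carries the same `handle_register` message or whether other reasons are mixed in.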

rwgk avatar Dec 03 '25 23:12 rwgk

Based on offline discussions, one possible theory for why we had to add the xfail is that cuFILE does not (yet) support containers (overlayfs). We will be able to test this theory soon, once cuFILE adds this capability.

leofang avatar Dec 04 '25 02:12 leofang

Not quite.

In this case, the tests are running on ext4, because the working directory of the runner (which the test directory is a child of) is mounted on a virtual disk (usually /dev/vda1) with an ext4 filesystem.

/ is a separate mount point from the one at /github/home (or whatever the runner working directory is called). Just because / is a path prefix of /github/home doesn't mean that /github/home lives on the filesystem mounted at /, so these failures are not currently explained by the fact that / happens to be overlayfs.
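To make the mount-point argument concrete, here is a rough sketch of how one could check which mount (and filesystem type) actually backs a given path, by picking the longest matching mount point from `/proc/mounts`-style text. The helper names `parse_mounts` and `find_mount` are hypothetical, and the sample table in the usage note just mirrors the paths mentioned above rather than real runner output. It says nothing about what cuFile itself inspects.

```python
import posixpath


def parse_mounts(mounts_text):
    """Parse /proc/mounts-style text into (mount_point, fs_type, device) tuples."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type = fields[:3]
        # Caveat: /proc/mounts escapes spaces in paths as \040; not handled here.
        entries.append((mount_point, fs_type, device))
    return entries


def find_mount(path, entries):
    """Return the entry whose mount point is the longest prefix of path, or None."""
    path = posixpath.normpath(path)
    best = None
    for mount_point, fs_type, device in entries:
        mp = mount_point.rstrip("/") or "/"
        # Compare whole path components so /github/homework does not match /github/home.
        if mp == "/" or path == mp or path.startswith(mp + "/"):
            # ">=" lets a later mount shadow an earlier one at the same mount point.
            if best is None or len(mp) >= len(best[0]):
                best = (mp, fs_type, device)
    return best
```

With a table like `overlay / overlay rw 0 0` followed by `/dev/vda1 /github/home ext4 rw 0 0`, a path under /github/home resolves to the ext4 entry even though / is overlayfs. On an actual runner one would pass the contents of /proc/mounts and the resolved path (for example via `os.path.realpath`) of the directory the tests write to.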

cpcloud avatar Dec 04 '25 17:12 cpcloud