cufile tests don't pass on ext4 filesystems in CI
After https://github.com/NVIDIA/cuda-python/pull/1271, we have to xfail a number of cufile tests in CI that run on an ext4 filesystem even though that is nominally supported.
It would be good to figure out what's going on here and why these don't work in CI even though they are running on an ext4 filesystem.
cc @sourabgupta3
I was curious and looked at the CI run on main after #1271 was merged:
https://github.com/NVIDIA/cuda-python/actions/runs/19907928749
After unpacking the log archive locally:
$ grep 'Z XFAIL' Test*.txt | wc -l
169
Showing only the first few:
$ grep 'Z XFAIL' Test*.txt | head -n 20
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0964244Z XFAIL tests/test_cufile.py::test_handle_register - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0966208Z XFAIL tests/test_cufile.py::test_cufile_read_write - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0967481Z XFAIL tests/test_cufile.py::test_cufile_read_write_host_memory - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0969007Z XFAIL tests/test_cufile.py::test_cufile_read_write_large - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0970180Z XFAIL tests/test_cufile.py::test_cufile_write_async - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0971317Z XFAIL tests/test_cufile.py::test_cufile_read_async - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0972477Z XFAIL tests/test_cufile.py::test_cufile_async_read_write - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0973611Z XFAIL tests/test_cufile.py::test_batch_io_basic - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0975287Z XFAIL tests/test_cufile.py::test_batch_io_cancel - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0976371Z XFAIL tests/test_cufile.py::test_batch_io_large_operations - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0977385Z XFAIL tests/test_cufile.py::test_get_stats_l1 - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0978304Z XFAIL tests/test_cufile.py::test_get_stats_l2 - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.10__13.0.2__wheels__l4.txt:2025-12-03T20:40:43.0979223Z XFAIL tests/test_cufile.py::test_get_stats_l3 - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0949667Z XFAIL tests/test_cufile.py::test_handle_register - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0950742Z XFAIL tests/test_cufile.py::test_cufile_read_write - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0952276Z XFAIL tests/test_cufile.py::test_cufile_read_write_host_memory - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0953452Z XFAIL tests/test_cufile.py::test_cufile_read_write_large - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0955089Z XFAIL tests/test_cufile.py::test_cufile_write_async - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0956151Z XFAIL tests/test_cufile.py::test_cufile_read_async - handle_register call fails in CI for unknown reasons
Test_linux-64___py3.11__13.0.2__local__l4.txt:2025-12-03T20:40:58.0957243Z XFAIL tests/test_cufile.py::test_cufile_async_read_write - handle_register call fails in CI for unknown reasons
Note that I've seen all of those tests pass interactively on a couple of colossus machines (linux-64, linux-aarch64).
Based on offline discussions, one possible explanation for why we had to add the xfail is that cuFILE does not (yet) support containers (overlayfs). We will be able to test this theory once cuFILE adds this capability.
Not quite.
In this case, the tests are running on ext4, because the working directory of the runner (which the test directory is a child of) is mounted on a virtual disk (usually /dev/vda1) with an ext4 filesystem.
/ is a separate mount point from the one at /github/home (or whatever the runner working directory is called). Just because / is a path prefix of /github/home doesn't mean that files under /github/home live on the filesystem mounted at /; the most specific mount point containing a path determines its filesystem. So these failures are not currently explained by the fact that / happens to be overlayfs.
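To make the mount-point argument concrete, here is a minimal sketch of how a path resolves to a filesystem: the longest mount point that contains the path wins, not the root. The `find_mount` helper and the `SAMPLE_MOUNTS` table are hypothetical and only mirror the layout described above (overlay on /, ext4 from /dev/vda1 on /github/home); they are not taken from the actual runner. The sketch also ignores the octal escaping that /proc/mounts uses for spaces in paths.

```python
import os


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # A mount contains the path if it is the path itself or a parent directory.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties, the later entry wins.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


# Hypothetical mount table resembling the CI container described above.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
proc /proc proc rw 0 0
/dev/vda1 /github/home ext4 rw,relatime 0 0
"""

print(find_mount("/github/home/tests/data.bin", SAMPLE_MOUNTS))  # ('/github/home', 'ext4')
print(find_mount("/tmp/data.bin", SAMPLE_MOUNTS))                # ('/', 'overlay')
```

On a real runner, the same check could be run against the live table by passing `open("/proc/mounts").read()` as `mounts_text`, or from the shell with `findmnt -T <path>` or `df -T <path>`.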