kvikio still segfaults on program termination
Hi everyone,
I'm getting a segfault when my python script terminates. This only happens when kvikio is used.
Reproducer
mamba env create -f img2tensor_kvikio.yaml && mamba clean -afy
# filename: img2tensor_kvikio.yaml
name: img2tensor
channels:
- pytorch
- nvidia
- rapidsai
- conda-forge
dependencies:
- notebook
- tifffile
- python=3.11
- pytorch
- pytorch-cuda=12.4
- kvikio
bug.py
import kvikio
file_name = 'file0.txt'
fd = kvikio.CuFile(file_name, "w")
fd.close()
I'm running in a kubernetes environment. We use the open kernel driver 535.183.01
I assumed #462 had fixed the issue, but it seems there is more to it.
You can find the concretized environment here: exported_img2tensor_kvikio.txt
It uses kvikio 24.10 which should include the previously mentioned PR.
Could you please slim the environment further like so and retry?
# filename: kvikio2410_cuda122.yaml
name: kvikio2410_cuda122
channels:
- rapidsai
- conda-forge
dependencies:
- cuda-version=12.2
- python=3.11
- kvikio=24.10
Asking because there are mismatched CUDA versions in the reproducing environment, plus some extra bits that appear unused in the example, so I'd like to simplify further to avoid other potential issues.
Unfortunately it still segfaults. I again attached the concretized dependency list kvikio2410_cuda122.txt.
The cuda version mismatch seems resolved. Also the cufile.log seems fine to me. I'm using a MIG slice from an A100 and writing to a weka fs works fine. It only segfaults on program termination
Can you show a backtrace from the segfault? E.g. with gdb:
gdb --args python bug.py
(gdb) run
(gdb) backtrace full
(gdb) run
Starting program: /opt/conda/envs/kvikio2410_cuda122/bin/python bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff47eb700 (LWP 2675)]
[New Thread 0x7ffff3fea700 (LWP 2676)]
[New Thread 0x7fffeb7e9700 (LWP 2677)]
[New Thread 0x7fffdaae0700 (LWP 2678)]
[New Thread 0x7fffcdfff700 (LWP 2679)]
[New Thread 0x7fffcd21d700 (LWP 2691)]
[New Thread 0x7fffcca1c700 (LWP 2692)]
[New Thread 0x7fffc7fff700 (LWP 2693)]
[New Thread 0x7fffc77fe700 (LWP 2694)]
[New Thread 0x7fffc6ffd700 (LWP 2695)]
[New Thread 0x7fffc67fc700 (LWP 2696)]
[New Thread 0x7fffc5ffb700 (LWP 2697)]
[New Thread 0x7fffc57fa700 (LWP 2698)]
[Thread 0x7fffdaae0700 (LWP 2678) exited]
[Thread 0x7fffcd21d700 (LWP 2691) exited]
[Thread 0x7fffc57fa700 (LWP 2698) exited]
[Thread 0x7fffc5ffb700 (LWP 2697) exited]
[Thread 0x7fffc6ffd700 (LWP 2695) exited]
[Thread 0x7fffc77fe700 (LWP 2694) exited]
[Thread 0x7fffc7fff700 (LWP 2693) exited]
[Thread 0x7fffcca1c700 (LWP 2692) exited]
[Thread 0x7fffeb7e9700 (LWP 2677) exited]
[Thread 0x7ffff3fea700 (LWP 2676) exited]
[Thread 0x7ffff47eb700 (LWP 2675) exited]
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
std::basic_streambuf<char, std::char_traits<char> >::xsputn (this=0x7fffffffd7a8, __s=0x5555563aa252 "", __n=93824998875808)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:90
90 /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc: No such file or directory.
(gdb) backtrace full
#0 std::basic_streambuf<char, std::char_traits<char> >::xsputn (this=0x7fffffffd7a8, __s=0x5555563aa252 "", __n=93824998875808)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:90
__remaining = <optimized out>
__len = <optimized out>
__buf_len = 8388607
__ret = <optimized out>
#1 0x00007ffff78c169d in std::__ostream_write<char, std::char_traits<char> > (__out=..., __s=<optimized out>, __n=93824998875808)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_ios.h:325
__put = <optimized out>
#2 0x00007ffff78c1774 in std::__ostream_insert<char, std::char_traits<char> > (__out=..., __s=0x555555baa298 "Read", __n=93824998875808)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_ios.h:184
__w = <error reading variable __w (dwarf2_find_location_expression: Corrupted DWARF expression.)>
__cerb = {_M_ok = true, _M_os = @0x7fffffffd7a0}
#3 0x00007fffda13044f in ?? () from /opt/conda/envs/kvikio2410_cuda122/lib/python3.11/site-packages/kvikio/_lib/../../../../libcufile.so.0
No symbol table info available.
#4 0x00007fffda13206b in ?? () from /opt/conda/envs/kvikio2410_cuda122/lib/python3.11/site-packages/kvikio/_lib/../../../../libcufile.so.0
No symbol table info available.
#5 0x00007fffda080c82 in ?? () from /opt/conda/envs/kvikio2410_cuda122/lib/python3.11/site-packages/kvikio/_lib/../../../../libcufile.so.0
No symbol table info available.
#6 0x00007ffff7fe0f6b in _dl_fini () at dl-fini.c:138
array = 0x7fffda2bc1d0
i = <optimized out>
l = 0x555555efa720
maps = 0x7fffffffdb80
i = <optimized out>
l = <optimized out>
nmaps = <optimized out>
nloaded = <optimized out>
ns = 0
do_audit = <optimized out>
__PRETTY_FUNCTION__ = "_dl_fini"
#7 0x00007ffff7c9a8a7 in __run_exit_handlers (status=0, listp=0x7ffff7e40718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
atfct = <optimized out>
onfct = <optimized out>
cxafct = <optimized out>
f = <optimized out>
new_exitfn_called = 262
cur = 0x7ffff7e41ca0 <initial>
#8 0x00007ffff7c9aa60 in __GI_exit (status=<optimized out>) at exit.c:139
No locals.
#9 0x00007ffff7c7808a in __libc_start_main (main=0x5555557dea20 <main>, argc=2, argv=0x7fffffffdec8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdeb8) at ../csu/libc-start.c:342
result = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {93824995523264, -3934155394888934001, 93824994896209, 140737488346816, 0, 0, 3934155393885101455, 3934172503229554063}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x2,
0x7fffffffdec8}, data = {prev = 0x0, cleanup = 0x0, canceltype = 2}}}
not_first_call = <optimized out>
#10 0x00005555557de97a in _start () at /usr/local/src/conda/python-3.11.10/Parser/parser.c:33931
No symbol table info available.
OK, thanks. Something in cufile is running below main. We'll try to reproduce locally, perhaps with a debug build, so we can get a bit more information.
Thanks a lot for looking into this. If there is something I can do to help you reproduce the error please let me know.
@EricKern, what if you run with KVIKIO_COMPAT_MODE=ON ?
JFYI, to get a debug build of python add the following to channels above conda-forge: conda-forge/label/python_debug
@EricKern, what if you run with KVIKIO_COMPAT_MODE=ON ?
With compat mode on there is no segmentation fault. If I set it to "off" then it appears again.
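For anyone hitting this in the meantime, compat mode can also be forced from inside a script; a minimal sketch, assuming the KVIKIO_COMPAT_MODE environment variable that was suggested above (it must be set before the first kvikio import, since the library reads it when it loads):

```python
import os

# Force kvikio's POSIX compatibility mode so libcufile is never used.
# This must happen before kvikio is first imported.
os.environ["KVIKIO_COMPAT_MODE"] = "ON"

# import kvikio  # uncomment on a machine where kvikio is installed
```

With compat mode on, kvikio falls back to POSIX read/write, which sidesteps the exit-time crash at the cost of losing GDS.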
JFYI, to get a debug build of python add the following to channels above conda-forge: conda-forge/label/python_debug
Do you think that this might produce a better backtrace from the crash or is there anything else that I could do with a debug build of python?
Lawrence mentioned doing a debug build. So wanted to share that resource
If the segfault happens somewhere in KvikIO, it may help. If it happens in cuFile, we likely don't learn much
If Mads can't repro next week, I guess I'll try and figure out how to set up cufile/gds on my workstation and do some spelunking
If Mads can't repro next week, I guess I'll try and figure out how to set up cufile/gds on my workstation and do some spelunking
I will take a look tomorrow
I am not able to reproduce, the conda environment works fine for me :/ I have asked the cuFile team for input.
cuDF is seeing the same issue (https://github.com/rapidsai/cudf/issues/17121) arising from cuFile (here cuFile API is accessed directly from within cuDF not through KvikIO).
Btw, when cuDF did use KvikIO to perform GDS I/O, we observed that the segfault is manifested when KVIKIO_NTHREADS is set to 8, not the default 1. But I think this is a red herring. At the time of crash, backtrace points to some CUDA calls made by cuFile after the main returns. This should be cuFile doing implicit driver closing.
Also, adding cuFileDriverClose() before the main returns seems to prevent the segfault in cuDF's benchmark.
@madsbk May I ask if you have used a MIG slice or a full GPU in your tests? I'm currently not able to use a full A100 but as soon it's available again I want to try and reproduce the segfault on a full A100. Before using kvikio I have successfully used the cufile C++ API without a problem. Even with a MIG.
I am running on a full GPU.
https://github.com/rapidsai/kvikio/pull/514 implements Python bindings to cuFileDriverOpen() and cuFileDriverClose(). The hope is that we can prevent this issue in Python by calling cuFileDriverClose() at module exit.
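A user-side version of that idea could look like the following sketch. It assumes the `driver_open`/`driver_close` names under `kvikio.cufile_driver` that #514 adds, and guards the import so the snippet runs even where kvikio is absent:

```python
import atexit

try:
    import kvikio.cufile_driver as cufile_driver
except ImportError:  # kvikio not installed; make the sketch a no-op
    cufile_driver = None


def init_cufile_driver():
    """Open the cuFile driver eagerly and close it at interpreter exit,
    instead of relying on libcufile's own destructor-time cleanup."""
    if cufile_driver is None:
        return
    cufile_driver.driver_open()
    # Python atexit handlers run before the dynamic loader's _dl_fini,
    # which is where the backtraces above show the crash happening.
    atexit.register(cufile_driver.driver_close)


init_cufile_driver()
```

Whether an explicit close is enough is exactly what the rest of this thread probes; as reported below, it did not help when cufile_stats > 0.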
I continued playing around with the environment to ensure the issue was not related to my setup.
Just a few minutes ago I found out that the segmentation fault on termination does not occur when I set "cufile_stats": 0 in cufile.json. Any value of cufile_stats above 0 causes the segfault. But as mentioned, during execution everything works fine: the READ-WRITE SIZE histogram is written to cufile.log and all. Only on termination does the segfault happen. I could observe this both inside and outside of a docker container.
Do you still think that this is related to cufileDriverClose()?
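As a stop-gap, the stats collection can be disabled per process by pointing cuFile at a custom config through CUFILE_ENV_PATH_JSON. A minimal sketch (file name and path are illustrative; I'm assuming keys missing from the custom file fall back to cuFile's defaults, so copy and edit /etc/cufile.json instead if in doubt):

```python
import json
import os
import tempfile

# Minimal cuFile config that turns stats collection off.
config = {"profile": {"cufile_stats": 0}}

path = os.path.join(tempfile.gettempdir(), "cufile_no_stats.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# cuFile reads CUFILE_ENV_PATH_JSON when libcufile is loaded,
# so set it before the first kvikio import.
os.environ["CUFILE_ENV_PATH_JSON"] = path
```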
Originally by @EricKern in https://github.com/rapidsai/kvikio/pull/514#issuecomment-2439958534:
I've built and rerun my small segfault reproducer script without explicitly opening and closing the driver. This still causes the segfault when I set profile.cufile_stats in cufile.json to anything above 0. It also still happens when I explicitly open and close the driver. If profile.cufile_stats=0, everything works fine. I guess my segfault (#497) is unrelated to the driver initialization and destruction.
I have tested this on my local machine where I currently don't have a GDS-supported file system. So no actual writing happened. Only initialization and then cufile's switch to its own compatibility mode. But even then, the segfault was reproducible on another machine.
@tell-rebanta do you know of a cuFile bug related to setting profile.cufile_stats to something greater than zero?
@madsbk I am not aware of any cufile bug related to > 0 cufile_stats value. Wrote a small program which does direct dlopen of libcufile (not through kvikio) without explicit opening/closing the driver along with a non-zero positive cufile_stats value, but could not reproduce the issue with the latest bits of libcufile. Which libcufile version you were using ?
@tell-rebanta according to cufile.log debug output:
GDS release version: 1.7.2.10
nvidia_fs version: 2.17
libcufile version: 2.12
Platform: x86_64
I can install gds-tools, set LD_LIBRARY_PATH to the libcufile of the conda installation (/opt/conda/envs/kvikio2410_cuda122/lib/), and then run gdsio with it. Then there is no problem: no segfault occurs, independent of the cufile_stats level.
The segfault only happens when libcufile is loaded by kvikio in python when the python program terminates.
Of course the possibility of a user error on my side still exists. I remember that the segfault also happened a few weeks ago when I was trying out cuCIM. This was a hint to me that it might be caused by my environment. As far as I know, cuCIM has its own GDS wrapper and doesn't use kvikio under the hood. At that time I had no idea what the root cause could be and switched to kvikio. But since then, with kvikio, I could reproduce the segfault in a kubernetes pod, on a VM inside and outside a docker container, and on my personal laptop. So I assume that this error is not related to the machines I'm running on.
From the software perspective I assume the containerized environment should also rule out any software environment issues.
My docker image is basically:
FROM condaforge/miniforge3:24.3.0-0 as base
RUN apt-get update && \
apt-get -y install ibverbs-providers libibverbs-dev librdmacm-dev \
&& apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/*
COPY kvikio2410_cuda122.yaml /tmp/
RUN mamba env create -f /tmp/kvikio2410_cuda122.yaml && mamba clean -afy
RUN apt-get update && \
apt-get -y install libnuma-dev \
&& apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/*
Then I run the container with this docker wrapper, or even with slightly more privileges when using a wekaFS in kubernetes (hostnetwork=true).
I don't know what else I could be doing wrong or what you are doing differently.
@madsbk How do we continue with this? Have you been able to reproduce the segfault with cufile-stats > 0?
Sorry, I am still not able to reproduce :/
Can you try setting allow_compat_mode=false in the config cufile.json? This will force cuFile to use GDS or fail.
Also try setting execution::parallel_io=false to rule out a threading issue.
Thanks for the suggestions I'll try it with these options again
Hi All, I've encountered the exact same segmentation fault as @EricKern , with an identical backtrace stack.
I've spent several hours attempting to resolve this by creating lots of different virtual environments using Conda and Mamba for kvikio and cudf. Unfortunately, every attempt has led to a segmentation fault when using GDS. While some backtraces don't display all the information, I suspect they share the same root cause.
I'm wondering if there's a way to bypass this segmentation fault. I haven't been able to find a workaround or any related issues. Although I believe this segmentation fault doesn't impact the program's performance, it does sometimes corrupt parts of the nsys profiling results when the program terminates.
The segfault is still there if I set allow_compat_mode=false and execution::parallel_io=false in cufile.json, as @madsbk suggested earlier in this issue.
(Before I found this issue page, I also decided to investigate whether raw cufile calls would result in the same segmentation fault. To do this, I let the program using raw cufile calls link against the same shared object file from the venv. Interestingly, no segmentation fault occurred. This is just an additional observation.)
Setup
Generally, I tried to reproduce as discussed previously, but updated kvikio to 25.02.
- I wonder if it is an issue that my gdscheck does not match the conda cufile version.
gdscheck
$ gdscheck -v
GDS release version: 1.13.1.3
nvidia_fs version: 2.17
libcufile version: 2.12
Platform: x86_64
conda yaml with kvikio=25.02:
name: kvikio2502_cuda122
channels:
- rapidsai
- conda-forge
dependencies:
- cuda-version=12.2
- python=3.12
- kvikio=25.02
details
$ conda list kvikio
# packages in environment at /home/jluo/miniconda3/envs/kvikio2502_cuda122:
#
# Name Version Build Channel
kvikio 25.02.01 cuda12_py312_250227_g8fecf06_0 rapidsai
libkvikio 25.02.01 cuda12_250227_g8fecf06_0 rapidsai
$ conda list cufile
# packages in environment at /home/jluo/miniconda3/envs/kvikio2502_cuda122:
#
# Name Version Build Channel
libcufile 1.7.2.10 hd3aeb46_0 conda-forge
libcufile-dev 1.7.2.10 hd3aeb46_0 conda-forge
gdb bt:
[Thread 0x7ffef0ac5640 (LWP 1876331) exited]
[Thread 0x7ffef22c8640 (LWP 1876328) exited]
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
std::basic_streambuf<char, std::char_traits<char> >::xsputn (this=0x7fffffffb0c8, __s=0x5555570589b2 "", __n=93825018482064) at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1740238128824/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:90
90 /home/conda/feedstock_root/build_artifacts/gcc_compilers_1740238128824/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc: No such file or directory.
(gdb) bt full
#0 std::basic_streambuf<char, std::char_traits<char> >::xsputn (this=0x7fffffffb0c8, __s=0x5555570589b2 "", __n=93825018482064)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1740238128824/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:90
__remaining = <optimized out>
__len = <optimized out>
__buf_len = 2097151
__ret = <optimized out>
#1 0x00007ffff72e9c37 in std::__ostream_write<char, std::char_traits<char> > (__out=..., __s=<optimized out>, __n=93825018482064)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1740238128824/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/ostream_insert.h:53
__put = <optimized out>
#2 0x00007ffff72e9d0e in std::__ostream_insert<char, std::char_traits<char> > (__out=..., __s=0x555556e589f8 "Read", __n=93825018482064)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1740238128824/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/ostream_insert.h:104
__w = 0
__cerb = {_M_ok = true, _M_os = @0x7fffffffb0c0}
#3 0x00007ffe9864d44f in ?? () from /home/jluo/miniconda3/envs/kvikio2502_cuda122/lib/python3.12/site-packages/kvikio/_lib/../../../.././libcufile.so.0
No symbol table info available.
#4 0x00007ffe9864f06b in ?? () from /home/jluo/miniconda3/envs/kvikio2502_cuda122/lib/python3.12/site-packages/kvikio/_lib/../../../.././libcufile.so.0
No symbol table info available.
#5 0x00007ffe9859dc82 in ?? () from /home/jluo/miniconda3/envs/kvikio2502_cuda122/lib/python3.12/site-packages/kvikio/_lib/../../../.././libcufile.so.0
No symbol table info available.
#6 0x00007ffff7fc924e in ?? () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#7 0x00007ffff7cca495 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#8 0x00007ffff7cca610 in exit () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#9 0x00007ffff7caed97 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#10 0x00007ffff7caee40 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#11 0x00005555557f9321 in _start ()
No symbol table info available.
Below is a minimally reproducible C++ example for the segfault. It emulates KvikIO's dynamic loading, which seems to be correlated to the issue observed.
Concretely, KvikIO dynamically loads the cuFile shared library at runtime, and therefore the information on the cuFile library is not passed to the linker at link time. Apparently, the segfault somehow occurs with this setup (run_bad.sh).
If the information on the cuFile library is passed to the linker, the segfault would then be gone (run_good.sh). This, however, defies our purpose of using dynamic loading.
biu.cpp
Single source file to reproduce the segfault.
#include <cufile.h>
#include <dlfcn.h>
#include <fcntl.h>
#include <unistd.h>
#include <iostream>
#include <sstream>
#define CHECK_CUFILE(err_code) check_cufile(err_code, __FILE__, __LINE__)
void check_cufile(CUfileError_t err_code, const char* file, int line)
{
auto cufile_err_code = err_code.err; // CUfileOpError
if (cufile_err_code != CU_FILE_SUCCESS) {
std::stringstream ss;
ss << "cuFile error at " << file << ":" << line << std::endl;
throw std::runtime_error(ss.str());
}
}
#define EXPECT(condition) expect(condition, __FILE__, __LINE__)
inline void expect(bool condition, const char* file, int line)
{
if (condition) { return; }
std::stringstream ss;
ss << "EXPECT failed at " << file << ":" << line << std::endl;
throw std::runtime_error(ss.str());
}
class TestManager {
public:
TestManager()
{
load_library();
int flags{O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT};
mode_t mode{S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH};
_fd = open(_file_path.c_str(), flags, mode);
EXPECT(_fd != -1);
CUfileDescr_t desc{};
desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
desc.handle.fd = _fd;
CHECK_CUFILE(_func_handle_register(&_handle, &desc));
}
~TestManager()
{
_func_handle_deregister(_handle);
EXPECT(close(_fd) == 0);
std::cout << "test done" << std::endl;
}
private:
void load_library()
{
dlerror();
auto* cufile_lib_handle =
dlopen(_cufile_lib_path.c_str(), RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);
get_symbol(_func_handle_register, cufile_lib_handle, "cuFileHandleRegister");
get_symbol(_func_handle_deregister, cufile_lib_handle, "cuFileHandleDeregister");
}
template <typename F>
void get_symbol(F& func, void* cufile_lib_handle, std::string const& name)
{
dlerror();
func = reinterpret_cast<std::decay_t<F>>(dlsym(cufile_lib_handle, name.c_str()));
auto* err = dlerror();
if (err != nullptr) { throw std::runtime_error(err); }
}
int _fd{};
CUfileHandle_t _handle{};
std::string _cufile_lib_path{"/usr/local/cuda/targets/sbsa-linux/lib/libcufile.so.0"};
std::string _file_path{"/mnt/nvme/biu.bin"};
std::decay_t<decltype(cuFileHandleRegister)> _func_handle_register;
std::decay_t<decltype(cuFileHandleDeregister)> _func_handle_deregister;
};
int main()
{
TestManager tm;
return 0;
}
my_cufile.json
A copy of /etc/cufile.json, except that cufile_stats is assigned a positive value.
{
// NOTE : Application can override custom configuration via export CUFILE_ENV_PATH_JSON=<filepath>
// e.g : export CUFILE_ENV_PATH_JSON="/home/<xxx>/cufile.json"
"logging": {
// log directory, if not enabled will create log file under current working directory
//"dir": "/home/<xxxx>",
// NOTICE|ERROR|WARN|INFO|DEBUG|TRACE (in decreasing order of severity)
"level": "ERROR"
},
"profile": {
// nvtx profiling on/off
"nvtx": false,
// cufile stats level(0-3)
"cufile_stats": 3
},
"execution" : {
// max number of workitems in the queue;
"max_io_queue_depth": 128,
// max number of host threads per gpu to spawn for parallel IO
"max_io_threads" : 4,
// enable support for parallel IO
"parallel_io" : true,
// minimum IO threshold before splitting the IO
"min_io_threshold_size_kb" : 8192,
// maximum parallelism for a single request
"max_request_parallelism" : 4
},
"properties": {
// max IO chunk size (parameter should be multiples of 64K) used by cuFileRead/Write internally per IO request
"max_direct_io_size_kb" : 16384,
// device memory size (parameter should be 4K aligned) for reserving bounce buffers for the entire GPU
"max_device_cache_size_kb" : 131072,
// per-io bounce-buffer size (parameter should be multiples of 64K) ranging from 1024kb to 16384kb
// Note: ensure (max_device_cache_size_kb / per_buffer_cache_size_kb) >= io_batchsize
"per_buffer_cache_size_kb": 1024,
// limit on maximum device memory size (parameter should be 4K aligned) that can be pinned for a given process
"max_device_pinned_mem_size_kb" : 33554432,
// true or false (true will enable asynchronous io submission to nvidia-fs driver)
// Note : currently the overall IO will still be synchronous
"use_poll_mode" : false,
// maximum IO request size (parameter should be 4K aligned) within or equal to which library will use polling for IO completion
"poll_mode_max_size_kb": 4,
// allow p2pdma, this will enable use of cuFile without nvme patches
"use_pci_p2pdma": false,
// allow compat mode, this will enable use of cuFile posix read/writes
"allow_compat_mode": true,
// enable GDS write support for RDMA based storage
"gds_rdma_write_support": true,
// GDS batch size
"io_batchsize": 128,
// enable io priority w.r.t compute streams
// valid options are "default", "low", "med", "high"
"io_priority": "default",
// client-side rdma addr list for user-space file-systems(e.g ["10.0.1.0", "10.0.2.0"])
"rdma_dev_addr_list": [ ],
// load balancing policy for RDMA memory registration(MR), (RoundRobin, RoundRobinMaxMin)
// In RoundRobin, MRs will be distributed uniformly across NICS closest to a GPU
// In RoundRobinMaxMin, MRs will be distributed across NICS closest to a GPU
// with minimal sharing of NICS acros GPUS
"rdma_load_balancing_policy": "RoundRobin",
//32-bit dc key value in hex
//"rdma_dc_key": "0xffeeddcc",
//To enable/disable different rdma OPs use the below bit map
//Bit 0 - If set enables Local RDMA WRITE
//Bit 1 - If set enables Remote RDMA WRITE
//Bit 2 - If set enables Remote RDMA READ
//Bit 3 - If set enables REMOTE RDMA Atomics
//Bit 4 - If set enables Relaxed ordering.
//"rdma_access_mask": "0x1f",
// In platforms where IO transfer to a GPU will cause cross RootPort PCie transfers, enabling this feature
// might help improve overall BW provided there exists a GPU(s) with Root Port common to that of the storage NIC(s).
// If this feature is enabled, please provide the ip addresses used by the mount either in file-system specific
// section for mount_table or in the rdma_dev_addr_list property in properties section
"rdma_dynamic_routing": false,
// The order describes the sequence in which a policy is selected for dynamic routing for cross Root Port transfers
// If the first policy is not applicable, it will fallback to the next and so on.
// policy GPU_MEM_NVLINKS: use GPU memory with NVLink to transfer data between GPUs
// policy GPU_MEM: use GPU memory with PCIe to transfer data between GPUs
// policy SYS_MEM: use system memory with PCIe to transfer data to GPU
// policy P2P: use P2P PCIe to transfer across between NIC and GPU
"rdma_dynamic_routing_order": [ "GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P" ]
},
"fs": {
"generic": {
// for unaligned writes, setting it to true will, cuFileWrite use posix write internally instead of regular GDS write
"posix_unaligned_writes" : false
},
"beegfs" : {
// IO threshold for read/write (param should be 4K aligned)) equal to or below which cuFile will use posix read/write
"posix_gds_min_kb" : 0
// To restrict the IO to selected IP list, when dynamic routing is enabled
// if using a single BeeGFS mount, provide the ip addresses here
//"rdma_dev_addr_list" : []
// if using multiple lustre mounts, provide ip addresses used by respective mount here
//"mount_table" : {
// "/beegfs/client1" : {
// "rdma_dev_addr_list" : ["172.172.1.40", "172.172.1.42"]
// },
// "/beegfs/client2" : {
// "rdma_dev_addr_list" : ["172.172.2.40", "172.172.2.42"]
// }
//}
},
"lustre": {
// IO threshold for read/write (param should be 4K aligned)) equal to or below which cuFile will use posix read/write
"posix_gds_min_kb" : 0
// To restrict the IO to selected IP list, when dynamic routing is enabled
// if using a single lustre mount, provide the ip addresses here (use : sudo lnetctl net show)
//"rdma_dev_addr_list" : []
// if using multiple lustre mounts, provide ip addresses used by respective mount here
//"mount_table" : {
// "/lustre/ai200_01/client" : {
// "rdma_dev_addr_list" : ["172.172.1.40", "172.172.1.42"]
// },
// "/lustre/ai200_02/client" : {
// "rdma_dev_addr_list" : ["172.172.2.40", "172.172.2.42"]
// }
//}
},
"nfs": {
// To restrict the IO to selected IP list, when dynamic routing is enabled
//"rdma_dev_addr_list" : []
//"mount_table" : {
// "/mnt/nfsrdma_01/" : {
// "rdma_dev_addr_list" : []
// },
// "/mnt/nfsrdma_02/" : {
// "rdma_dev_addr_list" : []
// }
//}
},
"gpfs": {
//allow GDS writes with GPFS
"gds_write_support": false,
//allow Async support
"gds_async_support": true
//"rdma_dev_addr_list" : []
//"mount_table" : {
// "/mnt/gpfs_01" : {
// "rdma_dev_addr_list" : []
// },
// "/mnt/gpfs_02/" : {
// "rdma_dev_addr_list" : []
// }
//}
},
"weka": {
// enable/disable RDMA write
"rdma_write_support" : false
}
},
"denylist": {
// specify list of vendor driver modules to deny for nvidia-fs (e.g. ["nvme" , "nvme_rdma"])
"drivers": [ ],
// specify list of block devices to prevent IO using cuFile (e.g. [ "/dev/nvme0n1" ])
"devices": [ ],
// specify list of mount points to prevent IO using cuFile (e.g. ["/mnt/test"])
"mounts": [ ],
// specify list of file-systems to prevent IO using cuFile (e.g ["lustre", "wekafs"])
"filesystems": [ ]
},
"miscellaneous": {
// enable only for enforcing strict checks at API level for debugging
"api_check_aggressive": false
}
}
Run this script to compile biu.cpp and observe the segfault. Note that CUFILE_ENV_PATH_JSON is set to my_cufile.json.
run_bad.sh
#!/usr/bin/env bash
rm -rf biu_check_bad_tmp
mkdir biu_check_bad_tmp
g++ -isystem /usr/local/cuda/targets/sbsa-linux/include -g -o biu_check_bad_tmp/biu.cpp.o -c biu.cpp
g++ -g biu_check_bad_tmp/biu.cpp.o -o biu_check_bad \
-Wl,-rpath,/usr/local/cuda-12.8/targets/sbsa-linux/lib \
-ldl
export CUFILE_ALLOW_COMPAT_MODE=false
export CUFILE_FORCE_COMPAT_MODE=false
export CUFILE_ENV_PATH_JSON=my_cufile.json
test_bin=./biu_check_bad
$test_bin
Sample output:
test done
./run_bad.sh: line 18: 1166604 Segmentation fault (core dumped) $test_bin
Run this script to compile biu.cpp. The segfault ceases to exist once the cuFile library is linked to the program.
run_good.sh
#!/usr/bin/env bash
rm -rf biu_check_good_tmp
mkdir biu_check_good_tmp
g++ -isystem /usr/local/cuda/targets/sbsa-linux/include -g -o biu_check_good_tmp/biu.cpp.o -c biu.cpp
g++ -g biu_check_good_tmp/biu.cpp.o -o biu_check_good \
-Wl,-rpath,/usr/local/cuda-12.8/targets/sbsa-linux/lib \
-ldl \
-L /usr/local/cuda/targets/sbsa-linux/lib/ -lcufile
export CUFILE_ALLOW_COMPAT_MODE=false
export CUFILE_FORCE_COMPAT_MODE=false
export CUFILE_ENV_PATH_JSON=my_cufile.json
test_bin=./biu_check_good
$test_bin
Sample output:
test done
@kingcrimsontianyu Thank you for this! I have a question: what’s the rationale behind KVIKIO’s use of dynamic loading?
Previously, I tested with raw cufile calls to identify the issue, as mentioned above. I linked against the shared object at build time, the same approach as in the run_good case you provided.
@kingcrimsontianyu Thank you for this! I have a question: what’s the rationale behind KVIKIO’s use of dynamic loading?
We want to support setups that don't have cuFile/GDS installed. cuFile supports ARM64 as of CUDA toolkit v12.3+, so we might be able to use regular dynamic or static linking in the future, but for now we need to support systems without cuFile.
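The runtime-loading pattern kvikio uses is the Python-level equivalent of dlopen/dlsym. A stand-in sketch using libm instead of libcufile (purely illustrative; kvikio's real loader lives in its C++ layer):

```python
import ctypes
import ctypes.util

# Locate and load a shared library at runtime, as kvikio does with
# libcufile.so; libm stands in here so the sketch runs anywhere.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)

# Resolve a symbol and declare its signature, mirroring dlsym().
cos = libm.cos
cos.restype = ctypes.c_double
cos.argtypes = [ctypes.c_double]

print(cos(0.0))  # 1.0
```

Because nothing is passed to the linker at build time, the program still starts when the library is absent; the trade-off, as this thread shows, is that the library's own exit-time destructors run in a different order than with link-time loading.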
Following this thread. I am running into a similar issue. Python processes close with a segfault if cufile_stats > 0, but only when kvikio and cuFile compatibility mode are set to false. This is with CUDA version 12.6, GDS release version 1.11.1.6, nvidia_fs version 2.25, and libcufile version 2.12 on the x86_64 platform.
cufile_stats = 0
(kvikio-venv) [strugf@gpu0300 CuFile]$ python
Python 3.11.1 (main, May 15 2025, 16:33:59) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import kvikio.cufile_driver
>>> kvikio.cufile_driver.driver_open()
>>> kvikio.cufile_driver.driver_close()
>>> exit()
(kvikio-venv) [strugf@gpu0300 CuFile]$
Cufile_stats = 3
(kvikio-venv) [strugf@gpu0300 CuFile]$ python
Python 3.11.1 (main, May 15 2025, 16:33:59) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import kvikio.cufile_driver
>>> kvikio.cufile_driver.driver_open()
>>> kvikio.cufile_driver.driver_close()
>>> exit()
Segmentation fault (core dumped)
This has been fixed in CUDA-13.0 which would be available pretty soon.