Support Cloud Solution Provider
cufile.so might crash when used within a VM in the cloud.
KvikIO should detect this and fall back to its own implementation.
@gigony
Out of curiosity, why would it crash on a cloud VM?
I don't know. @gigony, do you know?
It seems that the logic in cuFileDriverOpen() assumes specific device mounts and crashes when that assumption fails.
It is the same for WSL2.
I shared the information with the GDS team, and it is a bug. I filed a bug report to address the issue.
- https://nvidia.slack.com/archives/CJ5FK152R/p1658945158190859
- https://nvidia.slack.com/archives/CJ5FK152R/p1658945626109219?thread_ts=1658945272.452739&cid=CJ5FK152R
- https://nvidia.slack.com/archives/CJ5FK152R/p1658946372829689?thread_ts=1658945769.993249&cid=CJ5FK152R
xref: https://github.com/rapidsai/cucim/issues/346
Hello @madsbk @gigony, has this issue been resolved? I'm using CUDA 11.7 and still facing the error when installing GDS on a VM:
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
=========
GPU INFO:
=========
GPU index 0 Tesla V100-PCIE-16GB bar:1 bar size (MiB):16384 supports GDS
==============
PLATFORM INFO:
==============
Assertion failure, file index :cufio-udev line :134
AFAICT, KvikIO should detect this now.
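For anyone who wants to guard their own code against this crash, here is a minimal sketch of what such a detection could look like. It is not taken from KvikIO's source. It assumes two heuristics suggested by this thread: WSL2 can be recognized by "microsoft" appearing in /proc/version, and the `cufio-udev` assertion is related to udev data under /run/udev being unreadable. The function names and the /run/udev path are hypothetical choices for illustration.

```python
import os


def looks_like_wsl(proc_version_text: str) -> bool:
    """Return True if the kernel version string mentions Microsoft (assumed WSL marker)."""
    return "microsoft" in proc_version_text.lower()


def should_skip_cufile(proc_version_text: str, udev_readable: bool) -> bool:
    """Heuristic (assumption): skip cuFile on WSL or when udev data is unreadable."""
    return looks_like_wsl(proc_version_text) or not udev_readable


def probe_host(proc_version_path: str = "/proc/version",
               udev_dir: str = "/run/udev") -> bool:
    """Probe the current host and report whether cuFile should be skipped."""
    try:
        with open(proc_version_path, encoding="utf-8", errors="replace") as f:
            version_text = f.read()
    except OSError:
        # No readable /proc/version (e.g. non-Linux host): treat as "not WSL".
        version_text = ""
    # Assumed path; the thread only tells us the assertion is in "cufio-udev".
    udev_readable = os.access(udev_dir, os.R_OK)
    return should_skip_cufile(version_text, udev_readable)


if __name__ == "__main__":
    print("skip cuFile:", probe_host())
```

If `probe_host()` returns True, the caller would avoid opening the cuFile driver and use plain POSIX I/O instead. With KvikIO itself, setting the `KVIKIO_COMPAT_MODE` environment variable to `ON` before importing the library should force that fallback path.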