kvikio

Support Cloud Solution Provider

Open madsbk opened this issue 3 years ago • 5 comments

cufile.so might crash when used within a VM in the cloud. KvikIO should detect this and fall back to its own implementation.

@gigony

madsbk avatar Aug 01 '22 07:08 madsbk

Out of curiosity, why would it crash on a cloud VM?

jacobtomlinson avatar Aug 02 '22 08:08 jacobtomlinson

I don't know. @gigony, do you know?

madsbk avatar Aug 02 '22 10:08 madsbk

It seems that the logic in the cuFileDriverOpen() method assumes specific device mounts and crashes when that assumption fails. The same happens on WSL2. I shared the information with the GDS team, and it is a bug. A bug has been filed to address the issue.

  • https://nvidia.slack.com/archives/CJ5FK152R/p1658945158190859
  • https://nvidia.slack.com/archives/CJ5FK152R/p1658945626109219?thread_ts=1658945272.452739&cid=CJ5FK152R
  • https://nvidia.slack.com/archives/CJ5FK152R/p1658946372829689?thread_ts=1658945769.993249&cid=CJ5FK152R
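
For illustration only, here is a minimal sketch of one way a caller could guard against a crash like this: run the driver-open call in a child process, so an assertion failure or abort in the native library cannot take down the parent, and fall back to plain POSIX IO if the child does not exit cleanly. This is not KvikIO's actual detection logic. The helper names `probe_ok` and `choose_backend` are hypothetical, and the library name `libcufile.so.0` and the two-integer layout of the returned status struct are assumptions based on `cufile.h` that have not been verified here.

```python
import subprocess
import sys

# Hypothetical probe script: load libcufile and call cuFileDriverOpen().
# Assumptions: the shared library is named "libcufile.so.0" and the call
# returns a struct of two ints whose first field is 0 on success.
CUFILE_PROBE = r"""
import ctypes

class CUfileError(ctypes.Structure):
    _fields_ = [("err", ctypes.c_int), ("cu_err", ctypes.c_int)]

lib = ctypes.CDLL("libcufile.so.0")
lib.cuFileDriverOpen.restype = CUfileError
status = lib.cuFileDriverOpen()
raise SystemExit(0 if status.err == 0 else 1)
"""


def probe_ok(probe_code: str, timeout: float = 30.0) -> bool:
    """Run probe_code in a child interpreter; True only if it exits with 0."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe_code],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # A crash (e.g. SIGABRT from a failed assertion) gives a non-zero code.
    return result.returncode == 0


def choose_backend(probe_code: str = CUFILE_PROBE) -> str:
    """Return "cufile" if the probe survives, otherwise "posix"."""
    return "cufile" if probe_ok(probe_code) else "posix"


if __name__ == "__main__":
    print("selected backend:", choose_backend())
```

The cost of this approach is one extra process launch at startup. The benefit is that a missing library, a non-zero status, and a hard crash are all handled the same way.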

gigony avatar Aug 02 '22 17:08 gigony

xref: https://github.com/rapidsai/cucim/issues/346

madsbk avatar Sep 05 '22 15:09 madsbk

Hello @madsbk @gigony, has this issue been resolved? I'm using CUDA 11.7 and still facing the error when installing GDS on a VM:

============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 =========
 GPU INFO:
 =========
 GPU index 0 Tesla V100-PCIE-16GB bar:1 bar size (MiB):16384 supports GDS
 ==============
 PLATFORM INFO:
 ==============
Assertion failure, file index :cufio-udev  line :134

UTKRISHTPATESARIA avatar Apr 07 '23 14:04 UTKRISHTPATESARIA

AFAICT, KvikIO should detect this now.
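
If the automatic detection does not kick in on a particular VM, compatibility mode can also be requested explicitly. A minimal sketch, assuming the `KVIKIO_COMPAT_MODE` environment variable and `kvikio.defaults.compat_mode()` are available in the installed KvikIO version (newer releases may expose this setting differently):

```python
import os

# Request KvikIO's own POSIX-based implementation instead of cuFile.
# Set this before kvikio is imported so the setting is picked up.
os.environ["KVIKIO_COMPAT_MODE"] = "ON"

try:
    import kvikio.defaults
except ImportError:
    # kvikio is not installed in this environment; nothing to check.
    kvikio = None

if kvikio is not None:
    # Assumption: compat_mode() reports the currently active setting.
    print("compat mode:", kvikio.defaults.compat_mode())
```

Setting the variable in the shell before launching the process has the same effect and avoids any dependence on import order.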

madsbk avatar Jun 26 '24 13:06 madsbk