FIO deleting files when ioengine is libcufile
Please acknowledge the following before creating a ticket
- [x] I have read the GitHub issues section of REPORTING-BUGS.
Description of the bug: When I run a test against a filesystem with ioengine=libcufile, fio deletes the existing files before starting the workload, even with overwrite=1.
Environment: NVIDIA DGX A100, DGX OS 4.99 (Ubuntu 18.04)
root@host:~# gds-tools/gdscheck -p
GDS release version (beta): 0.95.0.58
nvidia_fs version: 2.6 libcufile version: 2.3
cuFile CONFIGURATION:
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
SCALEFLUX CSD : Unsupported
NVMesh : Unsupported
LUSTRE : Unsupported
GPFS : Unsupported
NFS : Unsupported
WEKAFS : Unsupported
USERSPACE RDMA : Unsupported
--MOFED peer direct : enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
properties.use_compat_mode : 1
properties.use_poll_mode : 0
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : 0
fs.lustre.posix_gds_min_kb: 0
fs.weka.rdma_write_support: 0
profile.nvtx : 0
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : 0
GPU INFO:
GPU index 0 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
GPU index 1 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
GPU index 2 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
GPU index 3 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
GPU index 4 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
GPU index 5 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
GPU index 6 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
GPU index 7 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
IOMMU : enabled
Platform verification succeeded
fio version: fio-3.26-23-g6202c-dirty
Reproduction steps: Run the same job twice with overwrite enabled, first with file creation allowed, then with it disallowed:
root@host# /root/fio/fio --ioengine=libcufile --cuda_io=cufile --direct=1 --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread --filename_format=\$jobname --overwrite=1 --size=1Gi --group_reporting --name=gpu0 --directory=/gds/fio/172.16.18.24 --gpu_dev_ids=0 --numa_cpu_nodes=0 --allow_file_create=1 --output=/root/allow_file_create.txt
root@host# /root/fio/fio --ioengine=libcufile --cuda_io=cufile --direct=1 --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread --filename_format=\$jobname --overwrite=1 --size=1Gi --group_reporting --name=gpu0 --directory=/gds/fio/172.16.18.24 --gpu_dev_ids=0 --numa_cpu_nodes=0 --allow_file_create=0 --output=/root/not_allow_file_create.txt
fio: file creation disallowed by allow_file_create=0
When I run with --ioengine=posixaio, the files are kept and used as expected:
root@host# /root/fio/fio --ioengine=posixaio --direct=1 --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread --filename_format=\$jobname --overwrite=1 --size=1Gi --group_reporting --name=gpu0 --directory=/gds/fio/172.16.18.24 --numa_cpu_nodes=0 --allow_file_create=1 --output=/root/allow_file_create_posixaio.txt
root@host# /root/fio/fio --ioengine=posixaio --direct=1 --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread --filename_format=\$jobname --overwrite=1 --size=1Gi --group_reporting --name=gpu0 --directory=/gds/fio/172.16.18.24 --numa_cpu_nodes=0 --allow_file_create=0 --output=/root/not_allow_file_create_posixaio.txt
(no error returned here)
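To confirm whether the data file really disappears during the second run, a small wrapper can check for the file before and after the command. This is only a sketch: `check_file_kept` is a hypothetical helper, not part of fio, and the data file path in the usage comment assumes that `--filename_format=$jobname` with `--name=gpu0` resolves to `gpu0` inside the `--directory` path.

```shell
#!/bin/sh
# check_file_kept FILE CMD [ARGS...]   (hypothetical helper, not part of fio)
# Runs CMD and prints whether FILE still exists afterwards:
#   "kept", "deleted", or "missing-before" if FILE was absent to begin with.
check_file_kept() {
    file=$1
    shift
    if [ ! -e "$file" ]; then
        echo "missing-before"
        return 2
    fi
    # Send the command's own stdout to stderr so only the verdict is on stdout.
    # The command's exit status is ignored; only the file's presence matters here.
    "$@" >&2 || true
    if [ -e "$file" ]; then
        echo "kept"
        return 0
    fi
    echo "deleted"
    return 1
}

# Usage (assumed data file path: <directory>/<jobname>):
#   check_file_kept /gds/fio/172.16.18.24/gpu0 \
#       /root/fio/fio --ioengine=libcufile ... --allow_file_create=0
```

Running it once with the libcufile command line and once with the posixaio one should show whether the two engines differ in what happens to the file, independent of the error message fio prints.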
Unfortunately, I don't currently have access to the hardware needed to test this. However, I can confirm that I have seen this problem. It's a little weird, because the libcufile engine doesn't make any decisions about when to lay out files.
I'm having a similar problem with the psync engine against BeeGFS. The --readonly option (command line only; it does not work in a job file) appears to have corrected that issue. Can you give --readonly a shot with libcufile?
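For reference, the suggested retry could look roughly like the sketch below: the failing libcufile job from the report with `--readonly` added on the command line. The function name `retry_with_readonly` and the `FIO` and `DRY_RUN` variables are assumptions made for illustration, and `--output` is omitted so results print to the terminal. Whether `--readonly` changes the behavior with libcufile is exactly what the run would need to show; nothing here verifies it.

```shell
#!/bin/sh
# retry_with_readonly: the failing libcufile job from the report, plus --readonly.
# FIO and DRY_RUN are hypothetical knobs for this sketch.
# DRY_RUN=1 (the default) only prints the command line; DRY_RUN=0 executes it.
retry_with_readonly() {
    fio_bin=${FIO:-/root/fio/fio}
    # Single quotes keep $jobname literal so fio expands it, not the shell.
    set -- "$fio_bin" --readonly \
        --ioengine=libcufile --cuda_io=cufile --direct=1 \
        --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread \
        '--filename_format=$jobname' --overwrite=1 --size=1Gi \
        --group_reporting --name=gpu0 --directory=/gds/fio/172.16.18.24 \
        --gpu_dev_ids=0 --numa_cpu_nodes=0 --allow_file_create=0
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

retry_with_readonly
```

Set `DRY_RUN=0` on a host with fio and GDS installed to actually run the job.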