FIO deleting files when ioengine is libcufile

Open wvaske opened this issue 4 years ago • 1 comments

Please acknowledge the following before creating a ticket

  • [x] I have read the GitHub issues section of REPORTING-BUGS.

Description of the bug: When I run a test against a filesystem with ioengine=libcufile, fio deletes the existing files before starting the workload, even when overwrite=1 is set.

Environment: NVIDIA DGX A100 DGX OS 4.99 (Ubuntu 18.04)

root@host:~# gds-tools/gdscheck -p
 GDS release version (beta): 0.95.0.58
 nvidia_fs version:  2.6 libcufile version: 2.3
 cuFile CONFIGURATION:
 NVMe           : Supported
 NVMeOF         : Supported
 SCSI           : Unsupported
 SCALEFLUX CSD  : Unsupported
 NVMesh         : Unsupported
 LUSTRE         : Unsupported
 GPFS           : Unsupported
 NFS            : Unsupported
 WEKAFS         : Unsupported
 USERSPACE RDMA : Unsupported
 --MOFED peer direct  : enabled
 --rdma library       : Not Loaded (libcufile_rdma.so)
 --rdma devices       : Not configured
 --rdma_device_status : Up: 0 Down: 0
 properties.use_compat_mode : 1
 properties.use_poll_mode : 0
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : 0
 fs.lustre.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: 0
 profile.nvtx : 0
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : 0
 GPU INFO:
 GPU index 0 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 1 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 2 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 3 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 4 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 5 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 6 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 7 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 IOMMU : enabled
 Platform verification succeeded

fio version: fio-3.26-23-g6202c-dirty

Reproduction steps: Run the same job twice with overwrite enabled, first with file creation allowed and then with it disallowed:

root@host# /root/fio/fio --ioengine=libcufile --cuda_io=cufile --direct=1 --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread --filename_format=\$jobname --overwrite=1 --size=1Gi --group_reporting  --name=gpu0 --directory=/gds/fio/172.16.18.24 --gpu_dev_ids=0 --numa_cpu_nodes=0 --allow_file_create=1 --output=/root/allow_file_create.txt

root@host# /root/fio/fio --ioengine=libcufile --cuda_io=cufile --direct=1 --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread --filename_format=\$jobname --overwrite=1 --size=1Gi --group_reporting  --name=gpu0 --directory=/gds/fio/172.16.18.24 --gpu_dev_ids=0 --numa_cpu_nodes=0 --allow_file_create=0 --output=/root/not_allow_file_create.txt
fio: file creation disallowed by allow_file_create=0

When I run with --ioengine=posixaio, the files are kept and used as expected:

root@host# /root/fio/fio --ioengine=posixaio --direct=1 --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread --filename_format=\$jobname --overwrite=1 --size=1Gi --group_reporting  --name=gpu0 --directory=/gds/fio/172.16.18.24 --numa_cpu_nodes=0 --allow_file_create=1 --output=/root/allow_file_create_posixaio.txt

root@host# /root/fio/fio --ioengine=posixaio --direct=1 --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread --filename_format=\$jobname --overwrite=1 --size=1Gi --group_reporting  --name=gpu0 --directory=/gds/fio/172.16.18.24 --numa_cpu_nodes=0 --allow_file_create=0 --output=/root/not_allow_file_create_posixaio.txt
(no error returned here)
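
One way to confirm independently whether a run leaves the data file in place is to compare the file's inode, size and modification time before and after the run. The following is a minimal sketch, not part of fio: the helper names `file_sig` and `check_file_kept` are hypothetical, and it assumes GNU `stat` (as on Ubuntu). It only makes sense for a read-only workload such as the randread job above, since a write workload changes the modification time anyway.

```shell
# Print a signature for a file: inode, size and mtime (GNU stat assumed).
# Prints "missing" if the file does not exist.
file_sig() {
    stat -c '%i %s %y' "$1" 2>/dev/null || echo missing
}

# check_file_kept FILE CMD [ARGS...]
# Runs CMD and prints "kept" if FILE has the same signature afterwards,
# "changed" otherwise. Output of CMD is sent to stderr so that stdout
# carries only the verdict.
check_file_kept() {
    target=$1
    shift
    before=$(file_sig "$target")
    "$@" >&2 || echo "command exited with status $?" >&2
    after=$(file_sig "$target")
    if [ "$before" = "$after" ]; then
        echo "kept"
    else
        echo "changed"
        echo "  before: $before" >&2
        echo "  after:  $after" >&2
    fi
}
```

With the options above, the data file is presumably /gds/fio/172.16.18.24/gpu0, so the check would be invoked as `check_file_kept /gds/fio/172.16.18.24/gpu0 /root/fio/fio --ioengine=libcufile ...` with the rest of the fio options unchanged.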

wvaske • Apr 02 '21 15:04

Unfortunately, I don't currently have access to the hardware needed to test this. However, I can confirm that I have seen this problem. It's a little weird, because the libcufile engine doesn't make any decisions about when to lay out files.

I'm having a similar problem with the psync engine against BeeGFS. The --readonly option (which must be given on the command line; it does not work in the job file) appears to have corrected that issue. Can you give --readonly a shot with libcufile?
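
For reference, that test would amount to something like the sketch below. It reuses the options from the failing libcufile invocation in the report and only adds --readonly on the command line. The wrapper function name and the FIO variable are illustrative, the --output option is left out so results go to stdout unless passed as an extra argument, and whether --readonly changes the layout behaviour with libcufile is untested here.

```shell
# Path to the fio binary; defaults to the one used in the report.
FIO="${FIO:-/root/fio/fio}"

# Re-run the failing libcufile job with --readonly added.
# Extra arguments (for example --output=FILE) are appended unchanged.
run_libcufile_readonly() {
    "$FIO" --readonly \
        --ioengine=libcufile --cuda_io=cufile --direct=1 \
        --runtime=20 --time_based=1 --numjobs=1 --bs=4k --rw=randread \
        --filename_format='$jobname' --overwrite=1 --size=1Gi \
        --group_reporting --name=gpu0 \
        --directory=/gds/fio/172.16.18.24 \
        --gpu_dev_ids=0 --numa_cpu_nodes=0 \
        --allow_file_create=0 "$@"
}
```

If the run no longer fails with "file creation disallowed by allow_file_create=0", the existing file was used rather than deleted and laid out again.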

bsmith94 • Jun 12 '21 01:06