
DALI GDS with fn.readers.numpy

SuperJarvis opened this issue · 2 comments

Describe the question.

My training data consists of a large number of .npy files, each around 50 KB in size with shape (20, 700), and no preprocessing is required. When I compare DALI's GDS-backed fn.readers.numpy against a regular dataloader (which loads into CPU memory and then transfers a batch to the GPU), DALI is not faster. I also tried to improve performance by tuning parameters such as prefetch_queue_depth and dont_use_mmap, as suggested in the documentation. Interestingly, the read speed shown in iotop sometimes increased (from roughly 80 MB/s to 110 MB/s), but there was no significant improvement in the tqdm rate or the actual runtime, and I did not see CPU or GPU usage becoming a bottleneck.

Additionally, with persistent_workers=True in the dataloader, the second epoch is noticeably faster, while DALI's runtime stays the same for every epoch. I also set os.environ['DALI_GDS_CHUNK_SIZE'] to '64k', which is the fastest setting for this dataset. I really want to know how to leverage DALI's advantages in loading and would appreciate any help. Below is the code for testing DALI:

# Required imports
import os
import time

from tqdm import tqdm
from nvidia.dali import pipeline_def, fn
from nvidia.dali.plugin.pytorch import DALIGenericIterator

os.environ['DALI_GDS_CHUNK_SIZE'] = '64k'

data_dir = '/mnt/nvme/data/model'

@pipeline_def(batch_size=1024, num_threads=16, device_id=0)
def pipe_gds():
    # GPU-side numpy reader using GDS to read .npy files straight into GPU memory
    data = fn.readers.numpy(device='gpu', file_root=data_dir, file_filter='*.npy',
                            name='Reader', prefetch_queue_depth=16, dont_use_mmap=True)
    return data

dali_iter = DALIGenericIterator([pipe_gds()], ['feature'], reader_name='Reader', dynamic_shape=True)

for _ in range(20):
    start = time.time()
    for batch in tqdm(dali_iter, ncols=100):
        pass
    print(time.time() - start)
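For comparison, here is a minimal sketch of the kind of plain PyTorch DataLoader baseline described above; NpyFolderDataset and the exact DataLoader settings are illustrative assumptions, not the original baseline code:

import glob
import os
import time

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NpyFolderDataset(Dataset):
    # Loads one small (20, 700) .npy file per sample; no preprocessing.
    def __init__(self, root):
        self.files = sorted(glob.glob(os.path.join(root, '*.npy')))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return torch.from_numpy(np.load(self.files[idx]))

loader = DataLoader(NpyFolderDataset('/mnt/nvme/data/model'),
                    batch_size=1024, num_workers=16,
                    pin_memory=True, persistent_workers=True)

for epoch in range(2):
    start = time.time()
    for batch in loader:
        batch = batch.cuda(non_blocking=True)  # transfer a whole batch to the GPU
    print(f'epoch {epoch}: {time.time() - start:.2f}s')

With persistent_workers=True the worker processes survive between epochs, and the files stay in the OS page cache, which matches the observation that the second epoch is noticeably faster.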

Check for duplicates

  • [X] I have searched the open bugs/issues and have found no duplicates for this bug report

SuperJarvis · Oct 11, 2023

Hi @SuperJarvis,

Thank you for reaching out. First of all, GDS works best for bigger files: to make performance optimal, each file is registered with GDS, and this imposes additional overhead, so for many small reads it doesn't pay off. What you can also do is try out the use_o_direct option and read the data on the CPU.
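A minimal sketch of that suggestion, reusing the parameters from the snippet above; use_o_direct applies to the CPU reader and, as far as I know, needs dont_use_mmap=True, so treat the exact combination as an assumption to verify against the DALI documentation:

from nvidia.dali import pipeline_def, fn
from nvidia.dali.plugin.pytorch import DALIGenericIterator

data_dir = '/mnt/nvme/data/model'

@pipeline_def(batch_size=1024, num_threads=16, device_id=0)
def pipe_cpu_odirect():
    # CPU-side reader with unbuffered (O_DIRECT) reads instead of GDS
    data = fn.readers.numpy(device='cpu', file_root=data_dir, file_filter='*.npy',
                            name='Reader',
                            dont_use_mmap=True,   # assumed prerequisite for O_DIRECT
                            use_o_direct=True)    # read directly from storage, bypassing the page cache
    return data.gpu()  # copy the batch to the GPU inside the pipeline

dali_iter = DALIGenericIterator([pipe_cpu_odirect()], ['feature'], reader_name='Reader')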

JanuszL · Oct 11, 2023

Hello @SuperJarvis

Additionally, with persistent_workers=True in the dataloader, the speed in the second epoch noticeably increased, while DALI seems to have consistent runtime for each epoch

This indicates that your dataset is small enough to fit in RAM, and the subsequent passes just take the data from the OS cache. If that's the case, GDS and O_DIRECT will actually slow things down because, as the name suggests, they use direct (unbuffered) I/O. They are best suited for very large datasets that cannot benefit from caching and for which OS buffering is just avoidable overhead.
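A quick way to check whether that is the case is to compare the total dataset size with the available memory; a rough sketch (psutil is just one convenient way to read free memory and is an assumption here):

import glob
import os

import psutil

data_dir = '/mnt/nvme/data/model'  # same directory as in the snippets above

total = sum(os.path.getsize(f) for f in glob.glob(os.path.join(data_dir, '*.npy')))
avail = psutil.virtual_memory().available

print(f'dataset: {total / 2**30:.1f} GiB, available RAM: {avail / 2**30:.1f} GiB')
if total < avail:
    print('The dataset fits in RAM, so buffered CPU reads are served from the '
          'page cache after the first epoch and GDS/O_DIRECT will not help.')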

mzient · Oct 11, 2023