DALI: how many temp files are created during the data preparation and training phases, and are they very short-lived?

Describe the question.

Hello dear NV expert,

I would like to understand how many temp files are created during AI training (the data preparation and model training phases), how large they are, and what their lifetime is. Are they very short-lived files?

Could you please point me to a table, if one exists, that shows this kind of information for different AI scenarios?

Check for duplicates

[X] I have searched the open bugs/issues and have found no duplicates for this bug report

Nov 11 '23 15:11 gaowayne

Hi @gaowayne,

Thank you for reaching out. If I understand your question about the temporary files DALI uses correctly, I would say it doesn't use any, besides the shared memory pseudo-files created by the parallel external source. However, those consume RAM, not disk space. Can you tell me more about why you are concerned about this particular aspect? Have you observed anything that causes issues during your network training process?
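To illustrate what a shared memory pseudo-file looks like in general, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module. It is not DALI's internal implementation; it only assumes a Linux host, where such segments usually show up under `/dev/shm` and are backed by RAM rather than disk. The variable names are made up for the example.

```python
import os
from multiprocessing import shared_memory

# Create a 1 MiB shared memory segment (illustrative size).
shm = shared_memory.SharedMemory(create=True, size=1024 * 1024)
try:
    # Write a few bytes into the segment and read them back.
    shm.buf[:5] = b"hello"
    first_bytes = bytes(shm.buf[:5])

    # Assumption: on Linux the segment is visible as a pseudo-file here.
    shm_path = os.path.join("/dev/shm", shm.name.lstrip("/"))
    visible_as_file = os.path.exists(shm_path)
finally:
    # Release the mapping and remove the segment; nothing is left behind.
    shm.close()
    shm.unlink()

print("visible under /dev/shm:", visible_as_file)
print("still present after unlink:", os.path.exists(shm_path))
```

The segment lives only as long as the processes that use it keep it around, so from a storage placement point of view it never reaches an SSD.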

Nov 13 '23 08:11 JanuszL

Thank you so much, JanuszL! I would like to understand what kinds of temp files are created during the AI training process and what their lifetimes are. For example, we have different SSD types (SLC, TLC, QLC), and we can place data on them correctly based on file lifetime.

If a source dataset has just been ingested into the storage system, it can be written directly to a QLC device, since the source dataset is static data. Distributed file system WAL/metadata can be written to SLC, because its stable latency and small capacity are a good fit. AI training checkpoints, of which I think we only need to keep the last N, are temp files too; since they need higher storage bandwidth, we can write them to gen5 TLC.

This is my rough idea, but I do not know what kinds of temp files are created during the data preparation, training, and inference steps. Could you please shed some light on this? For example, does DALI split video or resize images from the source dataset into a format that training needs? If you prefer to talk offline, here is my email address: [email protected]
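The placement idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PLACEMENT` table, the `prune_checkpoints` helper, and the `ckpt_*.bin` naming scheme are hypothetical and do not come from DALI or any training framework.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping from data class to media tier, following the idea above.
PLACEMENT = {
    "source_dataset": "QLC",       # static data, written once
    "dfs_wal_metadata": "SLC",     # small, latency-sensitive
    "checkpoint": "gen5 TLC",      # large sequential writes, only last N kept
}


def prune_checkpoints(ckpt_dir, keep_last):
    """Delete all but the newest `keep_last` checkpoints; return deleted paths."""
    # Zero-padded step numbers sort correctly as plain strings.
    ckpts = sorted(ckpt_dir.glob("ckpt_*.bin"))
    stale = ckpts[:-keep_last] if keep_last > 0 else ckpts
    for path in stale:
        path.unlink()
    return stale


# Usage: write five dummy checkpoints, then keep only the last two.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp)
    for step in range(5):
        (ckpt_dir / f"ckpt_{step:06d}.bin").write_bytes(b"\0")
    deleted = prune_checkpoints(ckpt_dir, keep_last=2)
    remaining = sorted(p.name for p in ckpt_dir.iterdir())

print("tier for checkpoints:", PLACEMENT["checkpoint"])
print("remaining:", remaining)
```

With a rotation like this, each checkpoint's lifetime is roughly N checkpoint intervals, which is the kind of number a placement policy would need.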

Nov 13 '23 12:11 gaowayne

Hi @gaowayne,

I'm afraid I'm not able to provide a broad overview of all the files that can be created during training, as everything depends on the network architecture and the particular implementation. Regarding data processing with DALI: as I mentioned, DALI doesn't create any intermediate files, and it processes all the data online, on the fly. Regarding dataset storage, I recommend something that is fast to read, for example a local RAID. While DALI can accelerate the processing in many cases, it cannot do much about the read speed of the storage itself.

Nov 13 '23 12:11 JanuszL

I see, thank you so much @JanuszL. SLC/TLC/QLC drives can offer 6 GB/s reads, and gen5 TLC can exceed 10 GB/s, so I think read bandwidth should be enough for the RDMA NIC and GPU and will not be the bottleneck for the whole system. On the write path, however, if we place data correctly on different media, we can make better use of each medium's advantages, save cost, and reduce the SSD's internal WAF (write amplification factor). Could you help connect me with an NVIDIA expert on data preparation, training, and inference? I would like to understand the writes in detail. :)
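For reference, WAF is commonly defined as the bytes physically written to NAND divided by the bytes the host asked to write. A small sketch of that arithmetic follows; the function name and the byte counts are made up for illustration and are not measurements from any real drive.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """Return WAF = NAND bytes written / host bytes written."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


# Illustrative numbers only: the host wrote 100 GB, the drive wrote 250 GB to NAND.
GB = 10**9
waf = write_amplification(250 * GB, 100 * GB)
print(f"WAF = {waf:.2f}")
```

Grouping data with similar lifetimes on the same media is one way to push this ratio closer to 1, because whole blocks then become invalid together and need less garbage collection.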

Nov 13 '23 12:11 gaowayne

I think you can start by asking a question on our NVIDIA dev forum and see how that goes.

Nov 13 '23 12:11 JanuszL