
Question: Order in which files are fetched

mritunjaymusale opened this issue 3 years ago · 4 comments

Hi, I'm using BDD100k images stored on gdrive and fetch them using DALI + drive.mount() provided by Colab. Since the mounting and fetching process from drive is async and files aren't fetched sequentially, will DALI order the output tensors based on the availability of files, or will it enforce a sequential order when fetching the files from drive? If my question isn't clear, here's my code snippet:

gdrive_src = '.../bdd100k'

image_src = os.path.join(gdrive_src, 'images', 'seg_track_20', 'train')
mask_src = os.path.join(gdrive_src, 'labels', 'seg_track_20', 'bitmasks', 'train')


batch_size = 1


@pipeline_def(seed=0000)
def image_decoder_pipeline():
    """Create a pipeline which reads images and masks, decodes the images and returns them."""
    img_files, _ = fn.readers.file(file_root=image_src)
    mask_files, _ = fn.readers.file(file_root=mask_src)
    images = fn.decoders.image(
        img_files, device="mixed", output_type=types.DALIImageType.RGB)
    masks = fn.decoders.image(
        mask_files, device="mixed", output_type=types.DALIImageType.GRAY)
    return images, masks


pipe = image_decoder_pipeline(
    batch_size=batch_size, num_threads=2, device_id=0)

dali_iter = DALIGenericIterator(pipe, ['image', 'mask'])


for i, data in enumerate(dali_iter):
    start = time.time()
    image = data[0]['image']  # <---- a.jpg
    mask = data[0]['mask']    # <---- will this be from a.png (same name as the source image) as well, or a different file due to async fetching from drive?
    end = time.time()
    print("with-dali :", end - start)

— mritunjaymusale, Feb 23 '22

Hi @mritunjaymusale !

If I understood correctly, your images and masks have the same name but different paths, so let's say:

//gdrive/images/seg_track_20/train/0001.jpg
//gdrive/labels/seg_track_20/bitmasks/train/0001.jpg

When file_root is used without the file_list or files arguments, DALI traverses the directory specified in file_root beforehand to discover files: it builds the list of files inside and then reads them. Whether the mounting and fetching is async does not matter in this case, since DALI proceeds only once it has read the whole batch.
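The discovery step can be sketched in plain Python (a rough approximation only, not DALI's actual implementation — the real reader additionally supports extension filters, sharding, and shuffling):

```python
import os

def discover_files(file_root):
    """Walk file_root and return a deterministic, sorted list of
    relative file paths, roughly mimicking DALI's file discovery."""
    found = []
    for dirpath, _, filenames in os.walk(file_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), file_root)
            found.append(rel)
    return sorted(found)
```

Comparing the stems of `discover_files(image_src)` and `discover_files(mask_src)` is a quick way to confirm that both roots contain the same samples.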

One thing to pay attention to is that you want files to be read in the same order from both file_root paths. To do so, pass the same seed to both readers:

    myseed = 42

    img_files, _ = fn.readers.file(file_root=image_src, seed=myseed)
    mask_files, _ = fn.readers.file(file_root=mask_src, seed=myseed)

Keep in mind that if there is any discrepancy between these two paths, even the tiniest (e.g. one file missing), the order will break and images and masks will be misaligned. To alleviate this issue, you might consider using the WebDataset format to keep each image and its mask together at all times.
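Packing each image/mask pair into one record can be done with the standard tarfile module (a minimal sketch; the extensions and the rule of matching by shared basename follow the WebDataset convention, and the helper name is ours):

```python
import os
import tarfile

def pack_pairs(image_dir, mask_dir, out_tar):
    """Write image/mask pairs that share a basename into a single tar,
    WebDataset-style: 0001.jpg and 0001.png become one sample '0001'."""
    stems = sorted(os.path.splitext(f)[0] for f in os.listdir(image_dir))
    with tarfile.open(out_tar, "w") as tar:
        for stem in stems:
            # Adjacent entries in the tar guarantee the pair travels together.
            tar.add(os.path.join(image_dir, stem + ".jpg"), arcname=stem + ".jpg")
            tar.add(os.path.join(mask_dir, stem + ".png"), arcname=stem + ".png")
```

The resulting tar can then be read with DALI's webdataset reader, and a missing mask surfaces immediately as a packing error instead of silently misaligning the dataset.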

In case you have any more questions, don't hesitate to ask further :)

— szalpal, Feb 23 '22

@mritunjaymusale DALI doesn't take file availability in a network source into account, so if you happen to know the order in which the files will become available, your best shot is to provide the list of files explicitly. This is not ideal, however, since the order will be totally fixed unless you recreate the pipeline in each epoch (which might not be as expensive as it sounds, especially with the latest releases). You can also use external_source and read the data yourself, e.g. with numpy.fromfile.
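A sketch of the external_source route: the callback below is plain Python and would be passed to fn.external_source (with batch=True, DALI invokes such a callback with the iteration index). The file layout, dtype, and fixed shape are assumptions for illustration:

```python
import numpy as np

def make_batch_callback(files, batch_size, shape, dtype=np.uint8):
    """Create a callback usable with fn.external_source: given a batch
    index, read batch_size raw files in a fixed order via numpy.fromfile."""
    def callback(batch_idx):
        start = batch_idx * batch_size
        batch = []
        for path in files[start:start + batch_size]:
            # Raw read: you control the order, not the reader.
            arr = np.fromfile(path, dtype=dtype).reshape(shape)
            batch.append(arr)
        return batch
    return callback
```

Because the file list is yours, you can reshuffle it between epochs without recreating the pipeline.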

— mzient, Feb 23 '22

Turns out my training folder was inconsistent with the number of mask files I had, so after a quick diff-and-delete between the folders I got them consistent. Regarding the seed: is @pipeline_def(seed=0000) sufficient to apply a seed to my readers, or do I need to set the seeds explicitly? Regarding the order, I have stuck with the implementation shown above, since doing it that way forces Colab to fetch the files and cache them locally on the Colab VM. I tried @mzient's files parameter approach, but that causes a file error (I forgot its exact name) which basically said the file wasn't present at that location (this is alleviated by just using file_root, since the iterator goes through all the files beforehand, which puts them in the Colab VM's cache).
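The diff-and-delete cleanup mentioned above can be scripted. This is a hypothetical sketch (function name and matching-by-basename rule are ours), with a dry-run mode so nothing is deleted until you've inspected the list:

```python
import os

def remove_unmatched_masks(image_dir, mask_dir, dry_run=True):
    """Delete mask files whose basename has no counterpart in image_dir.
    With dry_run=True, only report what would be removed."""
    image_stems = {os.path.splitext(f)[0] for f in os.listdir(image_dir)}
    removed = []
    for f in sorted(os.listdir(mask_dir)):
        if os.path.splitext(f)[0] not in image_stems:
            removed.append(f)
            if not dry_run:
                os.remove(os.path.join(mask_dir, f))
    return removed
```

Running it once with dry_run=True and once in each direction (images vs. masks) confirms both folders contain exactly the same samples.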

— mritunjaymusale, Feb 24 '22

@mritunjaymusale No, the seed set at the pipeline level is a meta-seed, used by a generator which provides all operator-level seeds. It guarantees repeatability across runs, but different operators will get different seeds. To make the operators traverse the directories in the same order, you have to pass the seed to the readers.

— mzient, Feb 25 '22