
ops.readers.Numpy how to return the filename

Open rachelglenn opened this issue 11 months ago • 5 comments

Describe the question.

Hi, is there a way to return the filename of the loaded data, or some other way to get that information, when shuffle is turned on? Thanks in advance.

Check for duplicates

  • [x] I have searched the open bugs/issues and have found no duplicates for this bug report

rachelglenn avatar Jan 17 '25 13:01 rachelglenn

Hi @rachelglenn,

Thank you for reaching out. Have you tried calling the source_info method on a sample in the output batch?

o = pipe.run()
print(o[0][0].source_info())
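
With shuffling enabled, the order of samples changes from batch to batch, so it can help to collect the source info for every sample in an output rather than only the first one. A minimal sketch, assuming the pipeline output is a DALI tensor list that supports `len()` and indexing and that each element exposes `source_info()` as in the snippet above; `batch_source_info` is a hypothetical helper name, not part of DALI:

```python
def batch_source_info(tensor_list):
    """Return the source info (typically the file path) of every sample.

    `tensor_list` is assumed to be one output of `pipe.run()`, i.e. an
    object that supports len() and integer indexing, where each element
    has a `source_info()` method.
    """
    return [tensor_list[i].source_info() for i in range(len(tensor_list))]
```

It would be called as `batch_source_info(pipe.run()[0])`, which returns the entries in the same order as the samples in that batch.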

JanuszL avatar Jan 17 '25 13:01 JanuszL

I build my pipeline with a graph and have tried both DALIGenericIterator and DALIRaggedIterator as the iterator, but I am not able to get the filename. I have tried following this issue:

```python
import nvidia.dali as dali
from nvidia.dali.plugin.pytorch import DALIGenericIterator


# Define the DALI pipeline
class NumpyReaderPipeline(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, files, seed, shuffle, shard_id, num_shards):
        super(NumpyReaderPipeline, self).__init__(batch_size, num_threads, device_id)
        self.files = files
        self.seed = seed
        self.shuffle = shuffle
        self.shard_id = shard_id
        self.num_shards = num_shards

        # Define the Numpy reader operator
        self.reader = dali.ops.readers.Numpy(
            seed=self.seed,
            files=self.files,
            device="cpu",
            read_ahead=True,
            shard_id=self.shard_id,
            pad_last_batch=True,
            num_shards=self.num_shards,
            dont_use_mmap=True,
            shuffle_after_epoch=self.shuffle,
        )

    def define_graph(self):
        # Get the data from the reader
        data = self.reader()
        # Get the source_info for filenames (runtime property);
        # the property name is passed through the `key` keyword argument
        source_info = dali.fn.get_property(data, key="source_info")
        return data, source_info


# Define input parameters
files = ["file1.npy", "file2.npy", "file3.npy"]  # Example file paths
batch_size = 2
num_threads = 2
device_id = 0
seed = 42
shuffle = True
shard_id = 0
num_shards = 1

# Create the pipeline
pipe = NumpyReaderPipeline(
    batch_size=batch_size,
    num_threads=num_threads,
    device_id=device_id,
    files=files,
    seed=seed,
    shuffle=shuffle,
    shard_id=shard_id,
    num_shards=num_shards,
)

# Build the pipeline
pipe.build()

# Create a DALI Generic Iterator
iterator = DALIGenericIterator([pipe], output_map=["data", "source_info"], auto_reset=True)

# Run the pipeline and print the data and source info (filenames)
for data in iterator:
    images = data[0]["data"]  # Get the image data from the iterator
    source_info = data[0]["source_info"]  # Get the filenames (source info)

    print("Batch of images:", images)
    print("Source info (filenames):", source_info)
```

rachelglenn avatar Jan 19 '25 13:01 rachelglenn

I'm afraid the method I mentioned is a property of the DALI tensor, not of the Torch tensor that the iterator returns. Another solution you can test in this case is to use the external source operator and, for each file read, return a unique numerical ID that can be mapped back to the file name.
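
One way this could look is sketched below; it is an illustration under stated assumptions, not a tested recipe. `FileIdSource` and `build_pipeline` are hypothetical names. The callable assumes the external source is used in per-sample mode (`batch=False`), where it receives an object exposing `idx_in_epoch`, `iteration` and `epoch_idx`, and it returns the loaded array together with the index of the file in `files`. Shuffling is done inside the callable with a per-epoch permutation, and a trailing partial batch is simply dropped.

```python
import numpy as np


class FileIdSource:
    """Per-sample callable returning (array, file ID); hypothetical helper."""

    def __init__(self, files, batch_size, seed=42):
        self.files = list(files)
        self.seed = seed
        # Only full batches are produced; a trailing partial batch is dropped.
        self.full_iterations = len(self.files) // batch_size
        self.perm = None
        self.last_seen_epoch = None

    def __call__(self, sample_info):
        # sample_info is assumed to expose idx_in_epoch, iteration, epoch_idx.
        if sample_info.iteration >= self.full_iterations:
            raise StopIteration  # signals the end of the epoch
        if self.last_seen_epoch != sample_info.epoch_idx:
            # Draw a new shuffled order once per epoch.
            self.last_seen_epoch = sample_info.epoch_idx
            rng = np.random.default_rng(self.seed + sample_info.epoch_idx)
            self.perm = rng.permutation(len(self.files))
        file_id = int(self.perm[sample_info.idx_in_epoch])
        data = np.load(self.files[file_id])
        # The ID is a plain numeric tensor, so it survives the Torch iterator.
        return data, np.array([file_id], dtype=np.int64)


def build_pipeline(files, batch_size, num_threads=2, device_id=0):
    # Imported lazily so FileIdSource can be exercised without DALI installed.
    from nvidia.dali import fn, pipeline_def

    @pipeline_def(batch_size=batch_size, num_threads=num_threads, device_id=device_id)
    def file_id_pipeline():
        data, file_id = fn.external_source(
            source=FileIdSource(files, batch_size), num_outputs=2, batch=False
        )
        return data, file_id

    return file_id_pipeline()
```

If the pipeline is wrapped in DALIGenericIterator with `output_map=["data", "file_id"]`, each ID in a batch should map back to its name via `files[int(i)]`. Note that all arrays in a batch need the same shape for that iterator to stack them.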

JanuszL avatar Jan 19 '25 20:01 JanuszL

Thank you for the help and the possible workaround. Do you know of any examples that use the external source operator?

rachelglenn avatar Jan 19 '25 23:01 rachelglenn

@rachelglenn have you checked this example?

JanuszL avatar Jan 20 '25 07:01 JanuszL