ops.readers.Numpy: how to return the filename
Describe the question.
Hi, is there a way to return the filename of the loaded data, or otherwise get that information, when shuffling is turned on? Thanks in advance.
Check for duplicates
- [x] I have searched the open bugs/issues and have found no duplicates for this bug report
Hi @rachelglenn,
Thank you for reaching out.
Have you tried the source_info method for the sample in the output batch?
```python
o = pipe.run()
print(o[0][0].source_info())
```
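For context, here is a minimal, untested sketch of how that fits together when you run the pipeline directly (the file names and pipeline parameters are placeholders):

```python
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn


@pipeline_def(batch_size=2, num_threads=2, device_id=0)
def numpy_pipe(files):
    # shuffle_after_epoch shuffles the files, but source_info still
    # reflects the file each sample was actually read from
    return fn.readers.numpy(device="cpu", files=files, shuffle_after_epoch=True)


pipe = numpy_pipe(files=["file1.npy", "file2.npy", "file3.npy"])
pipe.build()

out, = pipe.run()
for i in range(len(out)):
    print(out[i].source_info())  # file name of the i-th sample in the batch
```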
I build my pipeline with a graph and have tried using either DALIGenericIterator or DALIRaggedIterator as the iterator. I am not able to get the filename. I have tried following this issue:
```python
import nvidia.dali as dali
from nvidia.dali.plugin.pytorch import DALIGenericIterator


# Define the DALI pipeline
class NumpyReaderPipeline(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, files, seed,
                 shuffle, shard_id, num_shards):
        super(NumpyReaderPipeline, self).__init__(batch_size, num_threads, device_id)
        self.files = files
        self.seed = seed
        self.shuffle = shuffle
        self.shard_id = shard_id
        self.num_shards = num_shards

        # Define the Numpy reader operator
        self.reader = dali.ops.readers.Numpy(
            seed=self.seed,
            files=self.files,
            device="cpu",
            read_ahead=True,
            shard_id=self.shard_id,
            pad_last_batch=True,
            num_shards=self.num_shards,
            dont_use_mmap=True,
            shuffle_after_epoch=self.shuffle,
        )

    def define_graph(self):
        # Get the data from the reader
        data = self.reader()
        # Get the source_info for filenames (runtime property)
        source_info = dali.fn.get_property(data, key="source_info")
        return data, source_info


# Define input parameters
files = ["file1.npy", "file2.npy", "file3.npy"]  # Example file paths
batch_size = 2
num_threads = 2
device_id = 0
seed = 42
shuffle = True
shard_id = 0
num_shards = 1

# Create the pipeline
pipe = NumpyReaderPipeline(batch_size=batch_size, num_threads=num_threads,
                           device_id=device_id, files=files, seed=seed,
                           shuffle=shuffle, shard_id=shard_id,
                           num_shards=num_shards)

# Build the pipeline
pipe.build()

# Create a DALI Generic Iterator
iterator = DALIGenericIterator([pipe], output_map=["data", "source_info"],
                               auto_reset=True)

# Run the pipeline and print the data and source info (filenames)
for data in iterator:
    images = data[0]["data"]              # Get the image data from the iterator
    source_info = data[0]["source_info"]  # Get the filenames (source info)
    print("Batch of images:", images)
    print("Source info (filenames):", source_info)
```
I'm afraid the method mentioned is a property of the DALI tensor, not of the Torch tensor that the iterator returns.
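If you still want the name to reach the Python side through the iterator, one thing you could try is padding the get_property output and decoding it back into a string on the PyTorch side. A rough, untested sketch; it assumes get_property with key="source_info" returns the name as a 1D uint8 tensor, uses fn.pad so DALIGenericIterator can stack the batch densely, and the file list is a placeholder:

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

files = ["file1.npy", "file2.npy", "file3.npy"]  # placeholder paths


@pipeline_def(batch_size=2, num_threads=2, device_id=0)
def numpy_pipe_with_names():
    data = fn.readers.numpy(device="cpu", files=files,
                            shuffle_after_epoch=True, name="reader")
    # The name arrives as a variable-length 1D uint8 tensor per sample;
    # zero-pad it so the batch can be converted to a dense Torch tensor.
    source_info = fn.pad(fn.get_property(data, key="source_info"))
    return data, source_info


pipe = numpy_pipe_with_names()
pipe.build()
iterator = DALIGenericIterator([pipe], output_map=["data", "source_info"],
                               reader_name="reader")

for batch in iterator:
    raw = batch[0]["source_info"]  # torch uint8 tensor, shape (batch, max_len)
    names = [bytes(row.cpu().numpy()).decode("utf-8").rstrip("\x00")
             for row in raw]
    print(names)
```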
Another solution you can test in this case is to use the external_source operator and, for each file read, return a unique numerical ID that can be mapped back to the file name, as sketched below.
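A rough, untested sketch of that idea; the NumpyWithId callback, file list, and per-epoch shuffling are illustrative, not an existing DALI API:

```python
import numpy as np
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

file_list = ["file1.npy", "file2.npy", "file3.npy", "file4.npy"]  # example paths


class NumpyWithId:
    """Loads the .npy files itself, so it controls the shuffling and can
    return the file index alongside the data."""

    def __init__(self, files, seed=42):
        self.files = files
        self.seed = seed

    def __call__(self, sample_info):
        if sample_info.idx_in_epoch >= len(self.files):
            raise StopIteration()
        # Deterministic per-epoch shuffle of the file order
        rng = np.random.default_rng(self.seed + sample_info.epoch_idx)
        order = rng.permutation(len(self.files))
        file_id = int(order[sample_info.idx_in_epoch])
        data = np.load(self.files[file_id])
        return data, np.array([file_id], dtype=np.int32)


@pipeline_def(batch_size=2, num_threads=2, device_id=0)
def external_source_pipe():
    data, file_id = fn.external_source(source=NumpyWithId(file_list),
                                       num_outputs=2, batch=False)
    return data, file_id


pipe = external_source_pipe()
pipe.build()
iterator = DALIGenericIterator([pipe], output_map=["data", "file_id"])

for batch in iterator:
    ids = batch[0]["file_id"].flatten().cpu().numpy()
    names = [file_list[i] for i in ids]  # map the IDs back to file names
    print(names)
```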
Thank you for the help and the possible workaround. Do you know of any examples that use the external_source operator?
@rachelglenn have you checked this example?