DistributeFilesDataset with sharding: number of seqs seems incorrect

Open albertz opened this issue 5 months ago • 0 comments

See my recently added test_DistributeFilesDataset_sharding.

I expected that, at the end of the test, global_seq_idx == len(hdf_files) * num_seqs // distrib_size. But this is not the case.
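To make the expectation concrete, here is a minimal sketch of that arithmetic. The helper name and the example numbers are hypothetical (they are not taken from the test), and it assumes that every file holds the same number of seqs:

```python
def expected_seqs_per_worker(num_files: int, num_seqs_per_file: int, distrib_size: int) -> int:
    """Hypothetical helper: the count I expected global_seq_idx to reach.

    Mirrors len(hdf_files) * num_seqs // distrib_size. The multiplication is
    evaluated first, then the integer division.
    """
    return num_files * num_seqs_per_file // distrib_size


# Made-up numbers: 4 files with 10 seqs each, split over 2 workers.
print(expected_seqs_per_worker(num_files=4, num_seqs_per_file=10, distrib_size=2))
```

With these made-up numbers, each worker would be expected to see 20 seqs.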

When looking at DistributeFilesDataset.init_seq_order, I wonder about this code:

self_index_base = self.partition_epoch * self._shard_index
self_index_end = self_index_base + self.partition_epoch

The self_index_end here ignores self._num_shards. Is this correct?
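To illustrate what I mean, here is a small standalone sketch that only reproduces the arithmetic of those two lines. The function names are hypothetical, and it says nothing about what the resulting indices are later used for:

```python
from typing import List, Tuple


def index_range(partition_epoch: int, shard_index: int) -> Tuple[int, int]:
    """Same arithmetic as the two lines above, outside of the class."""
    index_base = partition_epoch * shard_index
    index_end = index_base + partition_epoch
    return index_base, index_end


def all_ranges(partition_epoch: int, num_shards: int) -> List[Tuple[int, int]]:
    """(base, end) pairs for every shard index in [0, num_shards)."""
    return [index_range(partition_epoch, s) for s in range(num_shards)]


# num_shards only determines how many ranges there are.
# It never enters the computation of a single range.
for num_shards in (2, 4):
    print(num_shards, all_ranges(partition_epoch=3, num_shards=num_shards))
```

Under this sketch, shard 0 and shard 1 get the same ranges no matter whether there are 2 or 4 shards, and all shards together span partition_epoch * num_shards indices. Whether that is intended depends on what the indices refer to.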

(cc @NeoLegends @Icemole @Judyxujj)

Jul 17 '25 09:07 albertz