view-fusion
Off-by-one Error in Data Shards (and Empty Shards)
This PR fixes the data-sharding logic in view-fusion.
The existing logic has an off-by-one error in the number of shards it produces. For example, the data sharding command
python data/dataset_prep.py --sc 4
produces 5 shards instead of the 4 requested. In some cases the additional shard is also empty, which causes errors in the distributed training process.
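For readers unfamiliar with this class of bug, here is a minimal sketch of one common way an extra shard can appear. It is a hypothetical illustration, not the code this PR replaces: `naive_shard_sizes` is an invented name, and stepping through the samples in chunks of `total // shard_count` is an assumed cause.

```python
def naive_shard_sizes(total_samples, shard_count):
    """Hypothetical buggy sharding: walk the data in fixed-size steps.

    Assumes total_samples >= shard_count so that the step is non-zero.
    """
    step = total_samples // shard_count
    # Whenever total_samples is not a multiple of shard_count, the range
    # yields one start offset more than shard_count, i.e. an extra shard.
    return [min(step, total_samples - start)
            for start in range(0, total_samples, step)]


sizes = naive_shard_sizes(30661, 4)
print(len(sizes))  # 5 shards, although 4 were requested
print(sizes[-1])   # the extra shard holds only the 1 leftover sample
```

In this sketch the leftover samples spill into a shard of their own instead of being folded into one of the requested shards.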
With the new logic, all shards except one have the same number of samples. For example:
split = "train"
split_dict_values_sum = 30661
shard_count = 4
train_shard_00, train_shard_01, train_shard_02 = 7665 samples each
train_shard_03 = 7666 samples
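The arithmetic in this example can be reproduced with a short sketch. It assumes the last shard absorbs the remainder of the integer division, which matches the numbers above but is not confirmed against the PR's code. The function name `shard_sizes` is hypothetical.

```python
def shard_sizes(total_samples: int, shard_count: int) -> list[int]:
    """Return the sample count of each shard, exactly shard_count entries."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    if total_samples < shard_count:
        # Fewer samples than shards would leave some shards empty.
        raise ValueError("not enough samples to fill every shard")
    base, remainder = divmod(total_samples, shard_count)
    sizes = [base] * shard_count
    # Assumption: the last shard takes the leftover samples.
    sizes[-1] += remainder
    return sizes


for index, size in enumerate(shard_sizes(30661, 4)):
    print(f"train_shard_{index:02d} = {size}")
# train_shard_00 = 7665
# train_shard_01 = 7665
# train_shard_02 = 7665
# train_shard_03 = 7666
```

Because the list is built with exactly `shard_count` entries and every entry is at least `base >= 1`, this sketch can produce neither an extra shard nor an empty one.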