view-fusion Off-by-one Error in Data Shards (and Empty Shards)


Open mustious opened this issue 10 months ago • 0 comments

This PR improves the existing data sharding logic in view-fusion.

The existing logic has an off-by-one error in the number of shards. For example, the data sharding command python data/dataset_prep.py --sc 4 results in 5 shards, which is probably not what the user intended. The additional shards are also empty in some cases, which leads to errors in the distributed training process.

With this new logic, all shards except one have the same number of samples. For example:

split = "train"
split_dict_values_sum = 30661
shard_count = 4

train_shard_00, train_shard_01, train_shard_02 = 7665
train_shard_03 = 7666
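
The split above can be reproduced with a short sketch. This is not the code from the PR: the function names shard_sizes and shard_bounds are hypothetical, and it assumes the remainder of the integer division is assigned to the last shard, which matches the numbers in the example.

```python
def shard_sizes(total_samples: int, shard_count: int) -> list[int]:
    """Return one size per shard; the last shard absorbs the remainder.

    Hypothetical helper, assumed for illustration only.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    base, remainder = divmod(total_samples, shard_count)
    sizes = [base] * shard_count   # exactly shard_count entries, never one more
    sizes[-1] += remainder         # leftover samples go to the final shard
    return sizes


def shard_bounds(total_samples: int, shard_count: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) index ranges, one per shard."""
    bounds = []
    start = 0
    for size in shard_sizes(total_samples, shard_count):
        bounds.append((start, start + size))
        start += size
    return bounds


sizes = shard_sizes(30661, 4)
print(sizes)  # [7665, 7665, 7665, 7666]
```

With this approach the number of shards always equals shard_count. Shards can still be empty when there are fewer samples than shards, so a caller may want to cap shard_count at the number of samples.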

Apr 04 '24 15:04 mustious
