streaming
Estimate total shards at the beginning of data conversion
🚀 Feature Request
The number of shards that will be created, estimated from size_limit and the size of the data, would be a useful metric to report at the start of conversion.
Motivation
If other features, such as resuming an interrupted data conversion, are implemented in the future, they could be built on top of this estimate.
[Optional] Implementation
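A minimal sketch of the estimate, assuming the raw data lives in local files and that their combined size on disk roughly matches the serialized size. The helper names `estimate_shards_from_bytes` and `estimate_num_shards` are hypothetical and not part of the streaming library, and the 64 MiB default for `size_limit` is an assumption chosen for illustration. Serialization overhead or compression would shift the real shard count, so the result is only approximate.

```python
import os
from typing import Iterable


def estimate_shards_from_bytes(total_bytes: int, size_limit: int) -> int:
    """Return ceil(total_bytes / size_limit) using integer arithmetic."""
    if size_limit <= 0:
        raise ValueError("size_limit must be a positive number of bytes")
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    # Integer ceiling division avoids float rounding on very large datasets.
    return (total_bytes + size_limit - 1) // size_limit


def estimate_num_shards(paths: Iterable[str], size_limit: int = 1 << 26) -> int:
    """Estimate the shard count from the on-disk size of the raw files.

    Assumption: raw file size approximates serialized sample size.
    """
    total_bytes = sum(os.path.getsize(path) for path in paths)
    return estimate_shards_from_bytes(total_bytes, size_limit)
```

For example, 200 bytes of raw data with a 64-byte `size_limit` gives an estimate of 4 shards. This only works when the caller can enumerate the raw files up front; a dataset produced by a generator would need a different size source.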
Additional context
Hey @abhijithneilabraham thanks for this issue! How would you propose finding the dataset size ahead of time? MDSWriter currently has no knowledge of how large your raw dataset files are or how it is being used to iterate over your original dataset...