Resume of data conversion?
🚀 Feature Request
Motivation
When converting a huge dataset, I would like to be able to resume the conversion process when it fails partway through.
[Optional] Implementation
Additional context
@karan6181 @knighton This seems like an interesting feature; I would like to work on it.
@huxuan I attempted to solve the issue, but here are the major blockers:
- To resume from a particular shard after an interruption, we need to know the total number of shards in order to mark completion, but since each shard is created dynamically, it is hard to estimate the total shard count at the beginning of the script.
- Resuming from a particular shard means splitting the data at that point, which might break atomicity, since the exact data split can be unclear.
I will raise an issue to at least add a feature that estimates the total number of shards to be created, so that in the future resuming from a particular shard index may become possible.
> To resume from a particular shard after an interruption, we need to know the total number of shards in order to mark completion, but since each shard is created dynamically, it is hard to estimate the total shard count at the beginning of the script.
In my current data-packaging implementation, the data is provided by an IterableDataset, and a None value is used to indicate the end of the whole dataset.
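For illustration, a minimal sketch of that setup, assuming a toy ClipDataset source and a hypothetical video/caption schema (only MDSWriter and its out/columns arguments come from the streaming package; everything else here is illustrative):

```python
from torch.utils.data import IterableDataset
from streaming import MDSWriter

class ClipDataset(IterableDataset):
    """Hypothetical source: yields sample dicts, then None as an end marker."""
    def __init__(self, clips):
        self.clips = clips

    def __iter__(self):
        yield from self.clips
        yield None  # sentinel indicating the end of the whole dataset

columns = {'video': 'bytes', 'caption': 'str'}  # hypothetical schema
clips = [{'video': b'\x00\x01', 'caption': 'a clip'}]  # toy stand-in data
with MDSWriter(out='./mds_out', columns=columns) as writer:
    for sample in ClipDataset(clips):
        if sample is None:
            break  # stop at the end-of-dataset sentinel
        writer.write(sample)
```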
> Resuming from a particular shard means splitting the data at that point, which might break atomicity, since the exact data split can be unclear.
I have not looked deeply into the implementation of MDSWriter, but I suppose resuming should not change its original behavior. It should be fine to delete and regenerate the broken shard if that is possible.
@huxuan @abhijithneilabraham This isn't currently on our roadmap, but if you have ideas for how we could improve MDSWriter to make this functionality possible, that would be great! We always appreciate open-source contributions :)
For anyone who might be interested, I use the following workaround for resuming:
- The data is split into subsets with a predefined chunk size, for example, 10,000 samples.
- The data conversion script maintains the state of each subset.
- The data conversion script skips subsets whose state is "Done" and overwrites any subset that is not.
This works fine in my scenario with tens of millions of video clips.
As a newcomer to this package, I don't find @huxuan's workaround quite clear. Would it be possible to have some pseudocode for it? It isn't clear to me how a single MDSWriter would work with the predefined splits.
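For what it's worth, here is a minimal sketch of the workaround as I read it: one MDSWriter per subset directory, with a marker file persisting each subset's state. The directory layout, schema, and `.done` marker convention are my assumptions, not part of the streaming API, and the sketch presumes the input samples arrive in the same order on every run:

```python
import os
import shutil

from streaming import MDSWriter

CHUNK_SIZE = 10_000  # the predefined chunk size from the workaround
COLUMNS = {'video': 'bytes', 'caption': 'str'}  # hypothetical schema

def convert_subset(samples, subset_dir):
    """Convert one subset; a '<subset_dir>.done' marker records its state."""
    done_marker = subset_dir + '.done'
    if os.path.exists(done_marker):
        return  # state is "Done": skip this subset
    # Not "Done": drop any partial output and overwrite the whole subset.
    shutil.rmtree(subset_dir, ignore_errors=True)
    with MDSWriter(out=subset_dir, columns=COLUMNS) as writer:
        for sample in samples:
            writer.write(sample)
    open(done_marker, 'w').close()  # mark "Done" only after a full write

def convert_all(sample_iter, out_root):
    """Group samples into fixed-size subsets and convert each in turn."""
    os.makedirs(out_root, exist_ok=True)
    subset, idx = [], 0
    for sample in sample_iter:
        subset.append(sample)
        if len(subset) == CHUNK_SIZE:
            convert_subset(subset, os.path.join(out_root, f'subset_{idx:05d}'))
            subset, idx = [], idx + 1
    if subset:  # final, possibly smaller, subset
        convert_subset(subset, os.path.join(out_root, f'subset_{idx:05d}'))
```

If I understand correctly, each subset directory then holds a self-contained MDS dataset, so the subsets can later be read together, for example as multiple streams of a single StreamingDataset.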