
Resume of data conversion?

Open huxuan opened this issue 1 year ago • 3 comments

🚀 Feature Request

Motivation

When converting a huge dataset, I would like to be able to resume the conversion process if it fails partway through.

[Optional] Implementation

Additional context

huxuan avatar Jul 30 '24 03:07 huxuan

@karan6181 @knighton this seems like an interesting feature, I would like to work on this

abhijithneilabraham avatar Aug 01 '24 17:08 abhijithneilabraham

@huxuan I attempted to solve the issue, but here are the major blockers:

  • To resume from a particular shard after an interruption, we need to know the total number of shards in order to mark completion; but since each shard is created dynamically, it is hard to estimate the total number of shards at the beginning of the script.
  • Resuming from a particular shard means splitting the data at that point, which might break atomicity, since the exact data split can be unclear.

I will raise an issue to at least add a feature for estimating the total number of shards that would be created, so that resuming from a particular shard index may become possible in the future.

abhijithneilabraham avatar Aug 03 '24 23:08 abhijithneilabraham

  • To resume from a particular shard after an interruption, we need to know the total number of shards in order to mark completion; but since each shard is created dynamically, it is hard to estimate the total number of shards at the beginning of the script.

In my current data packaging implementation, the data is provided by an IterableDataset, and a None value is used to indicate the end of the whole dataset.
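For context, here is a minimal sketch of that sentinel pattern; the queue-based producer/consumer shape and the names are illustrative, not huxuan's actual code:

```python
from queue import Queue

SENTINEL = None  # a None value marks the end of the whole dataset


def drain(q: Queue) -> list:
    """Consume samples from the queue until the sentinel arrives."""
    samples = []
    while True:
        sample = q.get()
        if sample is SENTINEL:
            return samples  # the dataset is exhausted
        samples.append(sample)
```

A producer iterating over the dataset would `put()` each sample and then `put(SENTINEL)` once, so the consumer knows completion without knowing the total count up front.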

  • Resuming from a particular shard means splitting the data at that point, which might break atomicity, since the exact data split can be unclear.

I have not looked deeply into the implementation of MDSWriter, but I suppose resuming should not change the original behavior. It should be fine to delete and regenerate the broken shard if possible.

huxuan avatar Aug 04 '24 01:08 huxuan

@huxuan @abhijithneilabraham This isn't currently on our roadmap, but if you have ideas for how to improve MDSWriter to make this functionality possible, that would be great! We always appreciate open-source contributions :)

snarayan21 avatar Sep 16 '24 14:09 snarayan21

For anyone who might be interested, here is the workaround I use for resuming:

  1. The data is split into subsets with a predefined chunk size, for example, 10,000 samples.
  2. The conversion script maintains the state of each subset.
  3. The script skips subsets whose state is "Done" and overwrites any subset that is not.

This works fine in my scenario with tens of millions of video clips.
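The three steps above could be sketched roughly like this; the state-file layout, the chunk naming, and the `convert_chunk` helper are all my own assumptions (in the real script, `convert_chunk` would presumably write each subset with its own MDSWriter):

```python
import json
from pathlib import Path

CHUNK_SIZE = 10_000  # predefined chunk size (step 1)


def load_state(state_file: Path) -> dict:
    """Per-chunk state, e.g. {"0": "Done", "1": "Failed"} (step 2)."""
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {}


def convert_chunk(samples: list, out_dir: Path) -> None:
    # Hypothetical stand-in for the real conversion, e.g. writing the
    # subset to its own subdirectory with a fresh MDSWriter.
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "data.txt").write_text("\n".join(map(str, samples)))


def convert_with_resume(samples: list, out_root: Path) -> None:
    state_file = out_root / "state.json"
    state = load_state(state_file)
    for start in range(0, len(samples), CHUNK_SIZE):
        chunk_id = str(start // CHUNK_SIZE)
        if state.get(chunk_id) == "Done":
            continue  # step 3: skip subsets already converted
        # Otherwise (re)convert the subset, overwriting any partial output.
        convert_chunk(samples[start:start + CHUNK_SIZE],
                      out_root / f"chunk_{chunk_id}")
        state[chunk_id] = "Done"
        state_file.write_text(json.dumps(state))  # persist after each chunk
```

Re-running `convert_with_resume` after a crash would then redo only the chunks not yet marked "Done".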

huxuan avatar Sep 19 '24 03:09 huxuan

As a newcomer to this package, I don't find @huxuan's workaround entirely clear. Would it be possible to share some pseudocode for it? In particular, it isn't clear to me how a single MDSWriter would work with the predefined splits.

tcwalther avatar Mar 27 '25 17:03 tcwalther