Update items in the dataset without `map`

Open mashdragon opened this issue 8 months ago • 1 comment

Feature request

I would like to be able to update items in my dataset without touching every row. If there were at least a range option, I could process those items, save the dataset, and then continue.

If I am supposed to split the dataset first, that is not clear from the docs. They suggest that each of those functions returns a new object, so I don't think splitting would let me update the original dataset.

Motivation

I am applying an extremely time-consuming function to each item in my `Dataset`. Unfortunately, `datasets` only supports updating values via `map`, so if my computer dies in the middle of this long-running process, I lose all progress. This is far from ideal. I would like to use `datasets` throughout this processing, but this limitation is now forcing me to write my own dataset format just to do this intermediary operation.

It would be less intuitive, but I suppose I could split the dataset, process the pieces, and then concatenate them before saving? This feels very inefficient, though.

Your contribution

I can test the feature.

mashdragon avatar Apr 15 '25 19:04 mashdragon

Hello!

Have you looked at `Dataset.shard`? (Docs)

Using this method, you could break your dataset into N shards, apply `map` to each shard, and concatenate the results back together.

Dref360 avatar Apr 19 '25 18:04 Dref360