
[FEATURE] Remove one or multiple samples from dataset

Open William1Wu opened this issue 2 years ago • 7 comments

Hi Authors,

Is there a 'remove' API for removing one or more samples from a dataset, similar to the 'append' API? Its input could be either a specific tensor value or a tensor index. Thanks!

William1Wu avatar Mar 08 '22 06:03 William1Wu

Thanks @William1Wu for raising the feature request. Given the current data layout, we have filtering and dataset views that you can use to hide specific elements from the dataset.
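To illustrate what "hiding" samples through a view means, here is a minimal plain-Python sketch. It does not use the library; `IndexView` and `filter_view` are hypothetical names invented for this illustration, and the real dataset views may work differently internally. The point is only that a view keeps a list of indices into the untouched underlying data.

```python
from typing import Any, Callable, List, Sequence


class IndexView:
    """Toy stand-in for a dataset view (hypothetical, not the library's class).

    It exposes a subset of rows by index and never deletes anything
    from the underlying data.
    """

    def __init__(self, data: Sequence[Any], indices: List[int]) -> None:
        self._data = data
        self._indices = indices

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int) -> Any:
        # Translate the view position into a position in the original data.
        return self._data[self._indices[i]]


def filter_view(data: Sequence[Any], keep: Callable[[Any], bool]) -> IndexView:
    """Build a view containing only the samples for which keep(sample) is true."""
    return IndexView(data, [i for i, sample in enumerate(data) if keep(sample)])


samples = [
    {"label": "cat"},
    {"label": "dog"},
    {"label": "corrupt"},
    {"label": "cat"},
]

# Hide the unwanted sample; the original list is left as it was.
view = filter_view(samples, lambda s: s["label"] != "corrupt")
visible_labels = [view[i]["label"] for i in range(len(view))]
```

The hidden sample is still stored, so this is hiding rather than removal, which is why a dedicated remove API is a separate request.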

We are planning to add removal to the roadmap and will keep you posted on the progress.

davidbuniat avatar Mar 08 '22 21:03 davidbuniat

@davidbuniat Thanks for your reply. Looking forward to the new features.

William1Wu avatar Mar 09 '22 01:03 William1Wu

@William1Wu thank you for posting this feature request. I was also looking for this feature.

@davidbuniat maybe instead of removing data, would it be possible to commit a specific dataset view? And basically when checking out this specific commit, the dataset would be the one from the committed dataset view?

LucasVandroux avatar Mar 14 '22 12:03 LucasVandroux

@William1Wu not sure if this is what you are trying to do, but it seems that when transforming an existing dataset with parallel computing (as described in this Tutorial), you can remove samples by simply not appending them to sample_out. The change stays consistent between commits, meaning you can get the removed samples back by checking out a previous commit, for example.

I used this method to remove some samples on a specific branch in one of my datasets.
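The pattern can be sketched in plain Python, as a rough illustration only. The real transform functions run against dataset objects and in parallel; here `run_transform` is a hypothetical sequential stand-in for the transform machinery, and `drop_labels` and `labels_to_drop` are names made up for the example. Only the `sample_in` / `sample_out` shape mirrors the description above.

```python
from typing import Any, Callable, Dict, Iterable, List, Set

Sample = Dict[str, Any]


def drop_labels(sample_in: Sample, sample_out: List[Sample],
                labels_to_drop: Set[str]) -> None:
    """Copy sample_in to sample_out unless its label is marked for removal."""
    if sample_in["label"] in labels_to_drop:
        return  # Not appending is what "removes" the sample.
    sample_out.append(sample_in)


def run_transform(fn: Callable[..., None], source: Iterable[Sample],
                  **kwargs: Any) -> List[Sample]:
    """Hypothetical sequential driver: applies fn to every source sample."""
    out: List[Sample] = []
    for sample in source:
        fn(sample, out, **kwargs)
    return out


source = [
    {"label": "cat"},
    {"label": "corrupt"},
    {"label": "dog"},
    {"label": "cat"},
]

# The output is a new collection; the source is not modified.
result = run_transform(drop_labels, source, labels_to_drop={"corrupt"})
```

Note that the output is built from scratch rather than edited in place, so the source data remains available wherever it was stored before.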

@davidbuniat not sure if this behavior is intended or if I am just abusing something created for a different use.

LucasVandroux avatar Mar 17 '22 00:03 LucasVandroux

@LucasVandroux oh, that's an interesting use of @hub.compute transform functions instead of running .filter() with a user-defined function. I guess it works! :) (Though there are a lot of speed optimizations that could potentially be done behind the scenes.)

davidbuniat avatar Mar 17 '22 00:03 davidbuniat

@davidbuniat the main difference is that the changes can be committed. However, the history in the diff doesn't seem to match: it basically says that new images were added to the dataset (every image except the ones removed).

I guess having a remove API would definitely be a cleaner and more straightforward way to do this.

LucasVandroux avatar Mar 17 '22 00:03 LucasVandroux

Yes, you are right, I missed that: transforms essentially create a copy of the tensors.

working on the remove!

davidbuniat avatar Mar 17 '22 03:03 davidbuniat