Batched filter inputs?

Open stas00 opened this issue 7 months ago • 6 comments

This is a very cool library! Kudos to the authors!

The Filter API seems to work with only a single item at a time.

Is there a way to filter in batches? Say you're using a filter that runs ML model inference. It'd be much more efficient to run inference on large batches than on 1 item at a time.

I looked through the examples and the code in case I had missed it, but I couldn't find any indication that batched input is supported.

I think the API would be similar to the HF Tokenizer, which takes batches and returns batches, so here instead of returning a bool it'd return a list of bools. If the input is a single sample, return a single bool; if it's a list, return a list.
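To make the idea concrete, here is a rough sketch of the calling convention I have in mind. None of these names exist in datatrove: `BatchedFilter`, `predict_batch`, and `keep_long_texts` are made up for illustration, and the length check is just a stand-in for a real batched model call.

```python
from typing import Callable, List, Sequence, Union

# Hypothetical sketch only: this is NOT datatrove's API.
# A sample is plain text here; a real pipeline would pass document objects.
Sample = str
BatchPredicate = Callable[[Sequence[Sample]], List[bool]]


class BatchedFilter:
    """Wraps a batch-level predicate and accepts one sample or a list of samples."""

    def __init__(self, predict_batch: BatchPredicate):
        # predict_batch stands in for one batched model inference call.
        self.predict_batch = predict_batch

    def filter(self, samples: Union[Sample, Sequence[Sample]]) -> Union[bool, List[bool]]:
        if isinstance(samples, str):
            # Single sample in, single bool out.
            return self.predict_batch([samples])[0]
        # List in, list of bools out (one per sample, same order).
        return self.predict_batch(list(samples))


def keep_long_texts(batch: Sequence[Sample]) -> List[bool]:
    # Placeholder for a model: keep texts with at least 10 characters.
    return [len(text) >= 10 for text in batch]


length_filter = BatchedFilter(keep_long_texts)
single_result = length_filter.filter("short")
batch_result = length_filter.filter(["short", "a much longer text"])
```

With this shape, existing single-item callers keep working, while a pipeline that can buffer samples gets one model call per batch.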

Thanks a lot!

stas00, Jul 04 '24 00:07