datatrove
Batched filter inputs?
This is a very cool library! Kudos to the authors!
The Filter API seems to work on only a single item at a time.
Is there a way to filter in batches? Say you're using a filter that runs ML model inference. It would be much more efficient to run inference on large batches than on one item at a time.
I looked through the examples and the code in case I missed it, but I can't find any indication that batched input is supported.
The API I have in mind would be similar to the HF Tokenizer, which takes batches and returns batches: instead of returning a single bool, the filter would return a list of bools. If the input is a single sample, it would return a single bool; if the input is a list, it would return a list.
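To make the idea concrete, here is a rough sketch in plain Python. It does not use any datatrove classes, and every name in it (`score_batch`, `keep`, `batched`, `filter_stream`) is hypothetical; `score_batch` is a toy stand-in for a real model call. It only illustrates the "single sample in, bool out; list in, list of bools out" behaviour and how a stream could be chunked into batches.

```python
from typing import Iterable, Iterator, List, Sequence, Union


def score_batch(texts: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for an ML model that scores a whole batch in one call."""
    return [len(t.split()) / 10.0 for t in texts]


def keep(texts: Union[str, Sequence[str]], threshold: float = 0.3) -> Union[bool, List[bool]]:
    """Return a bool for a single text, or a list of bools for a list of texts."""
    if isinstance(texts, str):
        # Single sample: wrap it in a one-element batch and unwrap the result.
        return score_batch([texts])[0] >= threshold
    return [score >= threshold for score in score_batch(texts)]


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a stream of items into lists of at most batch_size elements."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def filter_stream(items: Iterable[str], batch_size: int = 2) -> Iterator[str]:
    """Yield only the items the batched filter keeps, preserving input order."""
    for batch in batched(items, batch_size):
        for item, ok in zip(batch, keep(batch)):
            if ok:
                yield item


docs = [
    "too short",
    "this one has enough words",
    "tiny",
    "another sufficiently long sample text",
]
kept = list(filter_stream(docs, batch_size=3))
```

With something like this, the model would be called once per batch rather than once per document, while the single-item call would still work as before.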
Thanks a lot!