Quentin Lhoest

Results 416 comments of Quentin Lhoest

I tested this tokenize function and indeed noticed a casting. However, it seems to only concern the `offset_mapping` field, which contains a list of tuples, that is converted to a...

Setting the output to `"np"` makes the whole pipeline fast because it moves the data buffers from Rust to Python to Arrow using zero-copy, and also because it does eliminate...

Cool ! Sure, feel free to follow these instructions to open a PR :) Thanks !

Hi ! Sorry for the late response. I agree `interleave_datasets` would benefit a lot from having more flexibility. If I understand correctly, it would be nice to be able to...

Hi ! That would be awesome to have them indeed, thanks for opening this issue. I just added you to the WMT org on the HF Hub if you're interested...

Hi ! I just re-ran a quick benchmark and using `to_numpy()` seems to be faster now: ```python import pyarrow as pa # I used pyarrow 3.0.0 import numpy as np...

I created https://github.com/huggingface/datasets/pull/2505 if you want to play with it @vblagoje

It looks like the data host doesn't support HTTP range requests, which are necessary to glob inside a ZIP archive in streaming mode. Can you try hosting the dataset elsewhere...
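For anyone who wants to check a host for range request support themselves, here is a minimal sketch using only the Python standard library. It is not what `datasets` does internally. The helper names `is_partial_response` and `host_supports_range` are hypothetical, and the check relies only on standard HTTP behavior: a server that honors a `Range` header answers with status `206 Partial Content` and a `Content-Range` header, while a server that ignores it typically answers `200` with the full body.

```python
import urllib.request


def is_partial_response(status, headers):
    """Return True if a reply to a `Range` request looks like a partial response.

    Hypothetical helper: `status` is the HTTP status code and `headers` is a
    mapping of response header names to values.
    """
    # Header names are case-insensitive, so normalize them before checking.
    names = {name.lower() for name in headers}
    # 206 Partial Content plus Content-Range means the range was honored.
    return status == 206 and "content-range" in names


def host_supports_range(url, timeout=10.0):
    """Request only the first byte of `url` and inspect the reply.

    Hypothetical helper: needs network access and raises on HTTP errors.
    """
    request = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_partial_response(response.status, dict(response.headers.items()))
```

Calling `host_supports_range(url)` on the URL of the ZIP archive should return `False` for a host that ignores the `Range` header, which would be consistent with the streaming failure described above.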

Related to this discussion: in https://github.com/huggingface/datasets/pull/3664#issuecomment-1031866858 I propose how we could change `iter_archive` to work for streaming and also return local paths (as it used to !). I'd love your...