Quentin Lhoest
OK, the benchmark is great; not sure why it doesn't speed up the index in your case though. You can try running the benchmark with the same settings as your...
The dataset is empty, as far as I can tell: there are no files in the repository at https://huggingface.co/datasets/yanekyuk/wikikey/tree/main. Maybe the viewer can display a better message for empty datasets.
I opened a PR to add "tags" to the YAML validator: https://github.com/huggingface/datasets/pull/4716 I also added "tags" to the [tagging app](https://huggingface.co/spaces/huggingface/datasets-tagging), with suggestions like "bio" or "newspapers".
I think they're not displayed, but at least it should enable users to filter by tag using `huggingface_hub` or using the appropriate query params on the website (not sure...
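For reference, here's a minimal sketch of filtering datasets by tag with `huggingface_hub` (the exact filter string is an assumption; the tag value is just an example):

```python
from huggingface_hub import HfApi

api = HfApi()

# List datasets whose tags include "bio" (assuming plain tag strings
# are accepted by the `filter` argument; the exact syntax may differ)
for ds in api.list_datasets(filter="bio", limit=5):
    print(ds.id, ds.tags)
```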
Hi! It looks like a bug in `pyarrow`. If you manage to end up with only one chunk per parquet file it should work around this issue. To achieve that...
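To illustrate, here is a sketch of rewriting a parquet file so that the resulting table has a single chunk (the file names are placeholders):

```python
import pyarrow.parquet as pq

# Read the parquet file and merge all record batches into one chunk
table = pq.read_table("data.parquet").combine_chunks()

# Write it back as a single row group so readers see one chunk
pq.write_table(table, "data_single_chunk.parquet", row_group_size=len(table))
```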
Actually this is probably linked to this open issue: https://issues.apache.org/jira/browse/ARROW-5030. Setting `max_shard_size="2GB"` should do the job (or `max_shard_size="1GB"` if you want to be on the safe side, especially given that...
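For example, re-uploading the dataset with smaller shards would look something like this (a sketch; the repo id is a placeholder):

```python
from datasets import load_dataset

ds = load_dataset("user/my_dataset", split="train")

# Cap each parquet shard at 1GB to stay clear of the pyarrow limit
ds.push_to_hub("user/my_dataset", max_shard_size="1GB")
```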
I think this is because the tokenizer is stateful and because the order in which the splits are processed is not deterministic. Because of that, the hash of the tokenizer...
Actually this is not because of the order of the splits, but most likely because the tokenizer used to process the second split is in a state that has been...
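A quick way to see this is to hash the tokenizer before and after using it, e.g. with the `Hasher` from `datasets` (a sketch; the checkpoint name is just an example):

```python
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

h_before = Hasher.hash(tokenizer)
tokenizer("some text")  # calling the tokenizer can mutate its internal state
h_after = Hasher.hash(tokenizer)

# If the state changed, the hashes differ and the `map` cache is invalidated
print(h_before == h_after)
```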
Sorry, I didn't have the bandwidth to take care of this yet - I will re-assign when I'm diving into it again!
Not sure why the second one would work and not the first one - they're basically the same with respect to hashing. In both cases the function is hashed recursively,...
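As an illustration of the recursive hashing (a sketch using the `Hasher` from `datasets`; the function and its captured `config` are made up):

```python
from datasets.fingerprint import Hasher

config = {"max_length": 128}

def preprocess(example):
    # `config` is captured from the outer scope, so it gets hashed too
    return {"text": example["text"][: config["max_length"]]}

h1 = Hasher.hash(preprocess)
config["max_length"] = 256  # mutating a captured object...
h2 = Hasher.hash(preprocess)
print(h1 == h2)  # ...changes the function's hash: False
```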