Quentin Lhoest
OK, the benchmark is great; not sure why it doesn't speed up the index in your case though. You can try running the benchmark with the same settings as your...
The dataset is empty, as far as I can tell: there are no files in the repository at https://huggingface.co/datasets/yanekyuk/wikikey/tree/main. Maybe the viewer can display a better message for empty datasets.
I opened a PR to add "tags" to the YAML validator: https://github.com/huggingface/datasets/pull/4716 I also added "tags" to the [tagging app](https://huggingface.co/spaces/huggingface/datasets-tagging), with suggestions like "bio" or "newspapers".
I think they're not displayed, but at least it should enable users to filter by tag using `huggingface_hub` or using the appropriate query params on the website (not sure...
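For reference, here's a minimal sketch of filtering datasets by tag with `huggingface_hub` (the exact filter string is an assumption; the tag value is just an example):

```python
from huggingface_hub import HfApi

api = HfApi()

# List datasets whose tags include "bio" (assuming plain tag strings
# are accepted by the `filter` argument; the exact syntax may differ)
for ds in api.list_datasets(filter="bio", limit=5):
    print(ds.id, ds.tags)
```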
Hi! It looks like a bug in `pyarrow`. If you manage to end up with only one chunk per parquet file it should work around this issue. To achieve that...
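To illustrate, here is a sketch of rewriting a parquet file so that the resulting table has a single chunk (the file names are placeholders):

```python
import pyarrow.parquet as pq

# Read the parquet file and merge all record batches into one chunk
table = pq.read_table("data.parquet").combine_chunks()

# Write it back as a single row group so readers see one chunk
pq.write_table(table, "data_single_chunk.parquet", row_group_size=len(table))
```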
Actually this is probably linked to this open issue: https://issues.apache.org/jira/browse/ARROW-5030. Setting `max_shard_size="2GB"` should do the job (or `max_shard_size="1GB"` if you want to be on the safe side, especially given that...
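For example, re-uploading the dataset with smaller shards would look something like this (a sketch; the repo id is a placeholder):

```python
from datasets import load_dataset

ds = load_dataset("user/my_dataset", split="train")

# Cap each parquet shard at 1GB to stay clear of the pyarrow limit
ds.push_to_hub("user/my_dataset", max_shard_size="1GB")
```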
I think this is because the tokenizer is stateful and because the order in which the splits are processed is not deterministic. Because of that, the hash of the tokenizer...
Actually this is not because of the order of the splits, but most likely because the tokenizer used to process the second split is in a state that has been...
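A quick way to see this is to hash the tokenizer before and after using it, e.g. with the `Hasher` from `datasets` (a sketch; the checkpoint name is just an example):

```python
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

h_before = Hasher.hash(tokenizer)
tokenizer("some text")  # calling the tokenizer can mutate its internal state
h_after = Hasher.hash(tokenizer)

# If the state changed, the hashes differ and the `map` cache is invalidated
print(h_before == h_after)
```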
Sorry, I didn't have the bandwidth to take care of this yet - I will re-assign when I'm diving into it again!
Not sure why the second one would work and not the first one - they're basically the same with respect to hashing. In both cases the function is hashed recursively,...
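As an illustration of the recursive hashing (a sketch using the `Hasher` from `datasets`; the function and its captured `config` are made up):

```python
from datasets.fingerprint import Hasher

config = {"max_length": 128}

def preprocess(example):
    # `config` is captured from the outer scope, so it gets hashed too
    return {"text": example["text"][: config["max_length"]]}

h1 = Hasher.hash(preprocess)
config["max_length"] = 256  # mutating a captured object...
h2 = Hasher.hash(preprocess)
print(h1 == h2)  # ...changes the function's hash: False
```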