Quentin Lhoest

Results 416 comments of Quentin Lhoest

Thanks ! Hopefully this can be useful to others, and also to better understand and improve hashing/caching

Thanks for investigating ! Does that mean that `save_pretrained()` produces non-deterministic tokenizers on disk ? Or is it `from_pretrained()` which is not deterministic given the same files on disk ?...

> But Trie is a simple class object, so afaik its hash function is linked to its id(self) so basically where it's stored in memory, so super highly non deterministic....
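
To illustrate the point in the quote, here is a minimal sketch in plain Python. The `Trie` class below is a hypothetical stand-in (not the real tokenizer class), and `content_fingerprint` is an illustrative helper, not how `datasets` actually computes its fingerprints. It only shows why an identity-based hash cannot be used for caching while a hash of the object's state can.

```python
import hashlib
import json


class Trie:
    """Hypothetical stand-in for a plain class that does not define __hash__."""

    def __init__(self, tokens):
        self.tokens = sorted(tokens)


def content_fingerprint(obj):
    """Hash the object's state rather than its identity (illustrative helper)."""
    state = json.dumps(vars(obj), sort_keys=True).encode("utf-8")
    return hashlib.sha256(state).hexdigest()


a = Trie(["[CLS]", "[SEP]"])
b = Trie(["[SEP]", "[CLS]"])

# In CPython the default hash is derived from id(self): two objects with the
# same content hash differently, and the values generally change between runs.
identity_hashes_differ = hash(a) != hash(b)

# A content-based fingerprint depends only on the state, so it is reproducible.
fingerprints_match = content_fingerprint(a) == content_fingerprint(b)
```

Any caching scheme that ends up calling the default `hash()` on such an object would therefore see a different value in every session, even when nothing meaningful has changed.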

> Just to confirm: we should add this metadata via GitHub and not Hub PRs for canonical datasets right?

yes :)

Some CI failures are unrelated to your PR and are already fixed on master, feel free to merge master into your branch :)

Hi ! Thanks for reporting this issue with `wikicorpus`, we implemented a fix in https://github.com/huggingface/datasets/pull/2844

+1 for keeping `token`: the only case where the `use_auth_token` semantic is useful is when passing `False`, which is definitely not the common case anyway - therefore renaming the parameter is...

For datasets, the error message could encourage users to archive the non-LFS files together (ZIP or TAR for example). Archives would be LFS files so it should be fine
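
A minimal sketch of that archiving step, using only the standard library. The helper name `archive_small_files` and the file names are made up for the example, and it assumes the repository's `.gitattributes` tracks `*.zip` with LFS.

```python
import tempfile
import zipfile
from pathlib import Path


def archive_small_files(src_dir, archive_path):
    """Pack every regular file under src_dir into a single ZIP archive."""
    src_dir = Path(src_dir)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive is relocatable.
                zf.write(path, arcname=path.relative_to(src_dir).as_posix())
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    # Create a few small files that would otherwise be committed individually.
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    for i in range(3):
        (data_dir / f"sample_{i}.txt").write_text(f"example {i}\n")

    archive = archive_small_files(data_dir, Path(tmp) / "data.zip")
    with zipfile.ZipFile(archive) as zf:
        archived_names = sorted(zf.namelist())
```

The repository then holds one archive instead of many small regular git files; `tarfile` from the standard library would work the same way for TAR.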

cc @adrinjalali @osanseviero do you have an opinion about whether this class should be in `huggingface_hub` or in a separate package `hffs` ?

Given that it requires an extra dependency `fsspec` and for consistency with the other filesystems contributed by the community, I'd be down to create the `hffs` package