Quentin Lhoest
Thanks ! Hopefully this can be useful to others, and also to better understand and improve hashing/caching
Thanks for investigating ! Does that mean that `save_pretrained()` produces non-deterministic tokenizers on disk ? Or is it `from_pretrained()` which is not deterministic given the same files on disk ?...
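One way to tell the two cases apart is to save twice and compare the files byte for byte. The sketch below is only an illustration: `directory_digest` and `differing_files` are hypothetical helper names, and the two directories are assumed to come from two separate saves of the same tokenizer.

```python
import hashlib
from pathlib import Path


def directory_digest(root: str) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 of its bytes."""
    base = Path(root)
    return {
        path.relative_to(base).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(base.rglob("*"))
        if path.is_file()
    }


def differing_files(dir_a: str, dir_b: str) -> list[str]:
    """Return relative paths whose content differs or that exist on one side only."""
    a, b = directory_digest(dir_a), directory_digest(dir_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

If `differing_files` returns an empty list for two saves, the files on disk are identical and any remaining non-determinism would have to come from loading.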
> But Trie is a simple class object, so afaik its hash function is linked to its id(self) so basically where it's stored in memory, so super highly non deterministic....
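To illustrate the point in the quote, here is a minimal sketch of identity-based versus content-based hashing in CPython. `Node` and `content_fingerprint` are hypothetical names made up for this example; this does not reproduce how `datasets` computes fingerprints, it only shows why a hash tied to the object's identity cannot be stable.

```python
import hashlib
import json


class Node:
    """Hypothetical stand-in for a plain class that defines no __hash__ or __eq__."""

    def __init__(self, data: dict):
        self.data = data


def content_fingerprint(node: Node) -> str:
    """Hash the object's content rather than its identity."""
    payload = json.dumps(node.data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = Node({"h": {"i": {}}})
b = Node({"h": {"i": {}}})

# The default hash() is derived from the object's identity in CPython,
# so two live objects with equal content still hash differently.
identity_hashes_differ = hash(a) != hash(b)

# A content-based fingerprint agrees whenever the content is equal,
# regardless of where the objects live in memory.
fingerprints_match = content_fingerprint(a) == content_fingerprint(b)
```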
> Just to confirm: we should add this metadata via GitHub and not Hub PRs for canonical datasets right?

yes :)
Some CI failures are unrelated to your PR and fixed on master, feel free to merge master into your branch :)
Hi ! Thanks for reporting this issue with `wikicorpus`, we implemented a fix in https://github.com/huggingface/datasets/pull/2844
+1 for keeping `token`: the only case the `use_auth_token` semantic is useful is when passing `False`, which is definitely not the common case anyway - therefore renaming the parameter is...
For datasets, we could encourage in the error message to archive the non-LFS files together (ZIP or TAR for example). Archives would be LFS files so it should be fine.
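As a rough sketch of what such an error message could point users to, the snippet below bundles every file under a directory into a single ZIP using only the standard library. `archive_files` is a hypothetical helper name, not an existing `datasets` or `huggingface_hub` function.

```python
import zipfile
from pathlib import Path


def archive_files(source_dir: str, archive_path: str) -> list[str]:
    """Write every file under source_dir into one ZIP and return the stored names."""
    base = Path(source_dir)
    archive = Path(archive_path).resolve()
    names = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(base.rglob("*")):
            # Skip directories and the archive itself if it lives inside source_dir.
            if path.is_file() and path.resolve() != archive:
                arcname = path.relative_to(base).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names
```

For example, `archive_files("data", "data.zip")` would replace many small files with one archive to upload.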
cc @adrinjalali @osanseviero do you have an opinion about whether this class should be in `huggingface_hub` or in a separate package `hffs` ?
Given that it requires an extra dependency `fsspec` and for consistency with the other filesystems contributed by the community, I'd be down to create the `hffs` package