Inconsistencies between documentation and API
There are several issues with the documentation and the .pyi stub files:
- The documentation does not mention decoders at all.
- In the stub file for `Tokenizer`, all of the properties like `pre_tokenizer`, `post_processor` etc. are not marked as writable, resulting in warnings from the IDE. This is in contrast to `padding` and `truncation`, which are actually not writable (see the stub sketch after this list).
- The `PreTokenizer`s, `Processor`s and `Decoder`s accept arbitrary parameters without any warning. `BPE` (and maybe other models) accepts arbitrary parameters, outputting a warning but not raising an exception.
- The `trim_offsets` option of the `ByteLevel` pre-tokenizer is present in the JSON, but cannot be set through the API. For the corresponding decoder, both `trim_offsets` and `add_prefix_space` cannot be set. Maybe these are the result of differences between `transformers` and `tokenizers`, but if that is the case, this could be improved.
- In the `Tokenizer.save` method, the `file` option is documented, but not present in the stub file.
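For the writability point, a hand-written (or smarter generated) stub could distinguish read/write properties. This is only a sketch of what such a `.pyi` could look like, not the current generated file; the attribute names come from the list above, and the annotated types are guesses:

```python
# Sketch of a possible Tokenizer stub (illustrative, not the generated .pyi).
from typing import Optional

class PreTokenizer: ...
class PostProcessor: ...

class Tokenizer:
    # Plain attribute annotations (or @property + @setter pairs) tell the IDE
    # that assignment is allowed, so `tokenizer.pre_tokenizer = ...` no longer warns.
    pre_tokenizer: Optional[PreTokenizer]
    post_processor: Optional[PostProcessor]

    # Declaring only a getter marks the attribute as read-only, which matches
    # the actual behaviour of `padding` and `truncation`.
    @property
    def padding(self) -> Optional[dict]: ...
    @property
    def truncation(self) -> Optional[dict]: ...
```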
Hi @felix-schneider ,
Yes, the .pyi files are intended as a help, and AFAIK there's no way to make them perfectly consistent (as the bindings are in Rust, so it's custom code anyway).
But as you mention, there's definitely room for improvement. Internally it's a bit low priority since the .pyi files are only helpers for some text editors, so any help here is appreciated.
If you want to help though, all the elements are contained in the Rust code (https://github.com/huggingface/tokenizers/blob/master/bindings/python/src/pre_tokenizers.rs#L248 for instance).
The docstring is the Rust docstring, and the signature produced in the .pyi is the `text_signature`.
Then the .pyi are generated with this script: https://github.com/huggingface/tokenizers/blob/master/bindings/python/stub.py (shouldn't require any change).
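Concretely, everything the generator can work with is what the compiled module exposes on each class. A quick way to see that raw material (simplified, not the actual stub.py; `Whitespace` is just an arbitrary example class):

```python
# Inspect what the pyo3 bindings expose for one class; stub.py assembles the
# .pyi from exactly this kind of information across the whole module.
from tokenizers import pre_tokenizers

cls = pre_tokenizers.Whitespace
print(cls.__doc__)                                # the Rust docstring
print(getattr(cls, "__text_signature__", None))   # plain-text signature if set; carries no types
```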
Does that help?
Hi @Narsil, thanks for your answer.
I'll see what I can do. However, I cannot document decoders because I have no information about them. What does a decoder do? It is not mentioned in the docs at all.
Is there a way to make the generated bindings smarter? Would it be too much maintenance effort to write them by hand? Stub files go a long way toward making code easier to use, and they are parsed by all major IDEs (not just "some text editors").
Right now the library is quite clunky to use, especially without the wrapper from `transformers`.
The docs for decoders are here: https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=decoder#decoders
As they mention, a decoder is a way to revert some of the modifications applied during tokenization when getting text back (while decoding :D).
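A small illustration of where it plugs in (the byte-level setup is just an example, and the vocab/merges are omitted for brevity):

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE

# Example setup: a byte-level BPE tokenizer (untrained here, so no vocab/merges).
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# The ByteLevel decoder reverts the byte-to-unicode mapping the pre-tokenizer
# applied, so decoded text doesn't contain artifacts such as the 'Ġ' marker.
tokenizer.decoder = decoders.ByteLevel()

# With a trained vocab you would then round-trip like:
#   ids = tokenizer.encode("Hello world").ids
#   text = tokenizer.decode(ids)   # -> "Hello world"
```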
The generated bindings can definitely be made smarter; the stub.py file linked above is the starting place, I think.
However, I remember that adding types wasn't easy (if it's possible at all): in particular, Python's `__text_signature__` mechanism doesn't support them, and if types have to be imported it's even worse.
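To make that constraint concrete: a compiled (C/Rust) callable exposes its signature only as a plain string, which `help()` and `inspect` then parse. Using CPython's built-in `sum` as an analogy (not the tokenizers bindings):

```python
import inspect

# Compiled callables carry their signature as a plain string; help() and
# inspect.signature() both parse this text rather than real annotations.
print(getattr(sum, "__text_signature__", None))   # e.g. "($module, iterable, /, start=0)"
print(inspect.signature(sum))                     # e.g. (iterable, /, start=0)

# Since it is only parsed text, it cannot carry type annotations or refer to
# imported types, which is what makes fully typed signatures hard to expose
# from extension modules.
```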
I also remember that different editors parse the .pyi files differently, so we need to check that it works across most vendors.
You also have to account for `help(xxx)` within a console, which will also parse the signature and can have trouble with type annotations.