tokenizers
Inconsistencies between documentation and API
There are several issues with the documentation and the `.pyi` stub files:
- The documentation does not mention decoders at all.
- In the stub file for `Tokenizer`, all of the properties like `pre_tokenizer`, `post_processor` etc. are not marked as writable, resulting in warnings from the IDE (see the sketch after this list). This is in contrast to `padding` and `truncation`, which are actually not writable.
- The `PreTokenizer`s, `Processor`s and `Decoder`s accept arbitrary parameters without warning.
- `BPE` (and maybe other models) accepts arbitrary parameters, outputting a warning but not an exception.
- The `trim_offsets` option of the `ByteLevel` preprocessor is present in the JSON, but cannot be set through the API. For the corresponding decoder, both `trim_offsets` and `add_prefix_space` cannot be set. Maybe these are the result of differences between `transformers` and `tokenizers`, but if that is the case, this could be improved.
- In the `Tokenizer.save` method, the `file` option is documented, but not present in the stub file.
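For illustration, here is a minimal sketch of the kind of assignment that the library supports at runtime but that the current stubs flag, because the properties are not declared writable (the exact warning depends on the IDE):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Both assignments work at runtime, but since the stubs expose
# `pre_tokenizer` / `post_processor` as read-only properties,
# type checkers and IDEs report a warning here.
tokenizer.pre_tokenizer = Whitespace()
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```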
Hi @felix-schneider,
Yes, the `.pyi` files are intended as a help, and AFAIK there's no way to make them perfectly consistent (as the bindings are in Rust, so it's custom code anyway).
But as you mention, there's definitely room for improvement. Internally it's a bit of a low priority, since the `.pyi` files are only helpers for some text editors, and any help here is appreciated.
If you want to help, though, all the elements are contained in the Rust code (https://github.com/huggingface/tokenizers/blob/master/bindings/python/src/pre_tokenizers.rs#L248 for instance).
The docstring is the Rust docstring, and the signature produced in the `.pyi` is the `text_signature`.
Then the `.pyi` files are generated with this script: https://github.com/huggingface/tokenizers/blob/master/bindings/python/stub.py (it shouldn't require any change).
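As a quick way to see what that pipeline is working from (a sketch, not part of the repo), the Rust doc comment and the `text_signature` are already visible from the Python side:

```python
from tokenizers import pre_tokenizers

# The docstring shown here is the doc comment from the Rust source,
# and the displayed constructor signature comes from text_signature;
# stub.py reads the same information when generating the .pyi files.
help(pre_tokenizers.Whitespace)
```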
Does that help?
Hi @Narsil, thanks for your answer.
I'll see what I can do. However, I cannot document `decoder` because I have no information about it. What does a decoder do? It is not mentioned in the docs at all.
Is there a way to make the generated bindings smarter? Or would writing the bindings by hand be too much effort to maintain? Stub files go a long way towards making code easier to use, and they are parsed by all major IDEs (not "some text editors").
Right now the library is quite clunky to use, especially without the wrapper from `transformers`.
The doc for `decoder` is here: https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=decoder#decoders
As it mentions, it's a way to revert some modifications when getting back text (while decoding :D).
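A small sketch of what that looks like in practice (assuming a tokenizers version where decoder objects expose a standalone `decode` helper; otherwise the decoder is only applied through `Tokenizer.decode`):

```python
from tokenizers import decoders

# WordPiece splits words into sub-word tokens prefixed with "##";
# the matching decoder reverts that modification when turning
# tokens back into text.
decoder = decoders.WordPiece(prefix="##")
print(decoder.decode(["un", "##divided", "attention"]))
# -> "undivided attention"
```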
The generated bindings can definitely be made smarter; the `stub.py` file linked above is the starting place, I think.
However, I remember that adding types wasn't easy (if possible at all): the `__text_signature__` mechanism of Python wouldn't work with them, and if types have to be imported it's even worse.
I also remember that different editors would parse the `.pyi` differently, so we need to check that it works across most vendors.
You also have to account for `help(xxx)` within a console, which will also parse the signature and can have trouble with type annotations.
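For example, something like the following relies on that plain signature string being present and parseable (a sketch; `Tokenizer.encode` is only used as an arbitrary example, and whether `__text_signature__` is populated depends on the PyO3 version):

```python
import inspect
from tokenizers import Tokenizer

# Both of these reconstruct the signature from the plain string the
# bindings attach (__text_signature__); embedding Python type
# annotations in that string is not handled reliably by such consumers.
print(Tokenizer.encode.__text_signature__)
print(inspect.signature(Tokenizer.encode))
```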
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.