tokenizers
Inconsistencies between documentation and API
There are several issues with the documentation and the `.pyi` stub files:
- The documentation does not mention decoders at all.
- In the stub file for `Tokenizer`, all of the properties like `pre_tokenizer`, `post_processor` etc. are not marked as writable, resulting in warnings from the IDE (see the sketch after this list). This is in contrast to `padding` and `truncation`, which are actually not writable.
- The `PreTokenizer`s, `Processor`s and `Decoder`s accept arbitrary parameters without warning.
- `BPE` (and maybe other models) accepts arbitrary parameters, outputting a warning but not an exception.
- The `trim_offsets` option of the `ByteLevel` preprocessor is present in the JSON, but cannot be set through the API. For the corresponding decoder, both `trim_offsets` and `add_prefix_space` cannot be set. Maybe these are the result of differences between `transformers` and `tokenizers`, but if that is the case, this could be improved.
- In the `Tokenizer.save` method, the `file` option is documented, but not present in the stub file.
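For illustration, here is a minimal sketch of the kind of assignment that the library supports at runtime but that the current stubs flag, because the properties are not declared writable (the exact warning depends on the IDE):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Both assignments work at runtime, but since the stubs expose
# `pre_tokenizer` / `post_processor` as read-only properties,
# type checkers and IDEs report a warning here.
tokenizer.pre_tokenizer = Whitespace()
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```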
Hi @felix-schneider,
Yes, the `.pyi` files are intended as a help, and AFAIK there's no way to make them perfectly consistent (as the bindings are in Rust, so it's custom code anyway).
But as you mention, there's definitely room for improvement. Internally it's a bit of a low priority, since the `.pyi` files are only helpers for some text editors, and any help here is appreciated.
If you want to help, though, all the elements are contained in the Rust code (https://github.com/huggingface/tokenizers/blob/master/bindings/python/src/pre_tokenizers.rs#L248 for instance).
The docstring is the Rust docstring, and the signature produced in the `.pyi` is the `text_signature`.
Then the `.pyi` files are generated with this script: https://github.com/huggingface/tokenizers/blob/master/bindings/python/stub.py (it shouldn't require any change).
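As a quick way to see what that pipeline is working from (a sketch, not part of the repo), the Rust doc comment and the `text_signature` are already visible from the Python side:

```python
from tokenizers import pre_tokenizers

# The docstring shown here is the doc comment from the Rust source,
# and the displayed constructor signature comes from text_signature;
# stub.py reads the same information when generating the .pyi files.
help(pre_tokenizers.Whitespace)
```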
Does that help?
Hi @Narsil, thanks for your answer.
I'll see what I can do. However, I cannot document `decoder` because I have no information about it. What does a decoder do? It is not mentioned in the docs at all.
Is there a way to make the generated bindings smarter? Or would writing the bindings by hand be too much effort to maintain? Stub files go a long way towards making code easier to use, and they are parsed by all major IDEs (not "some text editors").
Right now the library is quite clunky to use, especially without the wrapper from `transformers`.
The doc for `decoder` is here: https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=decoder#decoders
As it mentions, it's a way to revert some modifications when getting back text (while decoding :D).
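A small sketch of what that looks like in practice (assuming a tokenizers version where decoder objects expose a standalone `decode` helper; otherwise the decoder is only applied through `Tokenizer.decode`):

```python
from tokenizers import decoders

# WordPiece splits words into sub-word tokens prefixed with "##";
# the matching decoder reverts that modification when turning
# tokens back into text.
decoder = decoders.WordPiece(prefix="##")
print(decoder.decode(["un", "##divided", "attention"]))
# -> "undivided attention"
```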
The generated bindings can definitely be made smarter; the `stub.py` file linked above is the starting place, I think.
However, I remember that adding types wasn't easy (if possible at all): the `__text_signature__` mechanism of Python wouldn't work with them, and if types have to be imported it's even worse.
I also remember that different editors would parse the `.pyi` differently, so we need to check that it works across most vendors.
You also have to account for `help(xxx)` within a console, which will also parse the signature and can have trouble with type annotations.
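For example, something like the following relies on that plain signature string being present and parseable (a sketch; `Tokenizer.encode` is only used as an arbitrary example, and whether `__text_signature__` is populated depends on the PyO3 version):

```python
import inspect
from tokenizers import Tokenizer

# Both of these reconstruct the signature from the plain string the
# bindings attach (__text_signature__); embedding Python type
# annotations in that string is not handled reliably by such consumers.
print(Tokenizer.encode.__text_signature__)
print(inspect.signature(Tokenizer.encode))
```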
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.