End-of-word suffix is not serialized for WordPiece
If you save a WordPiece model to JSON, it won't contain end_of_word_suffix. It seems to be missing from the WordPiece serialization - see src/models/wordpiece/serialization.rs compared to src/models/bpe/serialization.rs.
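A quick way to check a saved file for this, as a rough sketch only: the two JSON strings below are hand-written, abbreviated stand-ins for saved tokenizer files (their layout is an assumption based on what the serializers appear to write), and `model_fields` is a hypothetical helper, not part of the library.

```python
import json


def model_fields(tokenizer_json: str) -> set:
    """Return the field names found in the "model" section of a serialized tokenizer."""
    return set(json.loads(tokenizer_json).get("model", {}).keys())


# Hand-written, abbreviated examples (assumed layout, not real library output).
wordpiece_json = json.dumps({
    "model": {
        "type": "WordPiece",
        "unk_token": "[UNK]",
        "continuing_subword_prefix": "##",
        "max_input_chars_per_word": 100,
        "vocab": {"[UNK]": 0},
    }
})
bpe_json = json.dumps({
    "model": {
        "type": "BPE",
        "unk_token": None,
        "continuing_subword_prefix": None,
        "end_of_word_suffix": None,
        "vocab": {},
        "merges": [],
    }
})

# Report whether each serialized model carries the suffix field.
for name, blob in [("WordPiece", wordpiece_json), ("BPE", bpe_json)]:
    print(name, "end_of_word_suffix" in model_fields(blob))
```

To run it against a real file, read the saved JSON from disk and pass its contents to `model_fields` in place of the stand-in strings.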
I took a stab at it in #569 and realized that the issue is not serialization. Rather, the suffix is not supported anywhere in the WordPiece model. The curious thing, though, is that it's presented as an option in the Python interface, and the vocab is generated with the provided suffix.
My guess is that this is due to how WordPiece is a wrapper around BPE (see #570), but I don't understand this codebase enough to really know. Anyway, would be great to have it either supported with serialization or not supported at all :)
Thank you for reporting this! See https://github.com/huggingface/tokenizers/issues/570 for the explanation about the differences.
We should definitely remove the end_of_word_suffix option from the WordPieceTrainer as it makes absolutely no sense to use it. It should never have been added in the first place :smile: