
End-of-word suffix is not serialized for WordPiece

Open bfelbo opened this issue 5 years ago • 3 comments

If you save a WordPiece model to JSON, the output won't contain end_of_word_suffix. The field seems to be missing from the WordPiece serialization: compare src/models/wordpiece/serialization.rs with src/models/bpe/serialization.rs.
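A minimal sketch of how one might check a saved tokenizer file for the field, using only the standard library. The two JSON strings below are hand-written, abbreviated stand-ins for a saved file's "model" section (an assumption for illustration, not real library output), and the helper names are hypothetical.

```python
import json


def model_fields(tokenizer_json: str) -> set:
    """Return the keys stored under the "model" section of a serialized tokenizer."""
    return set(json.loads(tokenizer_json)["model"].keys())


def has_end_of_word_suffix(tokenizer_json: str) -> bool:
    """True if the serialized model carries an end_of_word_suffix field at all."""
    return "end_of_word_suffix" in model_fields(tokenizer_json)


# Hand-written, abbreviated examples (assumed shapes, not actual output).
WORDPIECE_JSON = json.dumps({
    "model": {
        "type": "WordPiece",
        "unk_token": "[UNK]",
        "continuing_subword_prefix": "##",
        "max_input_chars_per_word": 100,
        "vocab": {"[UNK]": 0},
    }
})
BPE_JSON = json.dumps({
    "model": {
        "type": "BPE",
        "unk_token": None,
        "continuing_subword_prefix": None,
        "end_of_word_suffix": "</w>",
        "vocab": {},
        "merges": [],
    }
})

print("WordPiece has suffix field:", has_end_of_word_suffix(WORDPIECE_JSON))
print("BPE has suffix field:", has_end_of_word_suffix(BPE_JSON))
```

Running the same check on a file saved from a real tokenizer would show whether the field survives serialization in a given version.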

bfelbo avatar Dec 19 '20 22:12 bfelbo

I took a stab at it in #569 and realized that the issue is not serialization. Rather, the suffix is not supported anywhere in the WordPiece model. The curious thing, though, is that it's presented as an option in the Python interface, and the vocab is generated with the suffix provided.

My guess is that this is due to how WordPiece is a wrapper around BPE (see #570), but I don't understand this codebase well enough to really know. Anyway, it would be great to have it either supported with serialization or not supported at all :)

bfelbo avatar Dec 19 '20 23:12 bfelbo

Thank you for reporting this! See https://github.com/huggingface/tokenizers/issues/570 for the explanation about the differences.

We should definitely remove the end_of_word_suffix option from the WordPieceTrainer as it makes absolutely no sense to use it. It should never have been added in the first place :smile:

n1t0 avatar Jan 06 '21 15:01 n1t0

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 26 '24 01:04 github-actions[bot]