tokenizers
Adding treat_whitespace_as_suffix as a new feature to sentencepiece?
Hi, I'd like to suggest adding --treat_whitespace_as_suffix as a new feature to sentencepiece. Since the original sentencepiece has this feature (see here), I think it would be great for Huggingface to have it too. Thanks!
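For context, here is a rough sketch of what the flag changes. It is a word-level illustration only, not how either sentencepiece or this library implements it; `mark_whitespace` is a hypothetical helper, and the sketch assumes the default dummy-prefix behaviour, so every word carries one marker.

```python
from typing import List

# U+2581 ("▁") is the meta symbol sentencepiece uses to represent whitespace.
META = "\u2581"


def mark_whitespace(text: str, as_suffix: bool = False) -> List[str]:
    """Hypothetical helper: split on whitespace and attach the marker
    before each word (default) or after each word (suffix mode)."""
    words = text.split()
    if as_suffix:
        # Suffix mode: the whitespace marker follows the word.
        return [word + META for word in words]
    # Default mode: the whitespace marker precedes the word.
    return [META + word for word in words]


print(mark_whitespace("Hello world"))                  # ['▁Hello', '▁world']
print(mark_whitespace("Hello world", as_suffix=True))  # ['Hello▁', 'world▁']
```

The practical difference is which side of a word the learned pieces attach the whitespace to, which in turn changes the vocabulary a model ends up with.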
@Smu-Tan you're more than welcome to contribute it if you want.
In general this library doesn't really follow the spm architecture: here, normalizing and pre_tokenization are separate steps from the core algorithm.
Don't expect this to be simple, but this lib would benefit tremendously from this.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.