
Adding treat_whitespace_as_suffix as a new feature to sentencepiece?

Open Smu-Tan opened this issue 2 years ago • 2 comments

Hi, I'd like to suggest adding --treat_whitespace_as_suffix as a new feature. The original sentencepiece already has this option (see here), so I think it would be great for Hugging Face tokenizers to support it as well. Thanks!
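For readers unfamiliar with the flag, here is a rough pure-Python sketch of the difference it makes. It is not sentencepiece's implementation: the helper name mark_whitespace is hypothetical, and the sketch assumes the default behaviour of collapsing runs of whitespace and adding one dummy marker per sentence.

```python
# Illustrative sketch only; mark_whitespace is a hypothetical helper,
# not part of sentencepiece or Hugging Face tokenizers.
WS = "\u2581"  # the "▁" (LOWER ONE EIGHTH BLOCK) whitespace marker


def mark_whitespace(text: str, as_suffix: bool = False) -> str:
    """Attach the whitespace marker to each word, as a prefix or a suffix."""
    words = text.split()  # assumption: extra whitespace is collapsed
    if as_suffix:
        # Suffix mode: the marker follows each word.
        return "".join(word + WS for word in words)
    # Default mode: the marker precedes each word.
    return "".join(WS + word for word in words)


if __name__ == "__main__":
    print(mark_whitespace("Hello world"))
    print(mark_whitespace("Hello world", as_suffix=True))
```

In the default mode the marker signals that a new word starts; in suffix mode it signals that a word has ended, which changes which pieces the model can learn.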

Smu-Tan, Nov 20 '22 19:11

@Smu-Tan you're more than welcome to contribute it if you want.

In general, this library doesn't really follow the spm architecture: here, normalization and pre-tokenization are steps separate from the core algorithm.

Don't expect this to be simple, but this lib would benefit tremendously from it.

Narsil, Nov 21 '22 08:11

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot], Jan 17 '24 01:01
