
user defined char set

Open wenjie-p opened this issue 3 years ago • 4 comments

Hi,

Thanks for this wonderful toolkit you have built! If my understanding is right, the toolkit takes all letters and punctuation as the char set to merge, where each char in the set satisfies len(char) == 1. In my experiments, however, I need to take symbols consisting of two chars (e.g. "ab") as the basic units to merge. Is it possible to accomplish this?

Thanks in advance!

wenjie-p avatar Apr 18 '21 07:04 wenjie-p

I am also wondering whether it is possible to specify a set of basic units to train the BPE/unigram model. Based on this issue, it seems that --user_defined_symbols is designed to specify a set of pre-defined pieces that will not be merged with other pieces, so it does not seem to suit this purpose.
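(For reference, a minimal sketch of how --user_defined_symbols is passed through the Python trainer; the file names and vocab size are placeholders. It illustrates the behavior described above: each listed symbol becomes an atomic piece that is never split, but also never merges with neighbouring text.)

```python
import sentencepiece as spm

# Placeholder corpus/prefix/size; adjust to your data.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="m",
    vocab_size=8000,
    user_defined_symbols=["ab", "cd"],  # always kept as single pieces
)

sp = spm.SentencePieceProcessor(model_file="m.model")
# "ab" surfaces as its own piece, but it will not merge with the
# surrounding text into a larger piece such as "abc".
print(sp.encode("abc", out_type=str))
```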

Thanks in advance!

wnhsu avatar May 07 '21 17:05 wnhsu

Hi @wnhsu , I recently tried mapping the basic units I wanted (i.e., sequences of two or more chars) each to a single char, applying the mapping to all my text data. So I essentially translate my data with a predefined mapping rule. This method is not very elegant, but I would like to post it here. I am not sure how to accomplish this with the built-in functionality of this toolkit.

wenjie-p avatar May 09 '21 14:05 wenjie-p

I would like to see this as well. I am dealing with a language where the basic units are instructions of the form string:number, for instance

A:32

or

IN:264

The set of instructions is known, and is finite.

A typical "sentence" in my language looks like

A:32;IN:264;H:7;W:3.

Sentencepiece might tokenize this sentence as

['A:32;I', 'N:26', '4;H:7;W:', '3'].

However, I would prefer my basic units not be split.

To work around this apparent limitation of sentencepiece, I have created a mapping that replaces each instruction with a unique Unicode character, but it feels quite hacky, it makes my source files difficult to read, and it is difficult to work with. A sketch of this workaround is shown below.
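(For concreteness, here is a minimal sketch of the substitution workaround described in this comment and the one above; the instruction set and file handling are illustrative. Each known instruction is mapped to a unique private-use-area code point before training/encoding, and the mapping is inverted after decoding, so each unit reaches sentencepiece as a single unsplittable char.)

```python
import re

# Illustrative finite instruction set; in practice, enumerate yours.
INSTRUCTIONS = ["A:32", "IN:264", "H:7", "W:3"]

# Assign each instruction a unique private-use-area code point (U+E000...).
to_char = {ins: chr(0xE000 + i) for i, ins in enumerate(INSTRUCTIONS)}
to_ins = {c: ins for ins, c in to_char.items()}

# Match longer instructions first so "IN:264" wins over any shorter prefix.
pattern = re.compile(
    "|".join(re.escape(i) for i in sorted(INSTRUCTIONS, key=len, reverse=True))
)

def encode_units(text: str) -> str:
    """Replace each instruction with its single-character stand-in."""
    return pattern.sub(lambda m: to_char[m.group(0)], text)

def decode_units(text: str) -> str:
    """Restore the original instructions after sentencepiece decoding."""
    return "".join(to_ins.get(c, c) for c in text)

s = "A:32;IN:264;H:7;W:3"
assert decode_units(encode_units(s)) == s
```

The substituted text can then be fed to sentencepiece as usual: since each unit is now one character, it can participate in merges but can never be split internally.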

m-malandro avatar Feb 16 '22 04:02 m-malandro

A related issue/suggestion:

In our setup, we can't output the word-boundary marker "▁" as a complete token; it can only occur in combination with subsequent characters (e.g. "▁a" or "▁and"). An option to prohibit certain characters or character combinations from forming their own tokens would be appreciated.

I tried circumventing this problem with https://github.com/google/sentencepiece#vocabulary-restriction. However, this option doesn't seem to be implemented in the Python API.
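(As an aside: the underlying C++ SentencePieceProcessor does expose SetVocabulary/ResetVocabulary for this restriction; whether your installed Python wheel wraps them depends on the release. A minimal sketch, assuming your version's wrapper exposes them and using a hypothetical model and whitelist:)

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # placeholder model

# Restrict encoding to a whitelist of pieces; where possible, the
# segmenter avoids producing pieces outside this set.
sp.SetVocabulary(["▁a", "▁and"])  # hypothetical whitelist
print(sp.encode("a and b", out_type=str))

sp.ResetVocabulary()  # lift the restriction again
```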

Thanks

mlmsft avatar Sep 03 '22 00:09 mlmsft