Incorrect segmentation with SentencePiece?
I've been getting unusual results from the Unigram model, so I dived deeper into the code.
A clear difference between the google implementation and yours is that you add \0 to the end of each word - this isn't done in the google implementation, and I believe for good reason (even though they imply it is done). You actually call them out on this in the comments ("Comment suggests we add sentence boundary, but it seems to be missing from actual code in spm.").
Everything I read about suffix arrays implies that when you concatenate together multiple strings, you're supposed to terminate each one with a different terminator, i.e.
"cat","bat","sat" becomes "cat#bat$sat%", giving the suffixes
#
$
%
at#
at$
at%
bat$
cat#
sat%
t#
t$
t%
If you use the same terminator you get the same suffix array but different auxiliary arrays, i.e. the LCP array is invalid, as are the auxiliary arrays from the enhanced suffix array algorithm (using multiple strings is called a generalized suffix array). For example, if I use $ as the terminator then the LCPs include the terminator and potentially the start of the next word (at#, at%, at$ all become at$, so the LCP is 3, not 2, and the terminator is included in the substrings). The code which eliminates any substrings containing the terminator then eliminates many valid substrings. It's easy to adapt the brute force SA + LCP computation to account for terminators, but I really don't know how to use the ESA properly in this case - adding a different terminator to each word in the corpus is infeasible!
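To make "adapt the brute force SA + LCP to account for terminators" concrete, here is a minimal brute-force sketch. This is my own illustration, not code from either library: it joins the words with a single \0 separator and simply refuses to extend a common prefix across it.

// Brute-force sketch (my own illustration, not library code): build a suffix array
// over the words joined with a single '\0' separator, then compute an LCP array
// that never extends a match across the separator, so no common prefix can spill
// over a word boundary.
fn brute_force_sa_lcp(words: &[&str]) -> (Vec<char>, Vec<usize>, Vec<usize>) {
    let sep = '\0';
    let text: Vec<char> = words
        .iter()
        .flat_map(|w| w.chars().chain(std::iter::once(sep)))
        .collect();
    let n = text.len();

    // Naive suffix array: sort suffix start positions by the suffix they point at.
    // O(n^2 log n) worst case, which is fine for a corpus of short strings.
    let mut sa: Vec<usize> = (0..n).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));

    // lcp[i] = common prefix length of suffixes sa[i-1] and sa[i], stopping at the
    // separator (this is the terminator-aware part).
    let mut lcp = vec![0usize; n];
    for i in 1..n {
        let (a, b) = (sa[i - 1], sa[i]);
        let mut h = 0;
        while a + h < n && b + h < n && text[a + h] == text[b + h] && text[a + h] != sep {
            h += 1;
        }
        lcp[i] = h;
    }
    (text, sa, lcp)
}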
If you study the google code, consider the string "▁cat▁bat▁sat▁bath", i.e. the corpus is tokenized into words, each word is prefixed with ▁, and everything is concatenated (ignoring the \0 terminator). I believe that regardless of whether your vocab is words or sentences the end result is the same: everything is concatenated together with the ▁ prefix replacing spaces.
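To illustrate what I mean (my own toy snippet, not the google or huggingface code): the pre-tokenized words are simply glued back together with ▁ standing in for the removed spaces, and no terminator between them.

// Toy illustration: rebuild the flat training string the way I read the google code,
// i.e. prefix every word with '▁' and concatenate, with no '\0' separators.
fn concat_with_meta_symbol(words: &[&str]) -> String {
    words.iter().map(|w| format!("▁{}", w)).collect()
}

// concat_with_meta_symbol(&["cat", "bat", "sat", "bath"]) == "▁cat▁bat▁sat▁bath"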
Given the same string, the huggingface ESA algorithm returns the same SA and auxiliary arrays; however, you pass "▁cat\0▁bat\0▁sat\0▁bath\0" instead. The google code then checks for the \0 terminator, as does the huggingface code - but here you remove many candidates which google doesn't, since they omitted to add the terminator in the first place.
Then in the google implementation TrainerInterface::IsValidSentencePiece performs the filtering - this is why I believe they deliberately didn't terminate the strings. The comment says:
// Only allows whitespace to appear as a prefix of piece.
// When split_by_whitespace is false, we allow whitespaces to
// appear in the middle, "foo_bar", but do not allow them
// to appear as suffix, "foo_bar_".
// Regardless of the setting of split_by_whitespace,
// whitespace is treated as a prefix/infix of symbol or
// independent symbol.
This code is expecting substrings like "foo_bar" and not "foo\0_bar"
Also, you don't do this check at all - the docstring of is_valid_sentencepiece implies this is taken care of in the pre-tokenizer. My understanding is that sentencepiece is designed to (optionally) be trained with a pre-tokenizer but deployed without one, which I guess is why spm does the check here.
// Checks string length
// Space not in the substring, numbers, hiragana and more should be taken
// care of within pre_tokenizers.
// https://github.com/google/sentencepiece/blob/26be9516cd81d5315ee31c48d2438018e0eab879/src/trainer_interface.cc#L203
I don't think this comment is valid, since the pre-tokenization is effectively undone when the tokens are concatenated back together for the ESA substring enumeration.
In any case, I think at a minimum you need to:
- Not add the \0 terminator
- In is_valid_sentencepiece, return false if char_string includes ▁ other than as a prefix (a sketch of this check is below)
I think this will then replicate spm with split_by_whitespace=true at least.
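To make the second bullet concrete, here is a minimal sketch of the kind of check I have in mind (the function name is mine, not the library's):

// Hypothetical helper (name is mine): accept a candidate piece only if the
// word-boundary marker '▁' appears nowhere except possibly as the first character,
// mirroring the intent of spm's IsValidSentencePiece with split_by_whitespace=true.
fn whitespace_only_as_prefix(piece: &[char]) -> bool {
    piece.iter().skip(1).all(|&c| c != '▁')
}

// e.g. "▁cat" -> true, "ca▁t" -> false, "▁cat▁" -> false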
I'm not sure we want to be 100% spm compliant on the training side. @n1t0 ?
One goal of this library is to be as modular as possible, so taking care of sentencepiece seeds with the pre_tokenizers was actually desirable (so we don't have to rely on extremely custom logic like spm for grouping tokens by words and being forced to use a custom space separator).
Even if we made those modifications (the original Unigram version was almost a copycat of spm, so they used to be there), we could never achieve 100% accuracy, because spm uses a mix of f32 and f64, whereas we stick to f64 everywhere. I dived into it a bit at some point, but after discussions with @n1t0 we considered it was not worth it. Instead the idea was to modify Unigram to make it closer to the tokenizers library, i.e. trying to decouple the pre_tokenizers logic from the actual Unigram code. If a user wants an spm compliant tokenizer, the best bet is to use spm and then convert it into a tokenizers one with transformers (some conversion logic is there too: https://github.com/huggingface/tokenizers/blob/master/bindings/python/scripts/convert.py)
That being said, adding a flag to use or not use the \0 terminator is definitely an option (it definitely makes sense for unigram to prevent tokens spanning 2 different sentences from existing, but as you mentioned spm's code does not do that). Maybe in the same vein, adding an optional custom Fn that would be called during is_valid_sentencepiece could be doable (so we could pass something that checks for space prefixing). @n1t0 any other ideas?
Even if we do that though, don't expect 100% compliance with spm. It should be close to perfect at the token level, but probably not all the way (there are cases where energies are so close together that float differences could come into play). We also don't use the same unicode normalization (spm uses a custom NFKC by default, with custom rules for Japanese characters that we don't implement, and there's one rule in particular that is removed from NFKC that would make full compliance very hard to achieve without just using spm).
@Narsil the more I think about it, the more I think the end result is much the same from spm and this library. By pre-tokenizing, adding the '\0' and filtering out all substrings which contain it, you achieve much the same result as not pre-tokenizing and then using is_valid_sentencepiece to filter out strings not prefixed by ▁ or longer than 16 chars.
Where I think the issue lies is in what I believe is the incorrect application of the ESA in the first place. By using the same character to denote the end of all strings instead of a different one, you end up with a different initial set of substrings that spill over the boundary. Once all substrings which do so are eliminated you end up with a smaller initial class, i.e.
esaxx_rs::suffix_rs("▁cat$▁bat$▁sat$") results in
i freq len sentencepiece
1 3 1 ['$']
3 3 3 ['a', 't', '$']
5 3 2 ['t', '$']
6 3 1 ['▁']
The first 3 are filtered, the result is no substrings! The model becomes character-based.
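For clarity, the filtering I'm referring to is just dropping any candidate that contains the terminator; a toy version of it (my own code, the real types in the trainer differ):

// Toy version of the filtering step: drop every candidate substring that contains
// the terminator character. With a shared '$' terminator this removes "$", "at$"
// and "t$" above, leaving only single characters as seeds.
fn drop_candidates_with_terminator(
    candidates: Vec<(Vec<char>, usize)>, // (piece, frequency)
    terminator: char,
) -> Vec<(Vec<char>, usize)> {
    candidates
        .into_iter()
        .filter(|(piece, _freq)| !piece.contains(&terminator))
        .collect()
}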
esaxx_rs::suffix_rs("▁cat#▁bat$▁sat%") results in
i freq len sentencepiece
0 3 2 ['a', 't']
1 3 1 ['t']
2 3 1 ['▁']
For a large diverse corpus, the end result seems ok, although I suspect the expectations are skewed somewhat. For my corpus, I have structured strings that often start and/or end with a common substring. These tend to spill over the terminator and get removed.
What I'm thinking of doing is modifying the rust code to use a random terminator from a range outside the alphabet (I only require a-z0-9 and some punctuation) and seeing how many more initial classes it produces. I have found that shuffling the training data for my specific data tends to break up runs of common (prefix, body, suffix) patterns.
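As a sketch of that terminator idea (my own code, with the Unicode private use area standing in for "a range outside the alphabet"; a real version would pick randomly rather than cycle):

// Sketch: join words using terminators drawn from a code range outside my alphabet
// (here the BMP private use area U+E000..=U+F8FF is used as a stand-in). Cycling
// through the range means terminators eventually repeat, but adjacent words never
// share one, which is enough to break up runs of identical boundaries.
fn join_with_varied_terminators(words: &[&str]) -> String {
    let base = 0xE000u32;
    let range = 0x1900u32; // number of code points in U+E000..=U+F8FF
    let mut out = String::new();
    for (i, w) in words.iter().enumerate() {
        out.push_str(w);
        // Safe to unwrap: every value in the private use area is a valid char.
        out.push(char::from_u32(base + (i as u32 % range)).unwrap());
    }
    out
}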
The other option is to investigate a (potentially) slower version of SA which generalizes to multiple strings. For my case, where I have a large corpus of short strings, I think I can use brute force to generate the SA and LCP arrays and then, from these, the correct list of substrings, but I'm still trying to understand the algorithm.
I do think though your approach of pre-tokenizing is the correct one, provided the tokens are properly processed by the SA algorithm.
Also I found a good multi-string SA algorithm, gsa-is. In the associated publication doi.org/10.1016/j.tcs.2017.03.039 they discuss the different ways of concatenating strings (using either the same or different terminators) and the pros and cons of each.
In particular, when using the same end of string terminator they state that "this alternative may cause standard algorithms to incorrectly compute the LCP array, because lcp-values may exceed the length of the strings"
This is what is happening in esaxx-rs. The LCP is computed in esa.rs; I made a trivial modification, adding an additional check to the PLCP loop to stop extending the longest common prefix when the terminator is encountered (in this case I used 36, which is '$').
// Compare at most 2n log n characters. Practically fastest
// "Permuted Longest-Common-Prefix Array", Juha Karkkainen, CPM 09
// PLCP = r
let mut h = 0;
let terminator = 36; // added end of string sentinel
for i in 0..n {
    if string[i] == terminator {
        continue;
    }
    let j = left[i];
    while i + h < n && j + h < n && string[i + h] == string[j + h]
        && string[i + h] != terminator
    {
        h += 1;
    }
    right[i] = h;
    if h > 0 {
        h -= 1;
    }
}
I believe making this change returns all the common substrings (of freq > 1) which don't span the word/sentence boundaries. I've validated against gsa-is which contains similar code to generate the LCP from the SA but using a terminator.
I've also worked out that in esa.hxx there are 3 steps:
- Lines 41-59 compute the PLCP, from Karkkainen 09.
- Lines 61-65 convert this to the LCP (H) using Eq. 1 (PLCP[SA[j]] = LCP[j]) from Karkkainen 09.
- Lines 67-90 are the substring traversal algorithm, which depends only on the LCP and T (the original text), from Toru Kasai 01.
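Step 2 is the only trivial one; as I read it, it is just a permutation of the PLCP by the suffix array (a sketch in Rust, not the actual esa.hxx code):

// Sketch of step 2: reorder the PLCP (indexed by text position) into the LCP
// (indexed by suffix-array rank) via LCP[j] = PLCP[SA[j]].
fn plcp_to_lcp(plcp: &[usize], sa: &[usize]) -> Vec<usize> {
    sa.iter().map(|&pos| plcp[pos]).collect()
}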
And another thing I've noticed today: the sentences argument which is passed to make_seed_sentence_pieces seems to be a mapping from strings to the count of each distinct string in the corpus (after pre-tokenisation). If you split into words, it's essentially a list of words with their counts.
The issue I see with that is that you then don't get the same result from the suffix algorithm.
For example, if you take the contrived example "cat cat cat cat", then instead of passing "cat\0cat\0cat\0cat" the string passed is just "cat\0".
Now of course there's no common substring of cat, so the word cat doesn't get added to the vocab.
I don't think this occurs in spm since they don't pretokenise. A hacky fix is to loop n times (once per count) when adding each string to the flat_string which is constructed.
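A sketch of that hacky fix (my own code; the function name and types are illustrative, the trainer's actual structures differ):

use std::collections::HashMap;

// Hacky fix, sketched: when building the flat string for the suffix-array step,
// append each distinct word `count` times instead of once, so that words repeated
// in the corpus can still be found as repeated substrings.
fn build_flat_string(word_counts: &HashMap<String, u32>) -> String {
    let mut flat = String::new();
    for (word, &count) in word_counts {
        for _ in 0..count {
            flat.push_str(word);
            flat.push('\0');
        }
    }
    flat
}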
Many things here:
Overall I think in order to make any changes, we would need some kind of benchmark to judge the overall quality of the end tokenization on various datasets. In my experience, tokenization is a very subtle beast, and innocent changes might actually hurt the end result of tokenization quite a bit. For instance, the reason for the 16 max characters per token is that, in the wild, there are extremely long "words" like "-----------------------------------------------------" (in a markdown file for instance). Those start using up lots of tokens pretty fast.
> For my corpus, I have structured strings that often start and/or end with a common substring. These tend to spill over the terminator and get removed.
If I understand correctly, because your data contains lots of common prefixes/suffixes, then MOST of the seeds will contain the null terminator and get removed, so you don't get good quality seeds for that type of data, is that right? That's where shuffling could help indeed.
> Where I think the issue lies is in what I believe is the incorrect application of the ESA in the first place. By using the same character to denote the end of all strings instead of a different one, you end up with a different initial set of substrings that spill over the boundary. Once all substrings which do so are eliminated you end up with a smaller initial class, i.e.
I see what you mean, but that means you have very similar data passed as sentences, which might explain the spilling. It could also explain why spm does not do it. The cost of course is that you will have tokens across boundaries, which might hurt performance down the road anyway, no? (Because they can't actually be used in practice, since they are never encountered in the wild.)
> This is what is happening in esaxx-rs. The LCP is computed in esa.rs; I made a trivial modification, adding an additional check to the PLCP loop to stop extending the longest common prefix when the terminator is encountered (in this case I used 36, which is '$').
I'm not too big a fan of straying away from the sais algorithm; to be fair, I don't understand it fully enough to grasp the full consequences of such a change. In order to do that, I think we would have to prove in some way that the resulting tokenizers are overall better, no (like having a more evenly distributed token distribution on the end dataset)?
> I don't think this occurs in spm since they don't pretokenise. A hacky fix is to loop n times (once per count) when adding each string to the flat_string which is constructed.
Yes. As it happens, we were discussing changing this "<string, int32>" mapping of "word_counts" at some point anyway, to give more freedom to the underlying models. If you want to implement it though, adding each string twice should be enough; during EM maximization, the count is effectively passed so that the energy is correct:
https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/models/unigram/trainer.rs#L410
let z: f64 = lattice.populate_marginal(*freq as f64, &mut expected);
The goal of not adding those n times is so that pretokenization can actually play its role and reduce the memory footprint of the overall algorithm.
> If I understand correctly, because your data contains lots of common prefixes/suffixes, then MOST of the seeds will contain the null terminator and get removed, so you don't get good quality seeds for that type of data, is that right? That's where shuffling could help indeed.
Yes, that's correct: without correcting the esaxx algorithm most (almost all) of the common substrings get removed and the initial seed is extremely poor for my corpus. I get much worse results using the default unigram model compared to both BPE and unigram with the modifications I've made - i.e. when I train a RoBERTa model with MLM, the loss is lower with those tokenizers than with the default unigram.
> I'm not too big a fan of straying away from the sais algorithm; to be fair, I don't understand it fully enough to grasp the full consequences of such a change. In order to do that, I think we would have to prove in some way that the resulting tokenizers are overall better, no (like having a more evenly distributed token distribution on the end dataset)?
I agree. I guess one test is the size (and distribution) of the initial seed - does fixing the LCP calculation increase the seed vocab, and if so by how much? I guess the best test is whether the loss for downstream tasks reduces when the LCP is calculated correctly. I've implemented the fix I mentioned above and created a tokenizer.json which I then imported into a RoBERTa MLM model. The loss dropped from 0.8 to 0.5 - a BPE based tokenizer results in a loss of ~0.6.
BTW this isn't actually part of SA-IS; that algorithm is implemented in saisxx and only returns a suffix array (SA), which is the array of sorted suffixes, and as far as I can see is 100% correct. It's suffixtree, which seems to be based on "Permuted Longest-Common-Prefix Array" (Juha Karkkainen, CPM 09) to compute the common prefix lengths of adjacent suffixes (PLCP[SA[j]] = LCP[j]), which doesn't correctly calculate the prefix length when a string terminator is used. This is where the "enhanced" comes from: you calculate an SA and ancillary arrays (i.e. the LCP) for downstream indexing tasks.
From what I understand there are numerous suffix array algorithms, and many enhanced versions (which create additional arrays); the enumeration of substrings is then a downstream task, and it's difficult to find a clear description of how it's done. I cannot find a source for the tree-based iterator code in suffixtree; I don't think it's in Karkkainen 09 or Nong et al 09. The best I've found is https://www.cs.jhu.edu/~kchurch/wwwfiles/CL_suffix_array.pdf but it looks different to me.
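For reference, the enumeration I ended up understanding is the standard stack-based walk over the LCP array (a sketch of the general technique, not the actual suffixtree code): each popped LCP interval is a right-maximal repeated substring, and its width is the frequency.

// Sketch of the standard stack-based LCP-interval traversal: report every
// right-maximal repeated substring with its frequency. lcp[i] is the common prefix
// length of suffixes sa[i-1] and sa[i] (lcp[0] is unused).
fn enumerate_repeats(text: &[char], sa: &[usize], lcp: &[usize]) -> Vec<(String, usize)> {
    let n = sa.len();
    let mut out = Vec::new();
    // Stack of (lcp_value, left_boundary_in_sa); the (0, 0) sentinel is never popped.
    let mut stack: Vec<(usize, usize)> = vec![(0, 0)];
    for i in 1..=n {
        let cur = if i < n { lcp[i] } else { 0 };
        let mut lb = i - 1;
        while let Some(&(top_lcp, top_lb)) = stack.last() {
            if top_lcp <= cur {
                break;
            }
            stack.pop();
            lb = top_lb;
            // Suffixes sa[top_lb..=i-1] all share a prefix of length top_lcp.
            let piece: String = text[sa[top_lb]..sa[top_lb] + top_lcp].iter().collect();
            out.push((piece, i - top_lb));
        }
        if stack.last().map_or(true, |&(l, _)| l < cur) {
            stack.push((cur, lb));
        }
    }
    out
}

Filtering the reported pieces (or, with the LCP fix above, never letting them cross a boundary in the first place) then gives the seed candidates.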
I'm happy to keep using my fork for now if this is too much risk. I've created my own fork of saisxx_rs which accepts an optional terminator and then made a few minor changes to tokenizers to use it.
Are you aware of any code from the original paper (Kudo 2018, "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates")? The generation of the seed vocab wasn't covered in detail. All the implementations I've looked at use esaxx from sentencepiece.