Some builds do not apply duplicate n-gram suffixes
I have encountered a strange bug where my build of the Python SWIG wrapper does not add the suffixes for duplicate n-grams described in footnote 1 of the paper.
I have been compiling the SWIG wrapper from a checkout of the repository at d4dca6813cd134819af57bd2340a197dc9e855ec, using GCC (I believe 9.4.0). I also have a custom Cython wrapper, which I have been building against the same checkout. In some cases the SWIG wrapper and my wrapper return different results from retrieve() calls. For example, if I insert the single string "assesses" into a database and then retrieve "resssessea", the SWIG wrapper matches "assesses" but my wrapper matches nothing.
In ngrams() in ngram.h the following code implements the suffixes for duplicate n-grams:
    // Append numbers if the same n-gram occurs more than once.
    for (int i = 2;i <= it->second;++i) {
        stringstream_type ss;
        ss << it->first << i;
        *ins = ss.str();
    }
Adding debugging output, either here or in the calling code in simstring.h, shows that with my build of the SWIG wrapper every n-gram has size 3, even when there are duplicates. My Cython wrapper, compiled against the same copy of the header file, produces two n-grams of size 4, as expected. I don't know C++ well enough to understand why << i would work in one build but not in another. I tried a number of variations of the code but could not get anything working with stringstream.
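One way to narrow this down might be to check the stream state after the insertion, since a failed `<<` leaves the stream flagged rather than raising a visible error. The sketch below is not SimString code: `append_count` and `SuffixResult` are hypothetical names, and the character type is a template parameter because I have not confirmed which string type each wrapper instantiates. Whether a failed insertion is actually the cause here is an assumption, not something I have verified.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Result of trying to build "<gram><count>" with a string stream.
template <class CharT>
struct SuffixResult {
    std::basic_string<CharT> text;  // whatever the stream produced
    bool stream_ok;                 // false if any insertion failed
};

// Hypothetical helper (not part of SimString): mirrors the
// `ss << it->first << i` pattern and reports the stream state afterwards.
template <class CharT>
SuffixResult<CharT> append_count(const std::basic_string<CharT>& gram, int count) {
    assert(count >= 2);  // suffixes start at 2 in the original loop
    std::basic_stringstream<CharT> ss;
    ss << gram << count;
    SuffixResult<CharT> result;
    result.stream_ok = !ss.fail();  // failbit or badbit set means an insertion failed
    result.text = ss.str();
    return result;
}
```

Calling this with both `std::string` and `std::wstring` arguments inside each wrapper build would show whether the integer insertion fails only for one character type or one build. If `stream_ok` is false and `text` still has the size of the bare n-gram, that would match the size-3 n-grams seen in the debugging output.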
However, the following replacement for that loop in ngrams() should be equivalent for matching purposes and does appear to work with both wrappers:
    // Append numbers encoded as sequences of spaces if the same n-gram occurs more than once.
    for (int i = 2;i <= it->second;++i) {
        string_type s;
        s.append(it->first);
        s.append(i - 1, ' ');
        *ins = s;
    }
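To make the effect of the space-based suffixes concrete, here is a small standalone sketch of the same idea. It is not SimString's ngrams(): `ngrams_with_space_suffixes` is a hypothetical helper, it uses plain `std::string`, and it assumes no begin/end padding marks are added.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not SimString's ngrams(): returns the distinct n-grams
// of `s` (no begin/end padding), plus one extra feature per repeated occurrence.
// The k-th occurrence (k >= 2) is encoded as the n-gram followed by k - 1 spaces.
std::vector<std::string> ngrams_with_space_suffixes(const std::string& s, std::size_t n) {
    assert(n > 0);
    // Count how often each n-gram occurs.
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i + n <= s.size(); ++i) {
        ++counts[s.substr(i, n)];
    }
    std::vector<std::string> features;
    for (const auto& entry : counts) {
        features.push_back(entry.first);  // first occurrence, no suffix
        for (int k = 2; k <= entry.second; ++k) {
            std::string suffixed(entry.first);
            suffixed.append(static_cast<std::size_t>(k - 1), ' ');
            features.push_back(suffixed);
        }
    }
    return features;
}
```

For "assesses" with n = 3 this yields six features: "ass", "ess", "ses", "ses ", "sse" and "sse ". Two of them have size 4, which is the count the Cython build reports with the original numeric suffixes.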