Filtering words composed of more than 1 token
Hi, thanks for the great work.
I see that you are filtering out words that are composed of more than one token: https://github.com/uber-research/PPLM/blob/5f27e191798b832b51cfc9a83697afd83dc4832c/run_pplm.py#L390, which filters out quite a few words (including all terms that consist of more than one word).
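For reference, here is a small, runnable illustration of that filtering (a sketch, assuming a Hugging Face GPT-2 tokenizer and the transformers version used by the repo, where encode accepts add_prefix_space; the word list is just a hypothetical BoW, not taken from the repo):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

words = ["legal", "lawyer", "machine gun"]  # hypothetical BoW entries
encoded = [tokenizer.encode(word.strip(),
                            add_prefix_space=True,
                            add_special_tokens=False)
           for word in words]
# Only entries that encode to a single token id survive the filtering described
# above, so any phrase (and any rarer word that the BPE splits) is dropped.
kept = [ids for ids in encoded if len(ids) <= 1]
print(encoded)
print(kept)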
Do you have any idea how to deal with this when we want to use these multi-token words?
Cheers.
I think one option would be to compute the probability of multiple tokens being generated and use that in the same way the single-token probability is used.
Say a word splits into two tokens s1, s2: instead of p(w|x) in Equation 5, you could potentially use p(s1|x) * p(s2|s1, x), and I suspect everything else should work as is.
I haven't tested this; if you have any luck with it, let us know. Alternatively, I plan on testing it at some point soon and can report back (and will update the code accordingly).
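A minimal sketch of that idea (untested; it assumes a recent Hugging Face transformers version with GPT-2, and multi_token_word_prob is just an illustrative name): score each token of the word in turn, conditioning on the tokens already scored, and multiply the probabilities.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def multi_token_word_prob(context, word):
    # p(s1|x) * p(s2|s1, x) * ... for a word that encodes to several tokens
    input_ids = torch.tensor([tokenizer.encode(context)])
    word_ids = tokenizer.encode(" " + word)  # leading space, as for the BoW words
    prob = 1.0
    for token_id in word_ids:
        with torch.no_grad():
            logits = model(input_ids).logits
        next_probs = torch.softmax(logits[0, -1], dim=-1)
        prob *= next_probs[token_id].item()
        # condition on the token just scored before scoring the next one
        input_ids = torch.cat([input_ids, torch.tensor([[token_id]])], dim=-1)
    return prob

print(multi_token_word_prob("The soldiers ran out of", "ammunition"))

This only computes the probability of the whole word; how best to plug it into the BoW loss in run_pplm.py is still the open question discussed above.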
bow_indices.append(
    [tokenizer.encode(word.strip(),
                      add_prefix_space=True,
                      add_special_tokens=False)
     for word in words])
I tried to run this code, and all the words came out composed of more than one token.
I think this is because of add_prefix_space=True.
Did I do something wrong?
Hi, any update on this?
Hi, is there any implementation for phrases (or words composed of more than one token)?
@monkdou0
Setting add_prefix_space to True will not split a word into more token ids.
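A quick way to check this (a sketch, assuming a Hugging Face GPT-2 tokenizer) is to compare the encoding of a word with and without a leading space; for ordinary vocabulary words the space-prefixed form ("Ġword") is typically still a single id:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for word in ["legal", "military"]:
    plain = tokenizer.encode(word, add_special_tokens=False)
    # prepending a space has the same effect as add_prefix_space=True
    prefixed = tokenizer.encode(" " + word, add_special_tokens=False)
    print(word, plain, prefixed, tokenizer.convert_ids_to_tokens(prefixed))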