Filtering words composed of more than 1 token
Hi, thanks for the great work.
I see that you are filtering out words that are composed of more than one token: https://github.com/uber-research/PPLM/blob/5f27e191798b832b51cfc9a83697afd83dc4832c/run_pplm.py#L390, which filters out quite a few words (including all terms that consist of more than one word).
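For reference, here is a small, runnable illustration of that filtering (a sketch, assuming a Hugging Face GPT-2 tokenizer and the transformers version used by the repo, where encode accepts add_prefix_space; the word list is just a hypothetical BoW, not taken from the repo):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

words = ["legal", "lawyer", "machine gun"]  # hypothetical BoW entries
encoded = [tokenizer.encode(word.strip(),
                            add_prefix_space=True,
                            add_special_tokens=False)
           for word in words]
# Only entries that encode to a single token id survive the filtering described
# above, so any phrase (and any rarer word that the BPE splits) is dropped.
kept = [ids for ids in encoded if len(ids) <= 1]
print(encoded)
print(kept)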
Do you have any idea how to deal with this when we want to use these multi-token words?
Cheers.
I think one option would be to compute the probability of multiple tokens being generated and use that in the same way the single-token probability is used.
Say a word splits into two tokens s1, s2: instead of p(w|x) in Equation 5, you could potentially use p(s1|x) * p(s2|s1, x), and I suspect everything else should work as is.
I haven't tested this; if you have any luck with it, let us know. Alternatively, I plan on testing it at some point soon and can report back (and will update the code accordingly).
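A minimal sketch of that idea (untested; it assumes a recent Hugging Face transformers version with GPT-2, and multi_token_word_prob is just an illustrative name): score each token of the word in turn, conditioning on the tokens already scored, and multiply the probabilities.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def multi_token_word_prob(context, word):
    # p(s1|x) * p(s2|s1, x) * ... for a word that encodes to several tokens
    input_ids = torch.tensor([tokenizer.encode(context)])
    word_ids = tokenizer.encode(" " + word)  # leading space, as for the BoW words
    prob = 1.0
    for token_id in word_ids:
        with torch.no_grad():
            logits = model(input_ids).logits
        next_probs = torch.softmax(logits[0, -1], dim=-1)
        prob *= next_probs[token_id].item()
        # condition on the token just scored before scoring the next one
        input_ids = torch.cat([input_ids, torch.tensor([[token_id]])], dim=-1)
    return prob

print(multi_token_word_prob("The soldiers ran out of", "ammunition"))

This only computes the probability of the whole word; how best to plug it into the BoW loss in run_pplm.py is still the open question discussed above.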
bow_indices.append(
    [tokenizer.encode(word.strip(),
                      add_prefix_space=True,
                      add_special_tokens=False)
     for word in words])
I tried to run this code, and all the words came out composed of more than one token.
I think this is because of add_prefix_space=True.
Did I do something wrong?
Hi, any update on this?
Hi, is there any implementation for phrases (or words composed of more than one token)?
@monkdou0
Setting add_prefix_space to True will not split a word into more token ids.
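A quick way to check this (a sketch, assuming a Hugging Face GPT-2 tokenizer) is to compare the encoding of a word with and without a leading space; for ordinary vocabulary words the space-prefixed form ("Ġword") is typically still a single id:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for word in ["legal", "military"]:
    plain = tokenizer.encode(word, add_special_tokens=False)
    # prepending a space has the same effect as add_prefix_space=True
    prefixed = tokenizer.encode(" " + word, add_special_tokens=False)
    print(word, plain, prefixed, tokenizer.convert_ids_to_tokens(prefixed))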