Filtering words composed of more than 1 token

Open mataney opened this issue 6 years ago • 5 comments

Hi, thanks for the great work.

I see that you are filtering out words that are composed of more than one token: https://github.com/uber-research/PPLM/blob/5f27e191798b832b51cfc9a83697afd83dc4832c/run_pplm.py#L390. This filters out quite a few words, including all terms that consist of more than one word.

Do you have any idea how to deal with this when we want to use these multi-token words?

Cheers.

mataney avatar Dec 09 '19 09:12 mataney

I think one option would be to compute the probability of the full multi-token sequence being generated and use it the same way the single-token probability is used.

Let's say there is a word w that splits into two tokens s1, s2. Instead of p(w|x) in Equation 5, you could replace it with p(s1|x) * p(s2|s1, x), and I suspect everything else should work as is.

I haven't tested this; if you have any luck with it, let us know. Alternatively, I plan to test it at some point soon and can report back (I will update the code accordingly).
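
A rough, untested sketch of what I mean, using the Hugging Face transformers API (the helper name and the example are placeholders, not code from this repo):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def multi_token_word_log_prob(prefix, word):
        """Chain rule: log p(s1|x) + log p(s2|s1, x) + ... for a word
        that the BPE tokenizer splits into tokens s1, s2, ..."""
        prefix_ids = tokenizer.encode(prefix)
        # Leading space so the word is encoded as a continuation,
        # not as a sentence start.
        word_ids = tokenizer.encode(" " + word)
        input_ids = torch.tensor([prefix_ids + word_ids])
        with torch.no_grad():
            logits = model(input_ids).logits  # (1, seq_len, vocab_size)
        log_probs = torch.log_softmax(logits, dim=-1)
        total = 0.0
        for i, token_id in enumerate(word_ids):
            # Logits at position len(prefix_ids) + i - 1 predict the
            # token at position len(prefix_ids) + i.
            total += log_probs[0, len(prefix_ids) + i - 1, token_id].item()
        return total

    # e.g. multi_token_word_log_prob("The food here is", "delicious")

In the BoW loss, this product over a word's token sequence would stand in for the single-token probability lookup.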

dathath avatar Dec 10 '19 00:12 dathath

    bow_indices.append(
        [tokenizer.encode(word.strip(),
                          add_prefix_space=True,
                          add_special_tokens=False)
         for word in words])

I tried to run this code, and all words came out composed of more than one token.
I think this is because of add_prefix_space=True. Did I do something wrong?
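
A quick way to check what the tokenizer does (the example words are arbitrary):

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    for word in ["happy", "wonderful", "machine learning"]:
        ids = tokenizer.encode(word.strip(),
                               add_prefix_space=True,
                               add_special_tokens=False)
        # Print the ids and the BPE tokens to see whether the word
        # maps to one token or several.
        print(word, ids, tokenizer.convert_ids_to_tokens(ids))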

monkdou0 avatar May 06 '21 13:05 monkdou0

Hi, any update on this?

vaibhavvarshney0 avatar Jun 14 '21 16:06 vaibhavvarshney0

Hi, is there any implementation for phrases (or words composed of more than one token)?

janleemark avatar Oct 28 '21 13:10 janleemark

@monkdou0

Setting add_prefix_space to True will not split a word into more token ids.

yanan1116 avatar Jan 11 '22 20:01 yanan1116