
nltk tokenize doesn't work?

Open lihongqiang opened this issue 6 years ago • 1 comments

Dear Team,

The code below doesn't seem to work: the context is not split into sentences.

    if args.tokenizer == "PTB":
        import nltk
        sent_tokenize = nltk.sent_tokenize
        def word_tokenize(tokens):
            return [token.replace("''", '"').replace("``", '"') for token in nltk.word_tokenize(tokens)]

I checked shared_dev.json and got this, with several sentences in a single token list:

    "x": [ [ [ [ "The", "income", "tax", "withholding", "rate", "remains", "at", "4.25", "%", "for", "tax", "year", "2015", ".",
                 "However", ",", "the", "personal", "exemption", "amount", "for", "tax", "year", "2015", "will", "change", "to", "$", "4,000", ".",
                 "You", "may", "continue", "to", "use", "2014", "Michigan", "Income", "Tax", "Withholding", "Tables", "." ],

But if I change the code as follows, it works.

    import nltk.tokenize as nltk

    def prepro_each(args, data_type, start_ratio=0.0, stop_ratio=1.0, out_name="default", in_path=None):
        if args.tokenizer == "PTB":
            # sent_tokenize = nltk.sent_tokenize
            def word_tokenize(tokens):
                return [token.replace("''", '"').replace("``", '"') for token in nltk.word_tokenize(tokens)]
        ......
        xi = list(map(word_tokenize, nltk.sent_tokenize(context)))

I changed the code and ran it again, but got slightly lower EM and F1. I am puzzled by this. Could you please help me solve the problem?
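To make the difference concrete, here is a minimal sketch of the two shapes `xi` can take. It does not use nltk: `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical regex stand-ins for `nltk.sent_tokenize` and `nltk.word_tokenize` (without the quote replacement), and the `split` parameter is an assumption made only for illustration, not something taken from the repository.

```python
import re

def simple_sent_tokenize(text):
    # Hypothetical stand-in for nltk.sent_tokenize:
    # split after ., ! or ? when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def simple_word_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize:
    # words/numbers (keeping "4.25" and "4,000" whole) or single punctuation marks.
    return re.findall(r"\w+(?:[.,]\d+)*|[^\w\s]", text)

def tokenize_context(context, split=True):
    # split=True  -> one token list per sentence
    # split=False -> the whole context treated as a single "sentence"
    sentences = simple_sent_tokenize(context) if split else [context]
    return list(map(simple_word_tokenize, sentences))

context = ("The rate remains at 4.25 % for 2015. "
           "However, the exemption will change to $ 4,000. "
           "You may continue.")

per_sentence = tokenize_context(context, split=True)
whole_context = tokenize_context(context, split=False)

print(len(per_sentence))   # 3 token lists, one per sentence
print(len(whole_context))  # 1 token list covering the whole context
```

The tokens themselves are the same in both cases; only the nesting differs, which matches the single long list seen in shared_dev.json.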

lihongqiang, Jul 27 '17 02:07

Same here! So BiDAF may treat the whole document as a single sentence. Not sure why this works well... my guess is that it implicitly learns to use "." as a sentence separator.

Chia-Hsuan-Lee, Apr 22 '18 11:04