
tokenization issues for non-ASCII texts

Open · mheilman opened this issue 10 years ago • 11 comments

The NLTK tokenizer used in the code doesn't handle fancy quotation marks very well. They just end up attached to words rather than being separate tokens.

We should probably either preprocess the input that is passed to the tokenizer, find another tokenizer, or fix the current one.

There may be some issues related to other types of symbols as well.

mheilman avatar Aug 11 '14 15:08 mheilman
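
For illustration, a minimal reproduction sketch of the quote problem (exact behavior depends on the NLTK version; recent releases normalize curly quotes themselves):

import nltk

text = "\u201cHello,\u201d she said."
print(nltk.word_tokenize(text))
# On affected versions, the curly quotes stay glued to the adjacent
# words (e.g. '\u201cHello') instead of coming off as separate tokens.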

And this is why we don't use that internally...

dan-blanchard avatar Aug 11 '14 15:08 dan-blanchard

indeed.

mheilman avatar Aug 11 '14 15:08 mheilman

👍 Is the issue just the quotation marks or all non-ASCII characters?

dmnapolitano avatar Aug 11 '14 15:08 dmnapolitano

probably a lot of them

mheilman avatar Aug 11 '14 15:08 mheilman

I guess another option would be to use unidecode...

mheilman avatar Aug 11 '14 15:08 mheilman

> I guess another option would be to use unidecode...

That's probably unnecessary. There aren't many characters other than quotes, dashes, and ellipses that end up next to words and shouldn't stay attached.

We could probably just do the replacements with a list of pairs like this:

_NON_ASCII = [("\u2026", "..."),  # horizontal ellipsis
              ("\u2018", "`"),    # left single quotation mark
              ("\u2019", "'"),    # right single quotation mark
              ("\u201c", "``"),   # left double quotation mark
              ("\u201d", "''"),   # right double quotation mark
              ("\u2013", "-"),    # en dash
              ("\u00a0", " ")]    # no-break space

dan-blanchard avatar Aug 11 '14 15:08 dan-blanchard
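
For concreteness, a sketch of how that table might be wired in ahead of tokenization; the normalize helper below is hypothetical, not something in the codebase, and it assumes the _NON_ASCII list from the comment above:

import nltk

def normalize(text, replacements=_NON_ASCII):
    # Swap each troublesome non-ASCII character for its Treebank-style
    # ASCII equivalent before handing the text to the tokenizer.
    for char, replacement in replacements:
        text = text.replace(char, replacement)
    return text

tokens = nltk.word_tokenize(normalize("\u201cHello\u201d \u2013 world\u2026"))
# -> roughly ['``', 'Hello', "''", '-', 'world', '...']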

Yeah, that's what I was thinking. If it's choking on "김수진", however, then yeah. 😕

dmnapolitano avatar Aug 11 '14 15:08 dmnapolitano

Well, it's not that it's choking on things, it's just not splitting quotes off.

Handling non-English/ASCII characters is a different issue, since it's not very well defined what the tokenizer should do in those cases.

dan-blanchard avatar Aug 11 '14 15:08 dan-blanchard

It's also relevant that if we ran it through unidecode, it would turn Chinese characters into Pinyin, which may yield English words (e.g., "fan") that could throw off parsing features.

dan-blanchard avatar Aug 11 '14 15:08 dan-blanchard
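
For example (assuming the third-party unidecode package; exact romanizations may vary by version):

from unidecode import unidecode

print(unidecode("\u201cHello\u201d"))  # '"Hello"' -- curly quotes become plain ASCII
print(unidecode("北京"))               # roughly 'Bei Jing ' -- characters come out as Pinyin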

Hmm, I think the simpler replacement-list approach sounds good, but that list above is missing a few things (http://en.wikipedia.org/wiki/Apostrophe).

mheilman avatar Aug 11 '14 16:08 mheilman
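
For reference, a few of the variants from that page and its neighbors that the table above doesn't cover (illustrative, not exhaustive):

_MORE_NON_ASCII = [("\u02bc", "'"),   # modifier letter apostrophe
                   ("\u201b", "`"),   # single high-reversed-9 quotation mark
                   ("\u2032", "'"),   # prime, often misused as an apostrophe
                   ("\u2014", "--"),  # em dash
                   ("\u00ad", "")]    # soft hyphen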

@desilinguist here is an issue related to the tokenizer, but not exactly what we thought.

The NLTK tokenizer does not find sentence boundaries correctly. For example, one of the EDUs output by this parser looks like this: ['or', 'maybe', 'a', 'guy', 'never', 'ask', 'a', 'her', 'out.in', 'case', 'of', 'a', 'guy', 'probably', 'the', 'same', 'comments']. A new sentence should start at "in case", but the boundary was not detected, so "out.in" was kept as a single token.

It would be better if we could pass pre-tokenized input to rst_parse so that it does not do the tokenization itself.

ghost avatar Sep 03 '20 16:09 ghost
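
A minimal sketch of that failure mode (NLTK's Punkt sentence splitter generally requires whitespace after a period, so "out.in" is never treated as a boundary):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "or maybe a guy never asks her out.in case of a guy probably the same comments"
print(sent_tokenize(text))  # one "sentence": no whitespace after the period, so no split
print(word_tokenize(text))  # 'out.in' survives as a single token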