
tokenization issues for non-ASCII texts

Open · mheilman opened this issue 10 years ago • 11 comments

The NLTK tokenizer used in the code doesn't handle fancy quotation marks very well. They just end up attached to words rather than being separate tokens.

We should probably either preprocess the input that is passed to the tokenizer, find another tokenizer, or fix the current one.

There may be some issues related to other types of symbols as well.

mheilman avatar Aug 11 '14 15:08 mheilman
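
For illustration, a minimal reproduction sketch of the quote problem (exact behavior depends on the NLTK version; recent releases normalize curly quotes themselves):

import nltk

text = "\u201cHello,\u201d she said."
print(nltk.word_tokenize(text))
# On affected versions, the curly quotes stay glued to the adjacent
# words (e.g. '\u201cHello') instead of coming off as separate tokens.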

And this is why we don't use that internally...

dan-blanchard avatar Aug 11 '14 15:08 dan-blanchard

indeed.

mheilman avatar Aug 11 '14 15:08 mheilman

👍 Is the issue just the quotation marks or all non-ASCII characters?

dmnapolitano avatar Aug 11 '14 15:08 dmnapolitano

probably a lot of them

mheilman avatar Aug 11 '14 15:08 mheilman

I guess another option would be to use unidecode...

mheilman avatar Aug 11 '14 15:08 mheilman

> I guess another option would be to use unidecode...

That's probably unnecessary. There aren't many characters other than quotes, dashes, and ellipses that end up next to words and shouldn't stay attached.

We could probably just do the replacements with a list of pairs like this:

_NON_ASCII = [("\u2026", "..."),  # horizontal ellipsis
              ("\u2018", "`"),    # left single quotation mark
              ("\u2019", "'"),    # right single quotation mark
              ("\u201c", "``"),   # left double quotation mark
              ("\u201d", "''"),   # right double quotation mark
              ("\u2013", "-"),    # en dash
              ("\u00a0", " ")]    # no-break space

dan-blanchard avatar Aug 11 '14 15:08 dan-blanchard
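
For concreteness, a sketch of how that table might be wired in ahead of tokenization; the normalize helper below is hypothetical, not something in the codebase, and it assumes the _NON_ASCII list from the comment above:

import nltk

def normalize(text, replacements=_NON_ASCII):
    # Swap each troublesome non-ASCII character for its Treebank-style
    # ASCII equivalent before handing the text to the tokenizer.
    for char, replacement in replacements:
        text = text.replace(char, replacement)
    return text

tokens = nltk.word_tokenize(normalize("\u201cHello\u201d \u2013 world\u2026"))
# -> roughly ['``', 'Hello', "''", '-', 'world', '...']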

Yeah, that's what I was thinking. If it's choking on "김수진", however, then yeah. 😕

dmnapolitano avatar Aug 11 '14 15:08 dmnapolitano

Well, it's not that it's choking on things, it's just not splitting quotes off.

Handling non-English/ASCII characters is a different issue, since it's not very well defined what the tokenizer should do in those cases.

dan-blanchard avatar Aug 11 '14 15:08 dan-blanchard

It's also relevant that if we ran it through unidecode, it would turn Chinese characters into Pinyin, which may yield English words (e.g., "fan") that could throw off parsing features.

dan-blanchard avatar Aug 11 '14 15:08 dan-blanchard
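
For example (assuming the third-party unidecode package; exact romanizations may vary by version):

from unidecode import unidecode

print(unidecode("\u201cHello\u201d"))  # '"Hello"' -- curly quotes become plain ASCII
print(unidecode("北京"))               # roughly 'Bei Jing ' -- characters come out as Pinyin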

Hmm, I think the simpler replacement-list approach sounds good, but that list above is missing a few things (http://en.wikipedia.org/wiki/Apostrophe).

mheilman avatar Aug 11 '14 16:08 mheilman
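
For reference, a few of the variants from that page and its neighbors that the table above doesn't cover (illustrative, not exhaustive):

_MORE_NON_ASCII = [("\u02bc", "'"),   # modifier letter apostrophe
                   ("\u201b", "`"),   # single high-reversed-9 quotation mark
                   ("\u2032", "'"),   # prime, often misused as an apostrophe
                   ("\u2014", "--"),  # em dash
                   ("\u00ad", "")]    # soft hyphen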

@desilinguist here is an issue related to the tokenizer, but not exactly what we thought.

The NLTK tokenizer does not find sentence boundaries correctly. For example, one of the EDUs output by this parser looks like this: ['or', 'maybe', 'a', 'guy', 'never', 'ask', 'a', 'her', 'out.in', 'case', 'of', 'a', 'guy', 'probably', 'the', 'same', 'comments']. A new sentence should start at "in case", but the boundary was not detected, so "out.in" was kept as a single token.

It would be better if we could pass pre-tokenized input to rst_parse so that it does not do the tokenization itself.

ghost avatar Sep 03 '20 16:09 ghost
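
A minimal sketch of that failure mode (NLTK's Punkt sentence splitter generally requires whitespace after a period, so "out.in" is never treated as a boundary):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "or maybe a guy never asks her out.in case of a guy probably the same comments"
print(sent_tokenize(text))  # one "sentence": no whitespace after the period, so no split
print(word_tokenize(text))  # 'out.in' survives as a single token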