
[RFC]: Improvements to @stdlib/nlp-expand-contractions

Open titanism opened this issue 2 years ago • 8 comments

Description

We're writing as we found your library to be the most tested and fastest for expanding contractions. For context, we're working on https://spamscanner.net and expanding contractions before passing to tokenizers for spam classification.

To clarify, this is with regard to the generated codebase https://github.com/stdlib-js/nlp-expand-contractions, which is built from the source at https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/nlp/expand-contractions.

We noticed that your library is missing quite a few English contractions, and it could also benefit from contractions from other languages (perhaps with an option).

While we can open a PR, we wanted to check what your thoughts are on this and what you would want the PR to look like (integration-wise; e.g., new options?).

Here is our current compiled list of research and findings:

  • List of contractions research
    • https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions
      • https://en.wiktionary.org/wiki/Category:English_double_contractions
      • https://gist.github.com/loretoparisi/c221a9c55fb71a23ff4e7bba3b794425?permalink_comment_id=4198425
    • https://www.enchantedlearning.com/grammar/contractions/
    • https://github.com/NaturalNode/natural/issues/533
    • https://github.com/anton-bot/expand-contractions/pull/1
    • https://github.com/kootenpv/contractions/blob/master/contractions/data/leftovers_dict.json
    • https://github.com/kootenpv/contractions/blob/master/contractions/data/slang_dict.json
    • https://github.com/stdlib-js/nlp-expand-contractions/blob/main/lib/contractions.json
    • https://github.com/stdlib-js/nlp-expand-contractions/blob/main/lib/expand_contractions.js
    • https://github.com/textlint-rule/textlint-rule-preset-google/blob/master/packages/textlint-rule-google-contractions/src/textlint-rule-google-contractions.js#L67-L87
    • https://web.library.yale.edu/cataloging/months
    • https://www.wikidata.org/w/index.php?search=%2B%22Category%3A%22+%2B%22contractions%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns120=1
      • There are so many other lists that we can scrape:
        • For example, French contractions: https://en.wiktionary.org/wiki/Category:French_contractions
  • Here are other contractions we think should be included (that were not found elsewhere)
    • 'twas -> it was
    • 'tisn't -> it is not
    • ma'am -> madam
    • mightn't've -> might not have
    • mustn't've -> must not have
    • ne'er-do-well -> never do well
    • o' -> of
    • o'clock -> of the clock
    • she'd've -> she would have
    • shouldn't've -> should not have
    • wouldn't've -> would not have
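
As a rough illustration of how entries like the ones above could be applied, here is a minimal sketch in plain JavaScript. The `EXTRA_CONTRACTIONS` table and the `expandExtra` function are hypothetical names made up for this example; they are not part of `@stdlib/nlp-expand-contractions`. The sketch assumes tokens are separated by single spaces, and it does not handle punctuation or preserve the letter case of the input.

```javascript
// Hypothetical table holding a few of the proposed entries (illustration only;
// this is not the contents of the package's contractions.json).
var EXTRA_CONTRACTIONS = {
    "'twas": 'it was',
    "'tisn't": 'it is not',
    "ma'am": 'madam',
    "o'clock": 'of the clock',
    "she'd've": 'she would have',
    "shouldn't've": 'should not have'
};

// Replace each space-separated token that has an entry in the table.
// Assumption: lookups are case-insensitive and expansions are lowercase.
function expandExtra( str ) {
    return str.split( ' ' ).map( function onToken( token ) {
        var key = token.toLowerCase();
        if ( Object.prototype.hasOwnProperty.call( EXTRA_CONTRACTIONS, key ) ) {
            return EXTRA_CONTRACTIONS[ key ];
        }
        return token;
    }).join( ' ' );
}

console.log( expandExtra( "'Twas nearly five o'clock" ) );
```

A table lookup keeps the data separate from the matching logic, so new entries would only touch the JSON-like table.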

Related Issues

No response

Questions

No response

Other

No response

Checklist

  • [X] I have read and understood the Code of Conduct.
  • [X] Searched for existing issues and pull requests.
  • [X] The issue name begins with RFC:.

titanism avatar Jun 13 '22 05:06 titanism

:tada: Welcome! :tada:

And thank you for opening your first issue! We will get back to you shortly. :runner: :dash:

github-actions[bot] avatar Jun 13 '22 05:06 github-actions[bot]

Doing a review and will submit a PR to contractions.json with changes.

Caught some interesting bugs like "what's": "what has/is", in the JSON (which is obviously a bug).
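
Entries like this could probably be flagged mechanically. Below is a minimal sketch, assuming that a `/` inside an expansion marks an ambiguous entry; `findAmbiguous` and the `sample` object are hypothetical, and only the `"what's"` entry is taken from the actual JSON.

```javascript
// Hypothetical sample data; only the "what's" entry is quoted from the JSON,
// the other entry is made up for contrast.
var sample = {
    "what's": 'what has/is',
    "can't": 'cannot'
};

// Return the contractions whose expansion contains a slash, that is,
// entries that list more than one possible expansion.
function findAmbiguous( table ) {
    return Object.keys( table ).filter( function hasSlash( key ) {
        return table[ key ].indexOf( '/' ) !== -1;
    });
}

console.log( findAmbiguous( sample ) );
```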

The other question I wanted to raise is that we should probably handle ' and ’ interchangeably somehow.
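
One possible way to treat the straight and curly apostrophe forms interchangeably is to normalize the input before any lookup. This is only a sketch: `normalizeApostrophes` is a hypothetical helper, and the choice of code points to normalize (U+2018 and U+2019) is an assumption.

```javascript
// Assumption: map U+2018 (left single quotation mark) and U+2019 (right single
// quotation mark) to the ASCII apostrophe U+0027 before looking anything up.
function normalizeApostrophes( str ) {
    return str.replace( /[\u2018\u2019]/g, '\'' );
}

// The curly form becomes identical to the straight form after normalization.
console.log( normalizeApostrophes( 'don\u2019t' ) === 'don\'t' );
```

Normalizing once up front would avoid duplicating every entry in the contractions table for each apostrophe variant.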

titanism avatar Jun 13 '22 05:06 titanism

Re: missing contractions. Some of the entries in your list are already present in the contractions file. E.g., wouldn't've, mightn't've.

kgryte avatar Jun 13 '22 05:06 kgryte

@Planeshifter Is there a reason for the what has/is entry?

kgryte avatar Jun 13 '22 06:06 kgryte

Re: fancy apostrophe. That should be possible to handle in the @stdlib/nlp/tokenize package.

kgryte avatar Jun 13 '22 06:06 kgryte

I'm about to submit a PR, one moment @kgryte

titanism avatar Jun 13 '22 06:06 titanism

See https://github.com/stdlib-js/stdlib/pull/497

cc @kgryte

titanism avatar Jun 13 '22 06:06 titanism

@titanism One recent update: @Planeshifter added initial support for expanding acronyms (see https://github.com/stdlib-js/stdlib/tree/c624a5eb4bca8f4f3d45e01bcc4eeee41652e3ba/lib/node_modules/%40stdlib/nlp/expand-acronyms). This may help to avoid mixing contraction/acronym concerns.

kgryte avatar Jun 18 '22 21:06 kgryte