stdlib [RFC]: Improvements to @stdlib/nlp-expand-contractions

Description

We're writing because we found your library to be the fastest and best-tested option for expanding contractions. For context, we're working on https://spamscanner.net, where we expand contractions before passing text to tokenizers for spam classification.

To clarify, this is with regard to the generated codebase https://github.com/stdlib-js/nlp-expand-contractions, built from the source at https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/nlp/expand-contractions.

We noticed that your library is missing quite a few English contractions, and it could also benefit from contractions in other languages (perhaps behind an option).

While we can open a PR, we wanted to check what your thoughts are on this and what you would want the PR to look like (integration-wise; e.g., new options?).

Here is our current compiled list of research and findings:

List of contractions research
- https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions
  - https://en.wiktionary.org/wiki/Category:English_double_contractions
  - https://gist.github.com/loretoparisi/c221a9c55fb71a23ff4e7bba3b794425?permalink_comment_id=4198425
- https://www.enchantedlearning.com/grammar/contractions/
- https://github.com/NaturalNode/natural/issues/533
- https://github.com/anton-bot/expand-contractions/pull/1
- https://github.com/kootenpv/contractions/blob/master/contractions/data/leftovers_dict.json
- https://github.com/kootenpv/contractions/blob/master/contractions/data/slang_dict.json
- https://github.com/stdlib-js/nlp-expand-contractions/blob/main/lib/contractions.json
- https://github.com/stdlib-js/nlp-expand-contractions/blob/main/lib/expand_contractions.js
- https://github.com/textlint-rule/textlint-rule-preset-google/blob/master/packages/textlint-rule-google-contractions/src/textlint-rule-google-contractions.js#L67-L87
- https://web.library.yale.edu/cataloging/months
- https://www.wikidata.org/w/index.php?search=%2B%22Category%3A%22+%2B%22contractions%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns120=1
  - There are so many other lists that we can scrape:
    - For example, French contractions: https://en.wiktionary.org/wiki/Category:French_contractions

Here are other contractions we think should be included (that were not found elsewhere):
- 'twas -> it was
- 'tisn't -> it is not
- ma'am -> madam
- mightn't've -> might not have
- mustn't've -> must not have
- ne'er-do-well -> never do well
- o' -> of
- o'clock -> of the clock
- she'd've -> she would have
- shouldn't've -> should not have
- wouldn't've -> would not have
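
To illustrate how entries like these might be applied, here is a minimal sketch in plain JavaScript. It does not use or reflect the actual implementation in @stdlib/nlp/expand-contractions; the `EXTRA` table and the `expandExtras` function are hypothetical names introduced only for this example. Longer keys are tried first so that `o'clock` wins over `o'`, and the lookarounds keep `o'` from matching inside names such as `O'Brien`.

```javascript
'use strict';

// Hypothetical lookup table built from the list above (not the library's data file).
const EXTRA = {
	"'twas": 'it was',
	"'tisn't": 'it is not',
	"ma'am": 'madam',
	"mightn't've": 'might not have',
	"mustn't've": 'must not have',
	"ne'er-do-well": 'never do well',
	"o'": 'of',
	"o'clock": 'of the clock',
	"she'd've": 'she would have',
	"shouldn't've": 'should not have',
	"wouldn't've": 'would not have'
};

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp( str ) {
	return str.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );
}

// Build one alternation, longest keys first, so "o'clock" is tried before "o'".
const pattern = Object.keys( EXTRA )
	.sort( ( a, b ) => b.length - a.length )
	.map( escapeRegExp )
	.join( '|' );

// Require that the match is not glued to a letter, digit, or apostrophe on either side.
const RE = new RegExp( "(?<![A-Za-z0-9'])(?:" + pattern + ")(?![A-Za-z0-9'])", 'gi' );

// Hypothetical helper: replace every known contraction with its expansion.
function expandExtras( text ) {
	return text.replace( RE, ( match ) => EXTRA[ match.toLowerCase() ] );
}

console.log( expandExtras( "'Twas nine o'clock, ma'am." ) );
```

Note that this sketch always emits the lowercase expansion, so capitalization of the original token is not preserved; a real integration would need to decide how to handle casing.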

Related Issues

No response

Questions

No response

Other

No response

Checklist

[X] I have read and understood the Code of Conduct.
[X] Searched for existing issues and pull requests.
[X] The issue name begins with RFC:.

Jun 13 '22 05:06 titanism

:tada: Welcome! :tada:

And thank you for opening your first issue! We will get back to you shortly. :runner: :dash:

Jun 13 '22 05:06 github-actions[bot]

Doing a review and will submit a PR to contractions.json with changes.

Caught some interesting bugs in the JSON, such as the entry "what's": "what has/is", which is obviously a bug.

The other question I wanted to raise is whether we should handle ', ‘, and ’ interchangeably somehow.
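
One possible approach, sketched here as an assumption rather than as anything the library currently does, is to normalize typographic apostrophes to the ASCII apostrophe before looking up contractions. The helper name `normalizeApostrophes` is hypothetical.

```javascript
'use strict';

// Hypothetical helper: map the left (U+2018) and right (U+2019) single
// quotation marks to the ASCII apostrophe (U+0027).
function normalizeApostrophes( text ) {
	return text.replace( /[\u2018\u2019]/g, "'" );
}

console.log( normalizeApostrophes( 'don\u2019t' ) === "don't" );
```

Because ‘ and ’ also serve as single quotation marks, this step turns curly quotes around a phrase into straight quotes as well, which is usually harmless when the output only feeds a tokenizer.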

Jun 13 '22 05:06 titanism

Re: missing contractions. Some of the entries in your list are already present in the contractions file. E.g., wouldn't've, mightn't've.

Jun 13 '22 05:06 kgryte

@Planeshifter Is there a reason for the what has/is entry?

Jun 13 '22 06:06 kgryte

Re: fancy apostrophe. That should be possible to handle in the @stdlib/nlp/tokenize package.

Jun 13 '22 06:06 kgryte

I'm about to submit a PR, one moment @kgryte

Jun 13 '22 06:06 titanism

See https://github.com/stdlib-js/stdlib/pull/497

cc @kgryte

Jun 13 '22 06:06 titanism

@titanism One recent update: @Planeshifter added initial support for expanding acronyms (see https://github.com/stdlib-js/stdlib/tree/c624a5eb4bca8f4f3d45e01bcc4eeee41652e3ba/lib/node_modules/%40stdlib/nlp/expand-acronyms). This may help to avoid mixing contraction/acronym concerns.

Jun 18 '22 21:06 kgryte
