[RFC]: Improvements to @stdlib/nlp-expand-contractions
Description
We're writing because we found your library to be the best-tested and fastest option for expanding contractions. For context, we're working on https://spamscanner.net and expand contractions before passing text to tokenizers for spam classification.
To clarify, this is with regard to the generated codebase https://github.com/stdlib-js/nlp-expand-contractions from the source at https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/nlp/expand-contractions.
We noticed that your library is missing quite a few English contractions, and it could also benefit from contractions in other languages (perhaps behind an option).
While we can open a PR, we wanted to check first what your thoughts are on this and what you would want the PR to look like (integration-wise; e.g., new options?).
Here is our current compiled list of research and findings:
- List of contractions research
- https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions
- https://en.wiktionary.org/wiki/Category:English_double_contractions
- https://gist.github.com/loretoparisi/c221a9c55fb71a23ff4e7bba3b794425?permalink_comment_id=4198425
- https://www.enchantedlearning.com/grammar/contractions/
- https://github.com/NaturalNode/natural/issues/533
- https://github.com/anton-bot/expand-contractions/pull/1
- https://github.com/kootenpv/contractions/blob/master/contractions/data/leftovers_dict.json
- https://github.com/kootenpv/contractions/blob/master/contractions/data/slang_dict.json
- https://github.com/stdlib-js/nlp-expand-contractions/blob/main/lib/contractions.json
- https://github.com/stdlib-js/nlp-expand-contractions/blob/main/lib/expand_contractions.js
- https://github.com/textlint-rule/textlint-rule-preset-google/blob/master/packages/textlint-rule-google-contractions/src/textlint-rule-google-contractions.js#L67-L87
- https://web.library.yale.edu/cataloging/months
- https://www.wikidata.org/w/index.php?search=%2B%22Category%3A%22+%2B%22contractions%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns120=1
- There are so many other lists that we can scrape:
- For example, French contractions: https://en.wiktionary.org/wiki/Category:French_contractions
- Here are other contractions we think should be included (that were not found elsewhere):
- 'twas -> it was
- 'tisn't -> it is not
- ma'am -> madam
- mightn't've -> might not have
- mustn't've -> must not have
- ne'er-do-well -> never do well
- o' -> of
- o'clock -> of the clock
- she'd've -> she would have
- shouldn't've -> should not have
- wouldn't've -> would not have
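To illustrate how entries like the ones above could be applied, here is a minimal sketch of a table-driven expander. It is not the implementation in @stdlib/nlp/expand-contractions; the `CONTRACTIONS` table, the `TOKEN` pattern, and the `expand` function are hypothetical names used only for this example, and the table holds just a handful of the proposed entries.

```javascript
'use strict';

// Hypothetical subset of a contraction table (keys are lowercase).
// The entries are taken from the list above; the table name is an assumption.
const CONTRACTIONS = {
    "'twas": 'it was',
    "ma'am": 'madam',
    "o'clock": 'of the clock',
    "she'd've": 'she would have',
    "shouldn't've": 'should not have'
};

// Match runs of ASCII letters and straight apostrophes so that double
// contractions such as "she'd've" are looked up as a single token.
const TOKEN = /[A-Za-z']+/g;

// Replace each token found in the table, preserving a leading capital letter.
function expand( str ) {
    return str.replace( TOKEN, function onMatch( word ) {
        const key = word.toLowerCase();
        if ( !Object.prototype.hasOwnProperty.call( CONTRACTIONS, key ) ) {
            return word;
        }
        const out = CONTRACTIONS[ key ];
        // Skip a leading apostrophe (as in "'Twas") when checking the case.
        const first = ( word.charAt( 0 ) === "'" ) ? word.charAt( 1 ) : word.charAt( 0 );
        if ( first !== first.toLowerCase() ) {
            return out.charAt( 0 ).toUpperCase() + out.slice( 1 );
        }
        return out;
    });
}

console.log( expand( "She'd've left at five o'clock, ma'am." ) );
```

Matching whole tokens rather than suffixes such as 've keeps double contractions from being expanded piecemeal, at the cost of needing an explicit entry for each form.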
Related Issues
No response
Questions
No response
Other
No response
Checklist
- [X] I have read and understood the Code of Conduct.
- [X] Searched for existing issues and pull requests.
- [X] The issue name begins with RFC:.
:tada: Welcome! :tada:
And thank you for opening your first issue! We will get back to you shortly. :runner: :dash:
Doing a review and will submit a PR to contractions.json with changes.
Caught some interesting bugs like "what's": "what has/is", in the JSON (which is obviously a bug).
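If entries like this turn out to be unintended, a quick scan of the table can surface them. The sketch below assumes the file is a flat object mapping each contraction to an expansion string; the `table` excerpt and the `findAmbiguous` helper are hypothetical, and only the "what's" entry comes from the actual file as quoted above.

```javascript
'use strict';

// Hypothetical excerpt of a contraction table. Only the "what's" entry is
// quoted from the thread; the others are placeholders for illustration.
const table = {
    "what's": 'what has/is',
    "ma'am": 'madam',
    "o'clock": 'of the clock'
};

// Return the keys whose expansion contains a slash, which suggests that
// several alternatives were packed into a single value.
function findAmbiguous( obj ) {
    return Object.keys( obj ).filter( function hasSlash( key ) {
        return obj[ key ].indexOf( '/' ) !== -1;
    });
}

const ambiguous = findAmbiguous( table );
console.log( ambiguous );
```

A slash is only a heuristic here; whether such an entry is a bug or a deliberate placeholder for an ambiguous contraction is a question for the maintainers.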
The other question I wanted to raise is that we should probably handle ' and ‘ and ’ interchangeably somehow.
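One possible approach, sketched here as a preprocessing step rather than as anything the library currently does, is to normalize the curly variants to the straight apostrophe before looking up contractions. The `normalizeApostrophes` name and the `APOSTROPHES` pattern are hypothetical, and the pattern covers only the two characters mentioned above.

```javascript
'use strict';

// U+2018 (left single quotation mark) and U+2019 (right single quotation
// mark). Other apostrophe-like code points are deliberately left out.
const APOSTROPHES = /[\u2018\u2019]/g;

// Replace curly single quotes with the straight ASCII apostrophe so that
// table lookups keyed on "'" also match typographically quoted text.
function normalizeApostrophes( str ) {
    return str.replace( APOSTROPHES, "'" );
}

console.log( normalizeApostrophes( 'It\u2019s late, and \u2018twas cold.' ) );
```

Note that this also rewrites genuine single quotation marks, so it may be preferable to apply it only to the copy of the text used for matching.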
Re: missing contractions. Some of the entries in your list are already present in the contractions file. E.g., wouldn't've, mightn't've.
@Planeshifter Is there a reason for the what has/is entry?
Re: fancy apostrophe. That should be possible to handle in the @stdlib/nlp/tokenize package.
I'm about to submit a PR, one moment @kgryte
See https://github.com/stdlib-js/stdlib/pull/497
cc @kgryte
@titanism One recent update: @Planeshifter added initial support for expanding acronyms (see https://github.com/stdlib-js/stdlib/tree/c624a5eb4bca8f4f3d45e01bcc4eeee41652e3ba/lib/node_modules/%40stdlib/nlp/expand-acronyms). This may help to avoid mixing contraction/acronym concerns.