
Improve Ligurian tokenization

Open · jeanm opened this issue 1 year ago · 1 comment

Description

  • Corrected one spelling mistake in the example text.
  • Improved suffixes/prefixes:
    • support elision in years (e.g. 1990 -> ’90),
    • don't split the degree symbol when it's part of a degree unit (e.g. °C),
    • don't split left apostrophes for the few words that can have elision occur on the left (’na, ’n, ’n’).
  • Improved handling of special cases:
    • handle compound prepositions (e.g. a-a, co-i) in a way that doesn't break compatibility with how they're dealt with in Universal Dependencies (using NORM as described in https://github.com/explosion/spaCy/issues/1460),
    • handle cases such as °C, and generate the correct NORM forms for cases such as ’na.
  • Added tests for all of the above.
  • Added a few extra stop words, including variants with the curly quote / typographic apostrophe (commonly found in Ligurian corpora).
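
To make the prefix/suffix rules listed above concrete, here is a minimal standalone sketch using only Python's `re` module. It is not the code from this PR and does not use spaCy's tokenizer API; the `TOKEN_RE` pattern and the `tokenize` helper are hypothetical names introduced for illustration. It covers only the three prefix/suffix rules (elided years, the degree unit °C, and the left-elided forms ’na, ’n, ’n’), accepts both the straight and the typographic apostrophe, and leaves special cases and NORM handling out.

```python
import re

# Illustrative pattern only (an assumption, not the PR's implementation).
# Alternatives are tried in order, so the longer left-elided forms come first.
TOKEN_RE = re.compile(
    r"""
    (?<!\w)[’']\d{2}\b     # elided year such as ’90
  | °C\b                   # degree sign kept together with its unit
  | (?<!\w)[’']n[’']       # ’n’ (elision on both sides)
  | (?<!\w)[’']na\b        # ’na
  | (?<!\w)[’']n\b         # ’n
  | \w+                    # plain words and numbers
  | [^\w\s]                # any other single symbol or punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Split text into tokens according to the sketch rules above."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for sample in ["’90", "25°C", "90°", "’na", "’n’"]:
        print(sample, "->", tokenize(sample))
```

The lookbehind `(?<!\w)` restricts the left-apostrophe rules to the start of a word, so an apostrophe that follows a letter is still split off as a separate symbol.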

Types of change

Enhanced support for the Ligurian language.

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [x] I ran the tests, and all new and existing tests passed.
  • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

jeanm · Nov 25 '24 03:11

Bump – let me know if I can help in any way to get this merged 🙂

jeanm · Sep 26 '25 03:09