spaCy
Improve Ligurian tokenization
Description
- Corrected one spelling mistake in the example text.
- Improved suffixes/prefixes:
  - support elision in years (e.g. 1990 -> ’90),
  - don't split the degree symbol when it's part of a degree unit (e.g. °C),
  - don't split left apostrophes for the few words that can have elision occur on the left (’na, ’n, ’n’).
- Improved handling of special cases:
  - handle compound prepositions (e.g. a-a, co-i) in a way that doesn't break compatibility with how they're dealt with in Universal Dependencies (using `NORM` as described in https://github.com/explosion/spaCy/issues/1460),
  - handle cases such as °C, and generate the correct `NORM` forms for cases such as ’na.
- Added tests for all of the above.
- Added a few extra stop words, including variants with the curly quote / typographic apostrophe (commonly found in Ligurian corpora).
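
For reviewers who want a quick mental model of the suffix/prefix behaviour described above, here is a minimal standalone sketch in plain Python. It is not the code in this PR and does not use spaCy's tokenizer API: the function `split_affixes`, the constants, and the exact patterns are hypothetical. It also assumes that a number followed by a degree unit (e.g. 25°C) should come out as two tokens, with the unit kept whole.

```python
import re
from typing import List

# Hypothetical set: words whose leading apostrophe marks elision, so it stays attached.
KEEP_LEFT_APOSTROPHE = {"’na", "’n", "’n’"}
# Elided years such as ’90 (from 1990); assumed to be exactly two digits.
ELIDED_YEAR = re.compile(r"^[’']\d{2}$")
# Degree units such as °C; the unit letters here are an assumption.
DEGREE_UNIT = re.compile(r"°[CFK]$")


def split_affixes(chunk: str) -> List[str]:
    """Split one whitespace-delimited chunk into tokens (illustrative only)."""
    # Left-elided words and elided years are kept as a single token.
    if chunk.lower() in KEEP_LEFT_APOSTROPHE or ELIDED_YEAR.match(chunk):
        return [chunk]

    tokens: List[str] = []
    # Any other leading apostrophe is treated as an opening quote and split off.
    if len(chunk) > 1 and chunk[0] in "’'":
        tokens.append(chunk[0])
        chunk = chunk[1:]

    # Keep the degree symbol together with its unit letter.
    unit = DEGREE_UNIT.search(chunk)
    if unit and unit.start() > 0:
        return tokens + [chunk[: unit.start()], unit.group()]

    # A bare trailing degree symbol is still split off as a suffix.
    if len(chunk) > 1 and chunk.endswith("°"):
        return tokens + [chunk[:-1], "°"]

    return tokens + [chunk]


if __name__ == "__main__":
    for example in ["’90", "’na", "25°C", "45°", "’ciao"]:
        print(example, "->", split_affixes(example))
```

The actual changes express these rules through the language's prefix/suffix patterns and tokenizer exceptions, so the sketch only mirrors the intended outcomes, not the implementation.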
Types of change
Enhanced support for the Ligurian language.
Checklist
- [x] I confirm that I have the right to submit this contribution under the project's MIT license.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Bump – let me know if I can help in any way to get this merged 🙂