
add coordination ruler

Open india-kerle opened this pull request 1 year ago • 1 comment

Description

This PR adds two files:

  • spacy/pipeline/coordinationruler.py: This file contains 3 simple coordination-splitting rules and a coordination_splitter factory that lets users add this as a pipe, either with the default splitting rules or with their own.
  • spacy/tests/pipeline/test_coordinationruler.py: This file contains tests for each method of the CoordinationSplitter class.

It does NOT include any documentation, as that will be added once the PR is closer to final.

A few questions:

  • I've expanded the initial splitting rules very slightly to generalise to full sentences rather than just the original skill spans. Should I add additional generalisable splitting rules? There is also a very specific skill-splitting function, i.e. the skill token must be at the end of the phrase.
  • I made this a factory as opposed to a function component because I thought it would be nice for users to be able to add their own custom rules - thoughts?
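To make the factory question concrete, a minimal sketch of the factory approach might look like the following. This is illustrative only: the `splitting rules` plumbing and the `split_on_and` toy rule are assumptions, not the PR's actual signature; only the `coordination_splitter` factory name comes from the description above.

```python
# Hedged sketch of a factory-registered component that applies a list
# of user-extensible rules. split_on_and is a toy stand-in rule.
import spacy
from spacy.language import Language

def split_on_and(doc):
    # Toy rule: collect coordinating "and" tokens in the doc.
    return [t.text for t in doc if t.text == "and"]

@Language.factory("coordination_splitter")
def make_coordination_splitter(nlp, name):
    rules = [split_on_and]  # users could swap in their own rules here
    def component(doc):
        for rule in rules:
            rule(doc)
        return doc
    return component

nlp = spacy.blank("en")
nlp.add_pipe("coordination_splitter")
doc = nlp("green and red apples")
```

Because the component is registered via `Language.factory`, it can be added by name with `nlp.add_pipe`, which is what makes the custom-rules configuration possible in the first place.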

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [ ] I ran the tests, and all new and existing tests passed.
  • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

india-kerle avatar Feb 19 '24 12:02 india-kerle

Thanks! Really excited to have something like this in the library.

I've expanded the initial splitting rules very slightly to generalise to full sentences rather than just the original skill spans. Should I add additional generalisable splitting rules? There is also a very specific skill-splitting function, i.e. the skill token must be at the end of the phrase.

I think one construction that will be especially useful to people is coordination of modifiers in noun phrases. This could be coordination of adjectives, or nouns themselves. Section 2.2 of this thesis has a nice background on one type of construction that will be important to think about, compound nouns: https://www.researchgate.net/profile/Mark-Lauer-2/publication/2784243_Designing_Statistical_Language_Learners_Experiments_on_Noun_Compounds/links/53f9ccf60cf2e3cbf5604ec4/Designing-Statistical-Language-Learners-Experiments-on-Noun-Compounds.pdf

In general we'd like to detect and process stuff like "green and red apples" into "green apples" and "red apples". But we can have deeper nesting than that: stuff like "hot and cold chicken soup", which ends up as "hot chicken soup" and "cold chicken soup". Ultimately we're going to trust the tree structure in the parser (which isn't always fantastic on these things, due to limitations in the training data annotation) but we still want to have some concept of the range of tree shapes so we can make the test cases for them.
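A rough sketch of the "green and red apples" case, assuming a hand-built UD-style parse (the `split_coordinated_modifiers` helper is illustrative, not the PR's code):

```python
# Sketch: split coordinated adjectival modifiers in a noun phrase,
# using a Doc constructed with an explicit dependency parse.
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["green", "and", "red", "apples"]
heads = [3, 0, 0, 3]  # absolute head index per token
deps = ["amod", "cc", "conj", "ROOT"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

def split_coordinated_modifiers(doc):
    """Return one phrase per coordinated modifier, e.g.
    'green and red apples' -> ['green apples', 'red apples']."""
    phrases = []
    for token in doc:
        if token.dep_ == "amod":
            mods = [token] + [t for t in token.children if t.dep_ == "conj"]
            if len(mods) > 1:
                for mod in mods:
                    phrases.append(f"{mod.text} {token.head.text}")
    return phrases

print(split_coordinated_modifiers(doc))  # -> ['green apples', 'red apples']
```

The deeper "hot and cold chicken soup" nesting would need the rule to also carry along `compound` children of the head noun, which is exactly why enumerating tree shapes up front matters.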

I would suggest first focussing on the cases where we have coordination inside a noun phrase. These will be the ones most useful for entity recognition. If we can enumerate the main construction cases we want to cover, we can then put together the target trees for them, and then test for those. For the tests, we definitely want to specify the dependency parse as part of the test case rather than letting it be predicted by the model. This way the test describes the tree, and also if we have different versions of the model the test doesn't break because it predicted something unexpected.
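A hedged sketch of what such a model-free test might look like, with the parse fully specified in the test case (the `make_doc` helper is illustrative):

```python
# Sketch: pin down the dependency parse in the test itself, so the
# test describes the tree and never depends on a model's prediction.
import spacy
from spacy.tokens import Doc

def make_doc(nlp, words, heads, deps):
    """Build a Doc with a hand-specified dependency parse."""
    return Doc(nlp.vocab, words=words, heads=heads, deps=deps)

nlp = spacy.blank("en")
# "hot and cold chicken soup": 'hot' modifies 'soup', 'cold' is
# conjoined to 'hot', 'chicken' is a compound modifier of 'soup'.
doc = make_doc(
    nlp,
    words=["hot", "and", "cold", "chicken", "soup"],
    heads=[4, 0, 0, 4, 4],
    deps=["amod", "cc", "conj", "compound", "ROOT"],
)
assert doc[2].dep_ == "conj" and doc[2].head.text == "hot"
assert doc[3].dep_ == "compound" and doc[3].head.text == "soup"
```

With the tree fixed like this, a new model version that parses the sentence differently cannot break the test.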

I made this a factory as opposed to a function component because I thought it would be nice for users to be able to add their own custom rules - thoughts?

Yes the extensibility is definitely good. Arguably we also want to support matcher or dependency matcher patterns directly, but this could be done via a function that takes the patterns as an argument.
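The patterns-as-an-argument idea could be sketched roughly as below; the `find_coordinations` function and the `ADJ_CONJ` pattern name are assumptions for illustration, though the `DependencyMatcher` pattern format itself is standard spaCy.

```python
# Sketch: a rule function that accepts user-supplied DependencyMatcher
# patterns instead of hard-coding them.
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# Pattern: an "amod" token with a directly attached "conj" child.
ADJ_CONJ = [
    {"RIGHT_ID": "mod", "RIGHT_ATTRS": {"DEP": "amod"}},
    {"LEFT_ID": "mod", "REL_OP": ">", "RIGHT_ID": "conj",
     "RIGHT_ATTRS": {"DEP": "conj"}},
]

def find_coordinations(doc, pattern):
    """Run a user-supplied dependency pattern and return matched tokens."""
    matcher = DependencyMatcher(doc.vocab)
    matcher.add("COORD", [pattern])
    return [[doc[i].text for i in token_ids] for _, token_ids in matcher(doc)]

nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["green", "and", "red", "apples"],
          heads=[3, 0, 0, 3],
          deps=["amod", "cc", "conj", "ROOT"])
print(find_coordinations(doc, ADJ_CONJ))  # -> [['green', 'red']]
```

Passing patterns in this way keeps the component's splitting logic separate from the tree shapes it recognises, so users can extend coverage without subclassing anything.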

honnibal avatar Feb 19 '24 15:02 honnibal