spaCy
Native coref component
Work-in-progress
Description
Creating a native coref component in spaCy
Types of change
new feature
Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
Status March 1:
- Wrote preliminary v3-compatible framework to facilitate experimentation with different coref models
- Currently assuming two different pipeline components:
  - `coref_er` / `CorefEntityRecognizer` is a rule-based mention detection algorithm: uses noun chunks, POS tags and named entities
  - `coref` / `CoreferenceResolver` assembles the provided mentions into clusters (dummy implementation)
- Using `doc.spans` to store the information:
  - `doc.spans[coref_mentions]` for storing all relevant coref mentions (nouns, pronouns, names, ...)
  - `doc.spans[coref_clusters_i]` for different clusters, indexed with `i`
- `Coref.v0` needs to be implemented and changed to `Coref.v1`
- `Scorer.score_clusters` method that currently uses a too simple scoring mechanism (binary relations between mentions), should be refined with actual coref scoring algorithm
While all of this is mostly dummy framework, it already helped discover some bugs & required functionality, cf PRs https://github.com/explosion/spaCy/pull/7197, https://github.com/explosion/spaCy/pull/7209 and https://github.com/explosion/spaCy/pull/7225.
Going forward, having this bare framework should facilitate working on this functionality with different people in parallel, filling in different parts...
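To make the "binary relations between mentions" scoring in the status notes concrete, here is a minimal sketch of pairwise link scoring. It is an illustration under assumptions, not the code in this PR: the helper names `cluster_links` and `pairwise_prf` are hypothetical, and mentions are represented as plain `(start, end)` token offsets rather than spaCy `Span` objects.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Mention = Tuple[int, int]  # hypothetical representation: (start, end) token offsets


def cluster_links(clusters: Iterable[Iterable[Mention]]) -> Set[Tuple[Mention, Mention]]:
    """Return every unordered pair of mentions that share a cluster."""
    links = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical order, so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(set(cluster)), 2):
            links.add((a, b))
    return links


def pairwise_prf(gold_clusters, pred_clusters) -> Tuple[float, float, float]:
    """Precision, recall and F1 over coreference links (binary mention relations)."""
    gold = cluster_links(gold_clusters)
    pred = cluster_links(pred_clusters)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [[(0, 1), (4, 5), (9, 10)]]
pred = [[(0, 1), (4, 5)], [(9, 10), (12, 13)]]
p, r, f = pairwise_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

A link-based score like this grows quadratically with cluster size and gives no credit for singleton mentions, which is one reason coreference work usually reports metrics such as MUC, B-cubed or CEAF instead. The description does not say which of these is intended as the "actual coref scoring algorithm".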
TODO
- [ ] Implement proper `coref` ML model
- [ ] Proper mention detection algorithm, rule-based, ML-based, something like the `SpanCategorizer`, ...
- [ ] Meaningful evaluation script
- [ ] Tune & benchmark
- [ ] Rewrite errors to use `spacy.errors`
Open questions / current issues
- While we talked about keeping `doc.spans` a relatively simple dictionary of strings mapping to lists of spans, we might consider having a more formal way of defining clusters that belong together - currently this is done by matching a prefix in the `spans` key, which is obviously not ideal
- The design with the rule-based `coref_er` is again awkward, because this component won't run during `nlp.update`, meaning that the `coref` model could only train on gold mentions, which is not a good idea in terms of generalizability and robustness of the ML model.
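The prefix-matching convention mentioned in the first open question can be sketched roughly as follows. This is a hedged illustration only: `get_clusters` is a hypothetical helper, the default prefix `coref_clusters` is taken from the key names in the description, and a plain dictionary of strings stands in for `doc.spans` so the snippet runs without spaCy installed.

```python
import re
from typing import Dict, List, Mapping, Sequence


def get_clusters(spans: Mapping[str, Sequence], prefix: str = "coref_clusters") -> Dict[int, List]:
    """Collect the span groups whose key looks like '<prefix>_<i>', keyed by the integer i."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    clusters = {}
    for key, group in spans.items():
        match = pattern.match(key)
        if match:
            clusters[int(match.group(1))] = list(group)
    # Sort numerically, so cluster 10 comes after cluster 2 rather than after cluster 1.
    return dict(sorted(clusters.items()))


# Stand-in for doc.spans: any mapping from string keys to lists of spans.
spans = {
    "coref_mentions": ["Sarah", "she", "the book", "it"],
    "coref_clusters_1": ["Sarah", "she"],
    "coref_clusters_10": ["Paris"],
    "coref_clusters_2": ["the book", "it"],
    "coref_clusters_extra": ["unrelated"],
}
print(get_clusters(spans))
```

The fragility is visible here: which keys count as clusters depends entirely on a naming convention, and an unrelated key such as `coref_clusters_extra` is only excluded because the pattern insists on a numeric suffix. Since `doc.spans` is dictionary-like, the same function should also accept it directly, though that is an assumption this sketch does not exercise.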
Just saying that I hope that the state of the art will be available eventually.
Anyway, this is a very welcome improvement that I'm looking forward to :)
Out of curiosity, are there any updates on this?
@KTRosenberg I'm working on it :) When it's ready, we'll do a big announcement and I'll post in the relevant thread on neural-coref too.
I don't think it's that far off, but we won't have a date until it's ready.
That’s fair enough. I have a time-sensitive project and ended up just reverting to neuralcoref + Spacy 2.3. Hopefully for my simple cases the old large non-transformer model works well enough ;) I figure the update will need at least another month, so I’ll update when that happens for sure.
Consider staying with spaCy 3 and integrating https://github.com/msg-systems/coreferee. While I don't have experience with the @msg-systems solution, our experience with neural-coref has not been good w.r.t. space/time/stability.
@polm thanks for the updates 😊 Am I correct in thinking you are taking a similar approach to coreferee in terms of dependency match to pull out possible pairs and then neural network to assess them?
@BountyCountry No, the new coref doesn't use the dependency parse at all, it's an end-to-end neural system.
You're welcome to discuss coreferee but could you maybe do it in another thread in Discussions or something? Since this is the PR for the new coref I'd like to keep it focussed on that.
Great! Tracking this as well
Great work
@polm Thanks for developing this! Is there some timeline or approximate release dates for the coref component?
@caballeto We're working on it but don't have a date. When it's done we'll announce it here.
@polm Are you guys still working on this? We're eagerly waiting for this so we can upgrade to spaCy 3 :)
How is this looking?
Apologies for the silence on this! We have deprioritized this work for a few months, but are now picking it back up. We are actively exploring two coref solutions: the approach by @polm outlined in this PR, as well as building further upon the work @richardpaulhudson has done on coreferee, as, incidentally, Richard has recently joined our awesome team ;-)
We want to get this right - both in terms of implementation as well as making sure we address the right problem, as "coreference resolution" doesn't always mean the same thing. We appreciate your patience on this and we aim to make the wait worth your while!
I want to try this new feature, please let me know when this PR gets merged.
@wangcj05 if you want to keep up on the status of the PR, you can click the "subscribe" button in the Github UI to receive notifications.
Can't wait to see this feature merged. What's the current state?
I could be confused about the details, but I think that all the get_loss and scoring methods in this PR need to be updated to handle misaligned tokenization. There are lots of direct uses of span.start and span.end for reference docs that are not going to work as intended when the predicted tokenization differs.
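To illustrate the concern in the comment above: when the predicted tokenization differs from the reference one, a reference `span.start` / `span.end` is a token index into the wrong token sequence. A minimal sketch of aligning through character offsets follows. It is not code from this PR: `align_span` is a hypothetical helper, and tokens are represented as plain `(start_char, end_char)` pairs so the snippet runs without spaCy.

```python
from typing import List, Optional, Tuple

Offsets = Tuple[int, int]  # (start_char, end_char) of one token


def align_span(
    ref_tokens: List[Offsets], pred_tokens: List[Offsets], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Map the reference token span [start, end) onto predicted token indices.

    Returns None when the span boundaries do not fall on predicted token boundaries.
    """
    if not 0 <= start < end <= len(ref_tokens):
        return None
    start_char = ref_tokens[start][0]
    end_char = ref_tokens[end - 1][1]
    # Index predicted tokens by the character positions where they start and end.
    starts = {tok[0]: i for i, tok in enumerate(pred_tokens)}
    ends = {tok[1]: i + 1 for i, tok in enumerate(pred_tokens)}
    if start_char in starts and end_char in ends:
        return starts[start_char], ends[end_char]
    return None


# Text: "New York-based firm"
ref_tokens = [(0, 3), (4, 8), (8, 9), (9, 14), (15, 19)]  # New | York | - | based | firm
pred_tokens = [(0, 3), (4, 14), (15, 19)]                 # New | York-based | firm

print(align_span(ref_tokens, pred_tokens, 4, 5))  # "firm": reference index 4, predicted index 2
print(align_span(ref_tokens, pred_tokens, 0, 2))  # "New York" ends inside "York-based"
```

In the first call the mention exists in both tokenizations but at different token indices, so reusing the reference indices directly would point at the wrong tokens or past the end of the predicted doc. In the second call no predicted span matches, and a loss or scoring function has to decide how to treat such mentions. spaCy's `Example` object exposes alignment helpers such as `get_aligned_spans_y2x` for this purpose.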
Any updates?
Hi! This is work in progress. We've got open issues and comments on this PR and to ensure those remain visible and we can focus on the technical implementation, I've locked this conversation to contributors only. You can follow the status on this PR by looking at the commits and review comments. If you have any other questions, feel free to ask them on our discussion forum. Thanks!
@explosion-bot please test_gpu
🪁 Successfully triggered build on Buildkite
URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/100
Closing this PR, as we'll release the functionality in spacy-experimental first: https://github.com/explosion/spacy-experimental/pull/17
The docs PR is here: https://github.com/explosion/spaCy/pull/11291
Just wanted to send a quick update about coref support in spaCy:
- we've released an end-to-end neural coref component as part of `spacy-experimental` 0.6.0. Just run `pip install spacy-experimental==0.6.0` and it will automatically become available in your spaCy installation.
- the release contains a pretrained pipeline for you to play with: https://github.com/explosion/spacy-experimental/releases/tag/v0.6.0
- If you're interested in training a `coref` pipeline yourself, check out this project we've assembled: https://github.com/explosion/projects/tree/v3/experimental/coref
- we've published a blog with many details on this architecture: https://explosion.ai/blog/coref
- a video will be released soon :-)
We'd love for you to try this out, and any feedback is very welcome over at the discussion forum!