spaCy Native coref component

Work-in-progress

Description

Creating a native coref component in spaCy

Types of change

new feature

Checklist

[x] I have submitted the spaCy Contributor Agreement.
[ ] I ran the tests, and all new and existing tests passed.
[ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

Mar 03 '21 12:03 svlandeg

Status March 1:

- Wrote a preliminary v3-compatible framework to facilitate experimentation with different coref models
- Currently assuming two different pipeline components:
  - coref_er / CorefEntityRecognizer is a rule-based mention detection algorithm that uses noun chunks, POS tags and named entities
  - coref / CoreferenceResolver assembles the provided mentions into clusters (dummy implementation)
- Using doc.spans to store the information:
  - doc.spans[coref_mentions] for storing all relevant coref mentions (nouns, pronouns, names, ...)
  - doc.spans[coref_clusters_i] for the different clusters, indexed with i
- Coref.v0 needs to be implemented and changed to Coref.v1
- The Scorer.score_clusters method currently uses an overly simple scoring mechanism (binary relations between mentions) and should be refined with an actual coref scoring algorithm
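
To make the "binary relations between mentions" scoring concrete, here is a minimal sketch of link-based precision/recall/F1. It is not the code from this PR: the function names `cluster_links` and `pairwise_prf` are hypothetical, and mentions are represented as plain `(start, end)` token offsets rather than spaCy spans.

```python
from itertools import combinations


def cluster_links(clusters):
    """Return the set of unordered mention pairs that share a cluster."""
    links = set()
    for cluster in clusters:
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(set(cluster)), 2):
            links.add((a, b))
    return links


def pairwise_prf(gold_clusters, pred_clusters):
    """Score predicted clusters by the coreference links they share with gold."""
    gold = cluster_links(gold_clusters)
    pred = cluster_links(pred_clusters)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: mentions as (start_token, end_token) offsets.
gold = [[(0, 2), (5, 6), (9, 10)], [(3, 4), (12, 13)]]
pred = [[(0, 2), (5, 6)], [(3, 4), (9, 10), (12, 13)]]
print(pairwise_prf(gold, pred))
```

A score like this ignores singleton mentions and weights large clusters quadratically, which is one reason coref evaluation usually relies on dedicated metrics such as MUC, B-cubed or CEAF instead.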

While all of this is mostly a dummy framework, it has already helped uncover some bugs & required functionality, cf. PRs https://github.com/explosion/spaCy/pull/7197, https://github.com/explosion/spaCy/pull/7209 and https://github.com/explosion/spaCy/pull/7225.

Going forward, having this bare framework should facilitate working on this functionality with different people in parallel, filling in different parts...

TODO

[ ] Implement proper coref ML model
[ ] Proper mention detection algorithm, rule-based, ML-based, something like the SpanCategorizer, ...
[ ] Meaningful evaluation script
[ ] Tune & benchmark
[ ] Rewrite errors to use spacy.errors

Open questions / current issues

- While we talked about keeping doc.spans a relatively simple dictionary of strings mapping to lists of spans, we might consider having a more formal way of defining clusters that belong together. Currently this is done by matching a prefix in the spans key, which is obviously not ideal.
- The design with the rule-based coref_er is again awkward, because this component won't run during nlp.update, meaning that the coref model could only train on gold mentions, which is not a good idea in terms of generalizability and robustness of the ML model.
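
As an illustration of the prefix-matching convention described above, the following sketch collects clusters from a spans dictionary by key prefix. It is an assumption-laden stand-in, not the PR's implementation: a plain dict of `(start, end)` tuples takes the place of doc.spans, the helper name `clusters_by_prefix` is hypothetical, and the key names simply follow the coref_mentions / coref_clusters_i pattern from the status notes.

```python
def clusters_by_prefix(spans, prefix="coref_clusters"):
    """Collect span lists whose key is `prefix` + "_" + integer, ordered by that integer."""
    found = []
    for key, group in spans.items():
        if not key.startswith(prefix + "_"):
            continue
        suffix = key[len(prefix) + 1:]
        # Skip keys that merely share the prefix but carry no numeric index.
        if suffix.isdigit():
            found.append((int(suffix), list(group)))
    return [group for _, group in sorted(found)]


# Plain dict standing in for doc.spans; values are (start, end) token offsets.
spans = {
    "coref_mentions": [(0, 1), (4, 5), (7, 9), (11, 12)],
    "coref_clusters_2": [(7, 9), (11, 12)],
    "coref_clusters_1": [(0, 1), (4, 5)],
    "ner": [(7, 9)],
}
print(clusters_by_prefix(spans))
```

The sketch also shows why the convention is fragile: any unrelated key that happens to start with the same prefix and a number would silently be treated as a cluster.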

Mar 03 '21 12:03 svlandeg

Just saying that I hope that the state of the art will be available eventually.

Anyway, this is a very welcome improvement that I'm looking forward to :)

Mar 29 '21 22:03 LifeIsStrange

Out of curiosity, are there any updates on this?

Jul 14 '21 22:07 KTRosenberg

@KTRosenberg I'm working on it :) When it's ready, we'll do a big announcement and I'll post in the relevant thread on neural-coref too.

I don't think it's that far off, but we won't have a date until it's ready.

Jul 15 '21 03:07 polm

> @KTRosenberg I'm working on it :) When it's ready, we'll do a big announcement and I'll post in the relevant thread on neural-coref too.
>
> I don't think it's that far off, but we won't have a date until it's ready.

That’s fair enough. I have a time-sensitive project and ended up just reverting to neuralcoref + spaCy 2.3. Hopefully for my simple cases the old large non-transformer model works well enough ;) I figure the update will need at least another month, so I’ll update when that happens for sure.

Jul 15 '21 03:07 KTRosenberg

> > @KTRosenberg I'm working on it :) When it's ready, we'll do a big announcement and I'll post in the relevant thread on neural-coref too. I don't think it's that far off, but we won't have a date until it's ready.
>
> That’s fair enough. I have a time-sensitive project and ended up just reverting to neuralcoref + spaCy 2.3. Hopefully for my simple cases the old large non-transformer model works well enough ;) I figure the update will need at least another month, so I’ll update when that happens for sure.

Consider staying with spaCy 3 and integrating https://github.com/msg-systems/coreferee. While I don't have experience with the @msg-systems solution, our experience with neural-coref has not been good w.r.t. space/time/stability.

Jul 15 '21 08:07 ofirnk

@polm thanks for the updates 😊 Am I correct in thinking you are taking a similar approach to coreferee in terms of dependency match to pull out possible pairs and then neural network to assess them?

Jul 19 '21 10:07 dogberto

@BountyCountry No, the new coref doesn't use the dependency parse at all, it's an end-to-end neural system.

You're welcome to discuss coreferee but could you maybe do it in another thread in Discussions or something? Since this is the PR for the new coref I'd like to keep it focussed on that.

Jul 19 '21 10:07 polm

Great! Tracking this as well

Jul 27 '21 13:07 FedericoCampe8

Great work

Aug 03 '21 08:08 wutaiqiang

@polm Thanks for developing this! Is there some timeline or approximate release dates for the coref component?

Aug 11 '21 12:08 caballeto

@caballeto We're working on it but don't have a date. When it's done we'll announce it here.

Aug 11 '21 13:08 polm

@polm Are you guys still working on this? We're eagerly waiting for this so we can upgrade to spaCy 3 :)

Oct 27 '21 04:10 itssimon

How is this looking?

Nov 25 '21 07:11 KTRosenberg

Apologies for the silence on this! We have deprioritized this work for a few months, but are now picking it back up. We are actively exploring two coref solutions: the approach by @polm outlined in this PR, as well as building further upon the work @richardpaulhudson has done on coreferee, as, incidentally, Richard has recently joined our awesome team ;-)

We want to get this right - both in terms of implementation as well as making sure we address the right problem, as "coreference resolution" doesn't always mean the same thing. We appreciate your patience on this and we aim to make the wait worth your while!

Dec 06 '21 16:12 svlandeg

I want to try this new feature, please let me know when this PR gets merged.

Mar 22 '22 15:03 wangcj05

@wangcj05 if you want to keep up on the status of the PR, you can click the "subscribe" button in the GitHub UI to receive notifications.

Mar 23 '22 05:03 polm

Can't wait to see this feature merged. What's the current state?

Apr 23 '22 19:04 moelllerniklas

I could be confused about the details, but I think that all the get_loss and scoring methods in this PR need to be updated to handle misaligned tokenization. There are lots of direct uses of span.start and span.end for reference docs that are not going to work as intended when the predicted tokenization differs.
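
A hedged sketch of the problem (not code from this PR): when the reference and predicted docs are tokenized differently, token indices such as span.start and span.end cannot be carried over directly, but character offsets can. Below, tokens are modelled as plain `(start_char, end_char)` tuples and the helper name `align_span` is hypothetical.

```python
def align_span(ref_tokens, pred_tokens, start, end):
    """Map the token span [start, end) of ref_tokens onto pred_tokens via character offsets.

    Each token is a (start_char, end_char) tuple. Returns (start, end) token
    indices into pred_tokens, or None if the span does not fall on predicted
    token boundaries.
    """
    if not 0 <= start < end <= len(ref_tokens):
        raise ValueError("span is out of range for the reference tokens")
    start_char = ref_tokens[start][0]
    end_char = ref_tokens[end - 1][1]
    starts = {tok[0]: i for i, tok in enumerate(pred_tokens)}
    ends = {tok[1]: i + 1 for i, tok in enumerate(pred_tokens)}
    if start_char in starts and end_char in ends:
        pred_start, pred_end = starts[start_char], ends[end_char]
        if pred_start < pred_end:
            return pred_start, pred_end
    return None


# Text: "I can't go."
# Reference tokenization: I | ca | n't | go | .
ref = [(0, 1), (2, 4), (4, 7), (8, 10), (10, 11)]
# Predicted tokenization: I | can't | go | .
pred = [(0, 1), (2, 7), (8, 10), (10, 11)]

print(align_span(ref, pred, 3, 4))  # reference span "go"
print(align_span(ref, pred, 1, 3))  # reference span "ca n't"
print(align_span(ref, pred, 2, 3))  # reference span "n't" has no counterpart
```

Reusing the reference indices 3..4 directly on the predicted tokens would select "." rather than "go", which is the kind of silent mismatch described above. In spaCy itself, `Doc.char_span` offers comparable behaviour, returning None when the character offsets do not line up with token boundaries.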

May 27 '22 08:05 adrianeboyd

Any updates?

Jun 21 '22 19:06 moelllerniklas

Hi! This is work in progress. We've got open issues and comments on this PR and to ensure those remain visible and we can focus on the technical implementation, I've locked this conversation to contributors only. You can follow the status on this PR by looking at the commits and review comments. If you have any other questions, feel free to ask them on our discussion forum. Thanks!

Jun 21 '22 19:06 svlandeg

@explosion-bot please test_gpu

Jul 12 '22 07:07 polm

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/100

Jul 12 '22 07:07 explosion-bot

Closing this PR, as we'll release the functionality in spacy-experimental first: https://github.com/explosion/spacy-experimental/pull/17

The docs PR is here: https://github.com/explosion/spaCy/pull/11291

Aug 11 '22 07:08 svlandeg

Just wanted to send a quick update about coref support in spaCy:

- We've released an end-to-end neural coref component as part of spacy-experimental 0.6.0. Just run pip install spacy-experimental==0.6.0 and it will automatically become available in your spaCy installation.
- The release contains a pretrained pipeline for you to play with: https://github.com/explosion/spacy-experimental/releases/tag/v0.6.0
- If you're interested in training a coref pipeline yourself, check out this project we've assembled: https://github.com/explosion/projects/tree/v3/experimental/coref
- We've published a blog post with many details on this architecture: https://explosion.ai/blog/coref
- A video will be released soon :-)

We'd love for you to try this out, and any feedback is very welcome over at the discussion forum!

Oct 06 '22 14:10 svlandeg
