Add spancat_exclusive pipeline for non-overlapping span labelling tasks
Context
The current spancat implementation always treats the span labelling task as a multilabel problem: it uses a Logistic layer to output class probabilities independently for each class. However, when presented with a multiclass problem (exclusive classes), not using the correct modeling assumptions can be a disadvantage. The spancat_exclusive component uses a Softmax layer instead.
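As a minimal sketch of the difference (illustrative only, not the actual spancat model configs): the multilabel scorer ends in a Logistic layer, while the exclusive variant ends in a Softmax layer.

```python
# Hedged sketch of the two output layers, assuming 4 classes; not the real spancat architectures.
from thinc.api import Linear, Logistic, Softmax, chain

n_classes = 4
# Multilabel spancat: an independent sigmoid per class, so several labels can fire at once.
multilabel_scorer = chain(Linear(nO=n_classes), Logistic())
# Exclusive variant: a Softmax over the classes, so probabilities sum to 1 and exactly
# one label "wins" per span.
exclusive_scorer = Softmax(nO=n_classes)
```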
Description
This PR adds another pipeline, spancat_exclusive, to account for exclusive classes in span categorization tasks. It does this by introducing the concept of a "negative label" or "no label." In spancat, the number of span labels is exactly the same as what's found in a dataset's annotation. Here, we add another column to account for the negative label.
- We didn't touch the `add_label` implementation; it is the same as `spancat`'s. Instead, we implemented two additional properties: `_negative_label` (returns the index of the negative label) and `_n_labels` (returns the length of `label_data` + 1), and changed `initialize` to create a `Softmax` layer with the extra negative label (a rough sketch follows this list).
- This in turn affects how the annotations are created during inference (cf. `set_annotations`). Again, we modified `_make_span_group` to accommodate this change.
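A minimal, hypothetical sketch of the two properties described in the first bullet, where `label_data` stands in for the component's stored labels (the actual implementation may differ):

```python
# Illustration of the negative-label bookkeeping; not the actual spaCy code.
class SpanCategorizerExclusiveSketch:
    def __init__(self, label_data):
        self.label_data = list(label_data)

    @property
    def _negative_label(self) -> int:
        """Index of the extra 'no label' column (always the last column)."""
        return len(self.label_data)

    @property
    def _n_labels(self) -> int:
        """Number of output classes, including the negative label."""
        return len(self.label_data) + 1
```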
Technical Explanation of Changes
⏯️ Training: how is the loss computed this time? (also a note about the negative_weight param)
During initialization, we pass the number of labels (n_labels + 1) so that the score matrix has shape (n_samples, n_labels + 1), where the +1 accounts for the negative label. At training time, the score matrix should already account for the negative label. In this implementation, the negative label is always in the last column.
Figure: Simple example using CoNLL labels (ORG, MISC, PER, LOC)
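As a small stand-in for the layout the figure illustrates (shapes and values are illustrative only):

```python
# Score matrix over the CoNLL labels plus the trailing negative ("no label") column.
import numpy

labels = ["ORG", "MISC", "PER", "LOC"]             # n_labels = 4
n_spans = 2
scores = numpy.zeros((n_spans, len(labels) + 1))   # shape (n_samples, n_labels + 1)
# scores[:, :4] -> class probabilities for ORG, MISC, PER, LOC
# scores[:, 4]  -> probability of the negative label (always the last column)
```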

In the get_loss() function, we then assign the value 1.0 to the last column whenever a particular span is a negative example.
```python
# spancat_exclusive.py::SpanCategorizerExclusive.get_loss()
# `target` is the gold score matrix of shape (n_samples, n_labels + 1);
# `negative_spans` marks candidate spans that have no gold label.
target = self.model.ops.asarray(target, dtype="f")  # type: ignore
negative_samples = numpy.nonzero(negative_spans)[0]
# Set the negative-label column (always the last one) to 1.0 for negative examples.
target[negative_samples, self._negative_label] = 1.0
```
We then compute the scores and loss for backprop as usual (i.e., d_scores = scores - target). We also added the option of specifying a negative_weight to control the effect of the negative class (a form of class weighting). Higher values (>1) increase the effect of the negative class, while lower values (<1) reduce it.
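A minimal sketch of this gradient step, assuming negative_weight is applied by rescaling the gradient of the negative examples (the exact placement in the real implementation may differ):

```python
# Illustrative only; mirrors the description above rather than the actual spaCy code.
import numpy

def get_gradient(scores, target, negative_spans, negative_weight=1.0):
    # scores, target: float arrays of shape (n_spans, n_labels + 1)
    # negative_spans: boolean mask of spans that have no gold label
    d_scores = scores - target
    # Class weighting: up- or down-weight the contribution of negative examples.
    negative_samples = numpy.nonzero(negative_spans)[0]
    d_scores[negative_samples] *= negative_weight
    loss = float((d_scores ** 2).sum())
    return d_scores, loss
```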
⏯️ Inference: how are the annotations predicted? (also a note about the allow_overlap param)
During inference, we drop the spans for which the negative label is the predicted class. In addition, if the allow_overlap parameter is set to False, overlapping spans are not stored; only the span with the highest predicted probability is kept. Overlaps are tracked with the Ranges data structure.
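A simplified sketch of this filtering, assuming plain (start, end) tuples instead of spaCy's Ranges helper and the component's internal label mapping:

```python
# Illustrative inference-time filtering; not the actual spaCy implementation.
import numpy

def filter_predictions(scores, labels, negative_label_index, spans, allow_overlap=True):
    # scores: array of shape (n_spans, n_labels + 1); spans: list of (start, end) offsets
    predicted = scores.argmax(axis=1)
    keep = []
    occupied = []  # ranges already claimed by a higher-scoring span
    # Visit spans from highest to lowest predicted probability.
    for i in numpy.argsort(-scores.max(axis=1)):
        if predicted[i] == negative_label_index:
            continue  # the negative label won: no annotation for this span
        start, end = spans[i]
        if not allow_overlap and any(s < end and start < e for s, e in occupied):
            continue  # overlapping spans are dropped, keeping the highest-scoring one
        occupied.append((start, end))
        keep.append((start, end, labels[predicted[i]]))
    return keep
```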
⏯️ Testing on other datasets [WIP]
TODO - Compare spancat and spancat_exclusive on some datasets
Types of change
- Feature implementation
- New pipeline for exclusive spancat
- Tests and documentation
Checklist
- [ ] I confirm that I have the right to submit this contribution under the project's MIT license.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
I think implementation-wise this PR can be reviewed. We'd definitely still want to run a few experiments comparing spancat_exclusive and spancat on a number of datasets. I'm not sure if I should do that first, or perhaps you want to take a look at the current code before we run experiments. cc: @kadarakos @adrianeboyd
Hi! I've added a benchmark that tests spancat_exclusive and spancat on a number of NER datasets. Overall, the former seems to work well (across three trials, reporting the average and standard deviation). I'd like another round of review to check the implementation. 🙇
You can find the benchmarking project here. It's my fork of explosion/projects. Once spancat_exclusive has been merged, I can also open another PR to explosion/projects to include this benchmark.
Can you add tests for this similar to the existing spancat tests?
Hi! I extended the tests from spancat to also include spancat_exclusive. I also updated the website documentation for spancat to mention additional parameters from spancat_exclusive.
@explosion-bot please test_gpu
🪁 Successfully triggered build on Buildkite
URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/117
@ljvmiranda921 : could you have a look at the conflicts? 🙏
It looks like it needs to be updated to handle empty docs.
Ok will do that!
Saw the updates and the merge conflict. I'll solve it first thing tomorrow!
Done! I saw the comments regarding the docstrings for spancat.py. I can include the changes here unless we want it in a different PR!
> I saw the comments regarding the docstrings for `spancat.py`. I can include the changes here unless we want it in a different PR!
These are pretty minimal fixes; I don't think you need to open a new PR for them. It's basically making sure things are consistent across the two components.
A few things related to the overall config/design vs. spancat:
- this feels like it should be what happens for `max_positive = 1` in `spancat`
- wouldn't it make sense for `allow_overlap = false` to also be an option for `spancat`?
In general it feels like this functionality should be possible with options for one SpanCategorizer class and the only difference for the spancat_exclusive factory (or whatever it's called) is that there is a different model in the default config. (I'm not entirely sure, but I do think it's going to be tricky to do all this with one spancat config because you can't parameterize the default model.)
I agree that `max_positive = 1` basically implies `spancat_exclusive`. I also agree that `allow_overlap` is not specific to `spancat_exclusive`.
But the difference is not only Logistic vs. Softmax in the output layer: `get_loss` is also different because of the negative class. In addition, `spancat_exclusive` has the `negative_weight` parameter, whereas `spancat` has `threshold`. So if `SpanCategorizer` implemented both, `threshold` would go unused for `max_positive = 1`, and `negative_weight` should only be used when `max_positive = 1`.
Should we merge the two implementations into a single SpanCategorizer first and see how that looks?
The spancat_exclusive.py file will be deleted if we agree on the design. I quite like that there is one SpanCategorizer and two factories.
Just a couple small notes, not a thorough review yet
I still need to add make_span_group_singlelabel and make_span_group_multilabel tests to see if they work in the add_negative_label case.
Encountered a bit of a problem. When add_negative_label = True, I need to access self._negative_label_index, but _make_span_group takes labels: List[str], so I don't know which one is supposed to be the negative label. In general, I'm not sure why the method ended up like this originally: why allow _make_span_group to take an arbitrary list as labels, one that doesn't actually have to match the scores and indices coming from the model?
I can probably work around it somehow, but what is already there feels kind of clunky to begin with. I don't think we should use this arbitrary List[str]; it would be better to use the component's own self.labels or self.label_map. But maybe I'm missing a use case where this flexibility is nice? If so, why not have make_span_group in utils take all the different kinds of arguments, but make sure that the SpanCategorizer calls it with something meaningful? But then this would be breaking :( .
I think it makes sense to refactor _make_span_group to remove the labels argument.
> I think it makes sense to refactor `_make_span_group` to remove the `labels` argument.
Okay, that's nice! I'm getting somewhere with it, but in the meantime I also realized that I don't necessarily understand the reason for having the max_positive argument in the multilabel case either. We already have the threshold. So why would you say "keep only the top-3 labels that pass the threshold"? Why not all labels that pass the threshold?
This came up because, in the add_negative_label case, if the negative label is in the top 3, the current code returns only the top 2, so I would then have to look for more options that pass the threshold. It just made me wonder: why not return all of them? What's the use case?
I think the last thing for the docs is to figure out how to mark all the new things clearly as new.
> I think the last thing for the docs is to figure out how to mark all the new things clearly as new.
How do we usually do that? We add an indication of which version a feature is available from, right?
Yes, there would be new tags from 3.5.1. What I'm not sure is exactly how best to mark spancat_singlelabel on the combined API docs page.
I added the <Tag variant="new">3.5.1</Tag> in some places where I felt it communicates more or less unambiguously what is new, but I wasn't exactly sure.
The markers should go next to the setting name in the settings column rather than in the descriptions.
Could you also revert all the formatting-only changes to the other .mdx files?
I think this is good to go!