spaCy
Add spancat_exclusive pipeline for non-overlapping span labelling tasks
Context
The current spancat implementation always treats the span labelling task as a multilabel problem. It uses the Logistic layer to output class probabilities independently for each class. However, when presented with a multiclass (exclusive-class) problem, it can be a disadvantage not to use the correct modelling assumptions. The spancat_exclusive component uses the Softmax layer instead.
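For illustration, the difference in output layers looks roughly like this in Thinc (a minimal sketch, not the exact layer stack used by either component; names are illustrative):

```python
from thinc.api import Linear, Logistic, Softmax, chain

n_labels = 4  # e.g. ORG, MISC, PER, LOC

# Multilabel spancat: independent per-class probabilities via a sigmoid,
# so several labels can be active for the same span.
multilabel_output = chain(Linear(nO=n_labels), Logistic())

# Exclusive variant: a softmax over mutually exclusive classes, with one
# extra column reserved for the negative ("no label") class.
exclusive_output = Softmax(nO=n_labels + 1)
```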
Description
This PR adds another pipeline, spancat_exclusive, to account for exclusive classes in span categorization tasks. It does this by introducing the concept of a "negative label" or "no label." In spancat, the number of span labels is exactly the same as what's found in a dataset's annotation. Here, we add another column to account for the negative label.
- We didn't touch the add_label implementation; it is the same as spancat's. Instead, we implemented two additional properties, _negative_label (returns the index of the negative label) and _n_labels (returns the length of label_data + 1), and changed initialize to create a Softmax layer with the extra negative label (see the sketch after this list).
- This in turn affects how the annotations are created during inference (cf. set_annotations). Again, we modified _make_span_group to accommodate this change.
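A rough sketch of what those two properties boil down to (abridged and hypothetical, not the exact PR code):

```python
class SpanCategorizerExclusive:  # abridged; the real component is a trainable pipe
    @property
    def _negative_label(self) -> int:
        # The negative ("no label") class sits in the last column of the
        # score matrix, after all real labels.
        return len(self.label_data)

    @property
    def _n_labels(self) -> int:
        # All real labels plus one extra column for the negative label.
        return len(self.label_data) + 1
```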
Technical Explanation of Changes
⏯️ Training: how is the loss computed this time? (also a note about the negative_weight param)
During initialization, we pass the number of labels (n_labels + 1) so that the score matrix has shape (n_samples, n_labels + 1), where the +1 accounts for the negative example. At train time, the score matrix already accounts for the negative example; in this implementation, the negative example always occupies the last column.
Figure: Simple example using CoNLL labels (ORG, MISC, PER, LOC)
In the get_loss() function, we then assign the value 1.0 to the last column whenever a particular span is a negative example.
```python
# spancat_exclusive.py::SpanCategorizerExclusive.get_loss()
target = self.model.ops.asarray(target, dtype="f")  # type: ignore
negative_samples = numpy.nonzero(negative_spans)[0]
target[negative_samples, self._negative_label] = 1.0
```
We then compute the gradient and loss for backprop as usual (i.e., d_scores = scores - target). We also added an option of specifying a negative_weight to "control" the effect of the negative class (a form of class weighting). Higher values (>1) increase the effect of the negative class, while lower values (<1) reduce it.
⏯️ Inference: how are the annotations predicted? (also a note about the allow_overlap param)
During inference, we remove the samples where the negative label is the prediction. In addition, if the allow_overlap parameter is set to False, overlapping spans are not stored (only the span with the highest predicted probability is kept). This is tracked by the Ranges data structure.
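As an illustration of these two steps (drop negative predictions, then resolve overlaps greedily), a simplified version could look like the snippet below; the variable names and the list-based overlap check are stand-ins for the actual Ranges bookkeeping:

```python
import numpy

# Toy inputs: per-span scores (last column = negative label) and the
# (start, end) token offsets of each candidate span.
scores = numpy.asarray(
    [[0.7, 0.1, 0.1, 0.05, 0.05],   # span 0: confident ORG
     [0.1, 0.1, 0.1, 0.1, 0.6],     # span 1: predicted negative, so dropped
     [0.4, 0.1, 0.1, 0.1, 0.3]],    # span 2: ORG, but overlaps span 0
    dtype="f",
)
span_offsets = [(0, 2), (3, 4), (1, 3)]
labels = ["ORG", "MISC", "PER", "LOC"]
negative_index = len(labels)
allow_overlap = False

predicted = scores.argmax(axis=1)
kept = []
occupied = []  # stand-in for the Ranges structure tracking claimed token ranges
# Visit spans from highest to lowest predicted probability.
for idx in numpy.argsort(-scores.max(axis=1)):
    if predicted[idx] == negative_index:
        continue  # negative label wins: no annotation for this span
    start, end = span_offsets[idx]
    if not allow_overlap and any(start < e and s < end for s, e in occupied):
        continue  # a higher-scoring span already claimed these tokens
    occupied.append((start, end))
    kept.append((start, end, labels[predicted[idx]]))

print(kept)  # [(0, 2, 'ORG')]
```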
⏯️ Testing on other datasets [WIP]
TODO - Compare spancat and spancat_exclusive on some datasets
Types of change
- Feature implementation
- New pipeline for exclusive spancat
- Tests and documentation
Checklist
- [ ] I confirm that I have the right to submit this contribution under the project's MIT license.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
I think implementation-wise this PR can be reviewed. We'd definitely still want to run a few experiments comparing exclusive_spancat and spancat on a number of datasets. I'm not sure if I should do that one first, or perhaps you want to take a look at the current code before we run experiments. cc: @kadarakos @adrianeboyd
Hi! I've added a benchmark that tests spancat_exclusive and spancat on a number of NER datasets. Overall, the former seems to work well (across three trials, reporting their avg. and stdev). I'd like another round of review to check in with the implementation. 🙇
You can find the benchmarking project here. It's my fork of explosion/projects. Once exclusive_spancat has been merged, I can also make another PR to explosion/projects to include this benchmark.
Can you add tests for this, similar to the existing spancat tests?
Hi! I extended the tests from spancat to also include spancat_exclusive. I also updated the website documentation for spancat to mention the additional parameters from spancat_exclusive.
@explosion-bot please test_gpu
🪁 Successfully triggered build on Buildkite
URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/117
@ljvmiranda921 : could you have a look at the conflicts? 🙏
It looks like it needs to be updated to handle empty docs.
Ok will do that!
Saw the updates and the merge conflict. I'll solve it first thing tomorrow!
Done! I saw the comments regarding the docstrings for spancat.py. I can include the changes here unless we want it in a different PR!
> I saw the comments regarding the docstrings for spancat.py. I can include the changes here unless we want it in a different PR!
These are pretty minimal fixes; I don't think you need to open a new PR for them. It's basically making sure things are consistent across the two components.
A few things related to the overall config/design vs. spancat:
- this feels like it should be what happens for max_positive = 1 in spancat
- wouldn't it make sense for allow_overlap = false to also be an option for spancat?

In general it feels like this functionality should be possible with options for one SpanCategorizer class, and the only difference for the spancat_exclusive factory (or whatever it's called) is that there is a different model in the default config. (I'm not entirely sure, but I do think it's going to be tricky to do all this with one spancat config because you can't parameterize the default model.)
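For illustration, a single class behind two factories could look roughly like this; the factory names, config keys, and the placeholder component class are all hypothetical simplifications:

```python
from spacy.language import Language


class SharedSpanCategorizer:
    """Placeholder for the shared component class, just to keep the sketch self-contained."""

    def __init__(self, vocab, name, **cfg):
        self.vocab, self.name, self.cfg = vocab, name, cfg


@Language.factory("spancat_demo", default_config={"max_positive": None, "threshold": 0.5})
def make_spancat(nlp, name, max_positive, threshold):
    # Multilabel defaults: Logistic output in the model, threshold-based cut-off.
    return SharedSpanCategorizer(nlp.vocab, name, max_positive=max_positive, threshold=threshold)


@Language.factory("spancat_exclusive_demo", default_config={"negative_weight": 1.0, "allow_overlap": False})
def make_spancat_exclusive(nlp, name, negative_weight, allow_overlap):
    # Exclusive defaults: Softmax output with a negative label, implying max_positive = 1.
    return SharedSpanCategorizer(nlp.vocab, name, max_positive=1,
                                 negative_weight=negative_weight, allow_overlap=allow_overlap)
```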
I agree that max_positive = 1 basically implies spancat_exclusive. Also agree that allow_overlap is not specific to spancat_exclusive.
But the difference is not only Logistic vs. Softmax in the output layer: get_loss is also different due to the negative class. Also, spancat_exclusive has the negative_weight parameter, whereas spancat has threshold. So if SpanCategorizer implemented both, it would have a threshold that is unused for max_positive = 1, and a negative_weight that should only be used for max_positive = 1.
Should we merge the two implementations into a single SpanCategorizer first and see how that looks?
The spancat_exclusive.py file will be deleted if we agree on the design. I quite like that there is one SpanCategorizer and two factories.
Just a couple of small notes, not a thorough review yet.
I still need to add make_span_group_singlelabel and make_span_group_multilabel tests to see if they work in the add_negative_label case.
Encountered a bit of a problem. When add_negative_label = True, I need to access self._negative_label_index. But _make_span_group takes labels: List[str], so I don't know which one is supposed to be the negative label. In general, I'm not sure why the method ended up like this originally: why allow _make_span_group to take an arbitrary list as labels that doesn't actually have to match the scores and indices coming from the model?
I can probably work around it somehow, but I feel like what is already there is kind of clunky to begin with. I don't think we should use this arbitrary List[str]; it would be better to use the component's own self.labels or self.label_map. But maybe I'm missing some use case where this flexibility is nice? If so, why not have make_span_group in utils take all the different kinds of arguments, but make sure that the SpanCategorizer calls it with something meaningful? But then this would be breaking :( .
I think it makes sense to refactor _make_span_group to remove the labels argument.
> I think it makes sense to refactor _make_span_group to remove the labels argument.
Okay, that's nice! I'm getting somewhere with it, but in the meantime I also realized that I don't necessarily understand the reason for having the max_positive argument in the multilabel case either. We already have threshold. So why would you say "keep only the top-3 labels that pass the threshold"? Why not all labels that pass the threshold?
This came up because, in the add_negative_label case, if the negative_label is in the top 3 then the current code returns only the top 2, so I would then have to look for more options that pass the threshold. It just made me think: why not return all of them? What's the use case?
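For reference, the multilabel selection being questioned boils down to roughly this (illustrative code, not the component's actual implementation):

```python
# One span's scores over the real labels, multilabel case.
span_scores = {"ORG": 0.81, "MISC": 0.64, "PER": 0.58, "LOC": 0.12}
threshold = 0.5
max_positive = 3

# Keep every label that passes the threshold...
passing = sorted(
    (label for label, score in span_scores.items() if score >= threshold),
    key=lambda label: span_scores[label],
    reverse=True,
)
# ...but then truncate to the top max_positive, which is the part being questioned.
if max_positive is not None:
    passing = passing[:max_positive]

print(passing)  # ['ORG', 'MISC', 'PER']
```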
I think the last thing for the docs is to figure out how to mark all the new things clearly as new.
> I think the last thing for the docs is to figure out how to mark all the new things clearly as new.
How do we usually do that? We add an indication of which version a feature is available from, right?
Yes, there would be new tags for 3.5.1. What I'm not sure about is exactly how best to mark spancat_singlelabel on the combined API docs page.
I added the <Tag variant="new">3.5.1</Tag> in some places where I felt it communicates more or less unambiguously what is new, but I wasn't exactly sure.
The markers should go next to the setting name in the settings column rather than in the descriptions.
Could you also revert all the formatting-only changes to the other .mdx files?
I think this is good to go!