
Add spancat_exclusive pipeline for non-overlapping span labelling tasks

ljvmiranda921 opened this issue 1 year ago • 1 comment

Context

The current spancat implementation always treats the span labelling task as a multilabel problem: it uses a Logistic layer to output class probabilities independently for each class. However, when presented with a multiclass problem (mutually exclusive classes), not using the correct modeling assumptions can be a disadvantage. The spancat_exclusive component uses a Softmax layer instead.
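To make the distinction concrete, here is a minimal numpy sketch (not spaCy/thinc code) of the two output activations: a per-class sigmoid, as used by the Logistic layer, yields independent probabilities, while a softmax yields a distribution over mutually exclusive classes.

```python
import numpy

def sigmoid(x):
    # Independent per-class probabilities (multilabel assumption).
    return 1.0 / (1.0 + numpy.exp(-x))

def softmax(x):
    # Mutually exclusive classes: each row is a probability distribution.
    e = numpy.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Raw scores for 2 spans over 3 labels.
logits = numpy.array([[2.0, -1.0, 0.5],
                      [0.1, 0.1, 0.1]])

multilabel = sigmoid(logits)   # rows need not sum to 1
multiclass = softmax(logits)   # each row sums to 1
```

With exclusive classes, the softmax rows sum to 1, so boosting one label's probability necessarily lowers the others; the sigmoid scores have no such coupling.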

Description

This PR adds another pipeline, spancat_exclusive, to account for exclusive classes in span categorization tasks. It does this by introducing the concept of a "negative label" or "no label." In spancat, the number of span labels is exactly the number found in a dataset's annotations; here, we add one more column to account for the negative label.

  • We didn't touch the add_label implementation; it is the same as spancat's. Instead, we implemented two additional properties, _negative_label (returns the index of the negative label) and _n_labels (returns the length of label_data + 1), and changed initialize to create a Softmax layer with the extra negative label.

  • This in turn affects how the annotations are created during inference (cf. set_annotations). Accordingly, we modified _make_span_group to accommodate this change.
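As a rough sketch (not the actual spaCy source), the two helper properties described above could look like the following, assuming the labels live in a cfg dict as in spancat:

```python
class SpanCategorizerExclusive:
    """Simplified stand-in for the component; only the label bookkeeping."""

    def __init__(self, labels):
        self.cfg = {"labels": list(labels)}

    @property
    def labels(self):
        return tuple(self.cfg["labels"])

    @property
    def _negative_label(self):
        # The negative ("no label") class always sits in the last column,
        # i.e. at index len(labels).
        return len(self.labels)

    @property
    def _n_labels(self):
        # One extra output column for the negative label.
        return len(self.labels) + 1

spancat = SpanCategorizerExclusive(["ORG", "MISC", "PER", "LOC"])
```

With four annotated labels, the Softmax output layer would then be built with five columns, the fifth being the negative class.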

Technical Explanation of Changes

⏯️ Training: how is the loss computed this time? (also a note about the negative_weight param)

During initialization, we pass the number of labels (n_labels + 1) so that the score matrix has shape (n_samples, n_labels + 1), where the +1 accounts for the negative label. At train time, the score matrix already accounts for the negative label; in this implementation, the negative label is always the last column.

Figure: simple example using CoNLL labels (ORG, MISC, PER, LOC)

In the get_loss() function, we then assign the value 1.0 to the last column whenever a particular span is a negative example.

```python
# spancat_exclusive.py::SpanCategorizerExclusive.get_loss()
target = self.model.ops.asarray(target, dtype="f")  # type: ignore
negative_samples = numpy.nonzero(negative_spans)[0]
target[negative_samples, self._negative_label] = 1.0
```

We then compute the scores and the loss for backprop as usual (i.e., d_scores = scores - target). We also added an option to specify a negative_weight to control the effect of the negative class (a form of class weighting): values greater than 1 increase the effect of the negative class, while values less than 1 reduce it.
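One plausible way to apply such a weight (a sketch, not the PR's actual code) is to scale the gradient rows belonging to negative examples:

```python
import numpy

def get_d_scores(scores, target, negative_label, negative_weight):
    """Gradient of the softmax cross-entropy loss, with the rows whose
    true class is the negative label scaled by negative_weight."""
    d_scores = scores - target
    # Rows where the gold label is the negative class.
    neg_mask = target[:, negative_label] == 1.0
    d_scores[neg_mask] *= negative_weight
    return d_scores

# Two spans over 2 labels + 1 negative column (index 2).
scores = numpy.array([[0.7, 0.2, 0.1],
                      [0.2, 0.3, 0.5]])
target = numpy.array([[1.0, 0.0, 0.0],   # positive example, label 0
                      [0.0, 0.0, 1.0]])  # negative example
d = get_d_scores(scores, target, negative_label=2, negative_weight=2.0)
```

Here the second row's gradient is doubled, so mistakes on negative examples pull on the weights twice as hard; negative_weight=1.0 recovers the plain d_scores = scores - target.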

⏯️ Inference: how are the annotations predicted? (also a note about the allow_overlap param)

During inference, we drop the spans for which the negative label is the prediction. In addition, if the allow_overlap parameter is set to False, overlapping spans are not stored (only the span with the highest predicted probability is kept). This is tracked by the Ranges data structure.
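The filtering described above can be sketched as follows; this Ranges class is a simplified stand-in for the structure mentioned, and filter_spans is a hypothetical helper, not the component's actual method:

```python
import numpy

class Ranges:
    """Tracks claimed token intervals so later spans can check overlap."""

    def __init__(self):
        self.ranges = []

    def add(self, start, end):
        self.ranges.append((start, end))

    def has_overlap(self, start, end):
        return any(s < end and start < e for s, e in self.ranges)

def filter_spans(spans, scores, negative_label, allow_overlap=False):
    """spans: list of (start, end); scores: (n_spans, n_labels + 1)."""
    preds = scores.argmax(axis=1)
    # Visit spans from highest to lowest best score, so the strongest
    # span wins any overlap conflict.
    order = scores.max(axis=1).argsort()[::-1]
    seen = Ranges()
    kept = []
    for i in order:
        if preds[i] == negative_label:
            continue  # drop spans predicted as the negative class
        start, end = spans[i]
        if not allow_overlap and seen.has_overlap(start, end):
            continue  # a higher-scoring span already claimed this range
        seen.add(start, end)
        kept.append((start, end, int(preds[i])))
    return kept

spans = [(0, 2), (1, 3), (5, 6)]
scores = numpy.array([[0.90, 0.05, 0.05],   # label 0, overlaps the next span
                      [0.60, 0.30, 0.10],   # label 0, lower score
                      [0.10, 0.10, 0.80]])  # predicted negative (index 2)
kept = filter_spans(spans, scores, negative_label=2, allow_overlap=False)
```

In this example the (5, 6) span is dropped as negative, and of the two overlapping spans only the higher-scoring (0, 2) survives; with allow_overlap=True both would be kept.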

⏯️ Testing on other datasets [WIP]

TODO - Compare spancat and spancat_exclusive on some datasets

Types of change

  • Feature implementation
  • New pipeline for exclusive spancat
  • Tests and documentation

Checklist

  • [ ] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [ ] I ran the tests, and all new and existing tests passed.
  • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

— ljvmiranda921, Aug 24 '22

I think implementation-wise this PR can be reviewed. We'd definitely still want to run a few experiments comparing spancat_exclusive and spancat on a number of datasets. I'm not sure if I should do that first, or perhaps you want to take a look at the current code before we run experiments. cc: @kadarakos @adrianeboyd

— ljvmiranda921, Sep 08 '22