thinc
Cross entropy fix
This PR makes the `CategoricalCrossentropy` loss more strict, only allowing `guesses` and `truths` that represent exclusive classes, and it fixes the computation of the loss value. It also changes all the tests that involve `CategoricalCrossentropy` and `SequenceCategoricalCrossentropy`.
> This PR makes the `CategoricalCrossentropy` loss more strict, only allowing `guesses` and `truths` that represent exclusive classes

Why can't / shouldn't it (also) support multi-label classification?
I was thinking that I would have a separate loss for that, for two main reasons.

The simple one is just that the loss computation itself is different: if the rows add up to 1 it works the way it does now; otherwise it would have to compute the binary cross-entropy. The gradient is the same, though, that's true.

The other main reason is the confusing usage pattern, from my perspective. It takes `Ints1d`, `List[str]`, `List[int]` or `Floats2d` as `truths`, and it also does label smoothing. The first three types of input, as well as label smoothing, don't make sense for the multi-label binary cross-entropy case. The only way to make it compute binary cross-entropy is to give it `Floats2d` with label smoothing disabled; in all other configurations it does categorical.

I just thought it's nicer to implement a binary cross-entropy separately that can take different kinds of `truths` that are more appropriate for multiple labels per sample, has its own docs, and always computes the same loss. For example, it could take a `List[Dict[str, float]]` (`textcat-multilabel` could directly pass this one down, I think), a `List[List[int]]`, or a `List[List[str]]`.
Thanks for the additional explanation, and agreed!
> The simple one is just that the loss computation itself is different: if the rows add up to 1 it works the way it does now; otherwise it would have to compute the binary cross-entropy. The gradient is the same, though, that's true.
But is `CategoricalCrossentropy` currently used anywhere for multi-class classification? If so, would this change break those cases? I'm not sure; I haven't really looked much yet at multi-class classification in Thinc/spaCy.
I don't think it's being used in spaCy right now, but it should be. And when we do, I'm proposing we work via specific versions, e.g.

`loss_function = registry.get("losses", "CategoricalCrossentropy.v3")`

instead of a direct import. This will make it easier to keep the same behaviour for people who want it, while allowing others to upgrade if they choose to.

It might still be that other users of Thinc are using the current `CategoricalCrossentropy` and would get bitten by a change in numerical values when they're logging results across training runs, or get bitten because the functionality is now more strict, as Daniël mentions. Then we'd be able to present the above solution to them to make sure they have access to the same functionality.

The fact that we need to change the tests here is a huge red flag that this can be very breaking, which is why I'm proposing we lean into the framework of versioning that we've already set up anyway. Then we can keep the tests largely the same for the old functionality and write new ones for the new functionality.
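The versioning idea can be sketched with a plain dict standing in for Thinc's losses registry; the factory names and behaviours below are illustrative placeholders, only the `"CategoricalCrossentropy.v3"` lookup string comes from the proposal above:

```python
# Minimal sketch of versioned loss factories. A plain dict stands in for
# Thinc's registry; the factory return values are placeholders, not real code.
LOSSES = {}

def register_loss(name):
    """Register a factory under a versioned name like 'Foo.v3'."""
    def wrap(fn):
        LOSSES[name] = fn
        return fn
    return wrap

@register_loss("CategoricalCrossentropy.v2")
def make_loss_v2():
    # Old behaviour stays reachable for users who pin the old version.
    return "permissive inputs, old loss value"

@register_loss("CategoricalCrossentropy.v3")
def make_loss_v3():
    # New behaviour: strict exclusive-class inputs, fixed loss value.
    return "strict inputs, fixed loss value"

def get(registry_name, versioned_name):
    # Mirrors the shape of registry.get("losses", "CategoricalCrossentropy.v3")
    assert registry_name == "losses"
    return LOSSES[versioned_name]

loss_function = get("losses", "CategoricalCrossentropy.v3")
```

The point of the indirection is that pinning `.v2` keeps old numerical behaviour reproducible across training runs, while `.v3` opts in to the fix.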
I added a suggestion for handling legacy stuff in general. I didn't exactly follow your suggestion, but let me know if you'd like that better. (Maybe I did follow your suggestion, though, I'm not sure.)

The idea is that there is `thinc.legacy`, as you've said. At the moment `thinc.legacy` only has `loss.py`, meaning that `thinc.loss` should import from `thinc.legacy.loss` if it would like to run older versions. The other thing I'm suggesting here is that when there is a function called `func` or a class called `Class`, the import should be `from thinc.legacy import legacy_func, LegacyClass`. The legacy factories then return the `legacy_`- or `Legacy`-prefixed versions of the functions or classes defined in the module.

This is what it does atm, and I call this the "(c)`import` your own legacy :crossed_swords:" solution :D.
For tests I've made a separate module, `thinc.tests.legacy`, which again replicates the corresponding tests and imports from `thinc.legacy`. In this case it's just a copy of the old `test_loss.py`, with the tests for which legacy versions do not exist removed.
We discussed this a while back, but I think this PR would be a good opportunity to replace the overloaded `CategoricalCrossEntropy` class by splitting the non-legacy classes into two variants, `SparseCategoricalCrossEntropy` and `CategoricalCrossEntropy` (and their sequence counterparts), like some other libraries (e.g. Torch) do. This simplifies the internals a lot, since you don't need a big decision tree for 1D vs. 2D input.

Also, some features can be specific to one variant. I still think one-hot label smoothing for 2D arrays has possible edge cases, and users should just use sparse label representations for one-hot label smoothing [1]. If we have a separate class, we can prevent users from accidentally enabling label smoothing by simply not providing that kwarg.

[1] I think label smoothing might be useful for 2D arrays, but we shouldn't restrict it to one-hot arrays. I can imagine scenarios where the annotations say that a sample is label A with probability 0.6 and label B with probability 0.4. Label smoothing would then be a method to increase the entropy of the distribution (where the one-hot case is just a specialization of the general method).
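The split can be illustrated with a plain-numpy sketch (not Thinc's API): the sparse variant takes integer class ids, the dense one takes full distributions, and only the dense one ever has to reason about non-one-hot targets or distribution-level smoothing:

```python
import numpy as np

def sparse_categorical_crossentropy(probs, class_ids):
    # probs: (n_samples, n_classes), rows summing to 1.
    # class_ids: (n_samples,) integer labels, one exclusive class per row.
    rows = np.arange(probs.shape[0])
    return float(-np.log(probs[rows, class_ids]).mean())

def categorical_crossentropy(probs, targets):
    # targets: (n_samples, n_classes) distributions, e.g. [0.6, 0.4, 0.0];
    # this is the variant where smoothing over distributions would live.
    return float(-(targets * np.log(probs)).sum(axis=1).mean())

probs = np.array([[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]])
sparse_loss = sparse_categorical_crossentropy(probs, np.array([0, 1]))
dense_loss = categorical_crossentropy(probs, np.eye(3)[[0, 1]])
# For one-hot dense targets the two losses agree; with two classes
# neither function needs to inspect the other's input shape.
```

With the classes separated, neither code path needs the 1D-vs-2D decision tree, and kwargs like label smoothing can live only on the variant where they make sense.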
I used to be against the idea of adding this to the current PR, but since, as you say, we are already working out the legacy handling and the PR is huge to begin with, I think it does make sense to attempt the `Sparse` split.
@danieldk I was looking at the new version of the cross-entropy that handles `Ragged` as input. I think it makes a lot of sense, but I'm wondering whether it's a good idea to then support sequence types like `Ragged`, `Padded` and `List[Floats2d]`, because I'm confused about why the `target` would then always be `Floats2d`. Why would the user create one big `Floats2d` for the labels, but have the data in `Ragged`, `Padded` or `List[Floats2d]`?

For `Padded` it gets even weirder, I think, with the `Floats2d` as input: the `Padded` itself is sorted by length, so the `Floats2d` should also be in that order, I guess? I think it's over-complicating things a bit, because if we allow all these types for `guesses`, then it's weird to not have the same for `target`; and for `List[Floats2d]` and `Padded`, users would need to implement some smart logic to create the `Floats2d` target.
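The alignment concern can be illustrated in plain numpy (this is just the shape bookkeeping a user would face, not Thinc's API): with Ragged-style `guesses`, the `Floats2d` target has to be a concatenation of per-sequence labels in exactly the same row order as the flattened data block:

```python
import numpy as np

# Two sequences of lengths 2 and 1, stored Ragged-style: one flat 2d
# data block plus a lengths array.
guesses_data = np.array([[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]])
lengths = np.array([2, 1])

# The user has to flatten per-sequence one-hot truths into one Floats2d,
# keeping the same sequence order as the Ragged data block...
truths_per_seq = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # labels for sequence 1
    np.array([[1.0, 0.0]]),              # labels for sequence 2
]
truths_2d = np.concatenate(truths_per_seq)

# ...and only then do the rows line up with the guesses.
assert truths_2d.shape == guesses_data.shape
# For Padded, where sequences are sorted by length, the rows of the
# Floats2d target would additionally have to follow that sorted order.
```

The concatenation step is exactly the "smart logic" mentioned above that every user would otherwise have to reimplement.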
Update: discussed with @danieldk in chat; we'll only support `Ragged` as `guesses` for now and see if we can get benefits from this in spaCy before moving forward.
Adding this note here so that we don't forget: this PR should be rebased on/merged into `v9`.