
Cross entropy fix

Open kadarakos opened this issue 2 years ago • 10 comments

This PR makes the CategoricalCrossentropy loss stricter, only allowing guesses and truths that represent mutually exclusive classes, and it fixes the computation of the loss value. It also updates all the tests that involve CategoricalCrossentropy and SequenceCategoricalCrossentropy.

kadarakos avatar Apr 28 '22 16:04 kadarakos

> This PR makes the CategoricalCrossentropy loss stricter, only allowing guesses and truths that represent mutually exclusive classes

Why can't / shouldn't it (also) support multi-label classification?

svlandeg avatar May 24 '22 16:05 svlandeg

> This PR makes the CategoricalCrossentropy loss stricter, only allowing guesses and truths that represent mutually exclusive classes
>
> Why can't / shouldn't it (also) support multi-label classification?

I was thinking that I would have a separate loss for that, for two main reasons.

The simple one is just that the loss computation itself is different: if the rows add up to 1, then it's computed the way it's done now; otherwise it would have to compute the binary cross-entropy. The gradient is the same in both cases, that's true.

The other main reason was the confusing usage pattern, from my perspective. It takes Ints1d, List[str], List[int] or Floats2d as truths, and it also does label smoothing. The first three input types and label smoothing don't make sense for the multi-label binary cross-entropy case. The only way to make it compute binary cross-entropy is to give it Floats2d with label smoothing disabled; in all other configurations it computes the categorical loss.

I just thought it's nicer to implement binary cross-entropy separately, so that it can take kinds of truths that are more appropriate for multiple labels per sample, has its own docs, and always computes the same loss. For example, it could take List[Dict[str, float]] (textcat-multilabel could directly pass this down, I think), or List[List[int]] / List[List[str]].
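The distinction between the two losses can be sketched in a few lines of numpy (hypothetical helper names, not thinc's actual API):

```python
import numpy as np

def categorical_crossentropy(guesses, truths):
    # Exclusive classes: each row of truths is a distribution summing to 1.
    return -np.sum(truths * np.log(guesses), axis=1)

def binary_crossentropy(guesses, truths):
    # Multi-label: each column is an independent Bernoulli variable,
    # so rows of truths need not sum to 1.
    return -np.sum(
        truths * np.log(guesses) + (1 - truths) * np.log(1 - guesses), axis=1
    )

def grad(guesses, truths):
    # For softmax + categorical CE and for sigmoid + binary CE alike, the
    # gradient w.r.t. the pre-activation logits is (guesses - truths) --
    # which is why only the loss *value* differs between the two cases.
    return guesses - truths

guesses = np.array([[0.7, 0.2, 0.1]])
truths_exclusive = np.array([[1.0, 0.0, 0.0]])   # rows sum to 1
truths_multilabel = np.array([[1.0, 1.0, 0.0]])  # rows need not sum to 1

loss_cat = categorical_crossentropy(guesses, truths_exclusive)
loss_bin = binary_crossentropy(guesses, truths_multilabel)
```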

kadarakos avatar May 25 '22 07:05 kadarakos

Thanks for the additional explanation, and agreed!

svlandeg avatar May 25 '22 07:05 svlandeg

> The simple one is just that the loss computation itself is different: if the rows add up to 1, then it's computed the way it's done now; otherwise it would have to compute the binary cross-entropy. The gradient is the same in both cases, that's true.

But is CategoricalCrossEntropy currently used anywhere for multi-label classification? If so, this change would break those cases? I'm not sure, I haven't really looked much yet at multi-label classification in Thinc/spaCy.

danieldk avatar May 25 '22 08:05 danieldk

I don't think it's being used in spaCy right now, but it should be. And when we do, I'm proposing we work via specific versions, e.g.

```python
loss_function = registry.get("losses", "CategoricalCrossentropy.v3")
```

instead of a direct import. This will make it easier to keep the same behaviour for people who want it, while allowing others to upgrade if they choose to.

It might still be that other users of Thinc are using the current CategoricalCrossentropy and would get bitten by a change in numerical values when they're logging results across training runs, or get bitten because the functionality is now stricter, as Daniël mentions. Then we'd be able to present the above solution to them to make sure they have access to the same functionality.

The fact that we need to change the tests here is a huge red flag that this can be very breaking, which is why I'm proposing we lean into the framework of versioning that we've already set up anyway. Then we can keep the tests largely the same for the old functionality and write new ones for the new functionality.
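A minimal sketch of how version-pinned lookup keeps old behaviour available. The registry below is a stand-in with hypothetical factory names; thinc's real registry is backed by the catalogue package:

```python
# Toy version-pinned registry: configs name an exact "<loss>.vN" string,
# so upgrading the library doesn't silently change the numbers a
# training run logs.
LOSSES = {}

def register(name):
    def wrapper(func):
        LOSSES[name] = func
        return func
    return wrapper

@register("CategoricalCrossentropy.v2")
def make_ce_v2(**kwargs):
    return ("v2-loss", kwargs)   # stand-in for the old, permissive behaviour

@register("CategoricalCrossentropy.v3")
def make_ce_v3(**kwargs):
    return ("v3-loss", kwargs)   # stand-in for the stricter, fixed behaviour

def get(registry_name, func_name):
    # Mirrors the registry.get("losses", name) call style used above.
    assert registry_name == "losses"
    return LOSSES[func_name]

loss_function = get("losses", "CategoricalCrossentropy.v3")
```

Users who need the old numbers keep pinning `.v2` in their configs; new projects pick up `.v3`.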

svlandeg avatar May 25 '22 09:05 svlandeg

> I don't think it's being used in spaCy right now, but it should be. And when we do, I'm proposing we work via specific versions, e.g.
>
> ```python
> loss_function = registry.get("losses", "CategoricalCrossentropy.v3")
> ```
>
> instead of a direct import. This will make it easier to keep the same behaviour for people who want it, while allowing others to upgrade if they choose to.
>
> It might still be that other users of Thinc are using the current CategoricalCrossentropy and would get bitten by a change in numerical values when they're logging results across training runs, or get bitten because the functionality is now stricter, as Daniël mentions. Then we'd be able to present the above solution to them to make sure they have access to the same functionality.
>
> The fact that we need to change the tests here is a huge red flag that this can be very breaking, which is why I'm proposing we lean into the framework of versioning that we've already set up anyway. Then we can keep the tests largely the same for the old functionality and write new ones for the new functionality.

I added a suggestion for handling legacy stuff in general. I didn't follow your suggestion exactly, but let me know if you'd like that better. Or maybe I did follow your suggestion after all, I'm not sure!

The idea is that there is thinc.legacy, as you said. At the moment thinc.legacy only has loss.py, meaning that thinc.loss should import from thinc.legacy.loss if it wants to run older versions. The other thing I'm suggesting here is that when there is a function called func or a class called Class, the import should be from thinc.legacy import legacy_func, LegacyClass. The legacy factories then return the legacy_ or Legacy versions of the functions or classes defined in the module.

This is what it does atm and I call this the "(c)import your own legacy :crossed_swords: " solution :D.

For the tests I've made a separate module, thinc.tests.legacy, which again replicates the corresponding tests and imports from thinc.legacy. In this case it's just a copy of the old test_loss.py, with the tests for which no legacy versions exist removed.
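The layout described above might look roughly like this; all names are hypothetical sketches of the pattern, not the actual thinc code:

```python
# --- thinc/legacy/loss.py (sketch) ---
class LegacyCategoricalCrossentropy:
    """Old behaviour, frozen verbatim so pinned .v1/.v2 configs keep working."""
    version = "legacy"

# --- thinc/loss.py (sketch) ---
# thinc.loss imports the Legacy* names from thinc.legacy when it needs
# to serve older versions, e.g.:
#   from thinc.legacy.loss import LegacyCategoricalCrossentropy
class CategoricalCrossentropy:
    """New, stricter behaviour served by the latest factory version."""
    version = "v3"

def configure_CategoricalCrossentropy_v2():
    # Legacy factory: returns the Legacy* class from thinc.legacy.
    return LegacyCategoricalCrossentropy()

def configure_CategoricalCrossentropy_v3():
    # Current factory: returns the new implementation.
    return CategoricalCrossentropy()
```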

kadarakos avatar May 25 '22 12:05 kadarakos

We discussed this a while back, but I think this PR would be a good opportunity to replace the overloaded CategoricalCrossEntropy class by splitting the non-legacy classes into two variants, SparseCategoricalCrossEntropy and CategoricalCrossEntropy (and their sequence counterparts), like some other libraries (e.g. Torch) do. This simplifies the internals a lot, since you don't need a big decision tree for 1D vs. 2D input.

Also, some features can be specific to one variant. I still think one-hot label smoothing for 2D arrays has possible edge cases, and users should just use sparse label representations for one-hot label smoothing [1]. If we have a separate class, we can prevent users from accidentally enabling label smoothing simply by not providing that kwarg.

[1] I think label smoothing might be useful for 2D arrays, but we shouldn't restrict it to one-hot arrays. I can imagine scenarios where the annotations say that with probability 0.6 it is label A and with probability 0.4 it is label B. Then label smoothing would be a method to increase the entropy of the distribution (where the one-hot case is just a specialization of the general method).
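One general form of that method, mixing each row with the uniform distribution, can be sketched as follows (a hypothetical helper, not thinc's implementation):

```python
import numpy as np

def smooth_distribution(probs, eps):
    # General label smoothing: blend each row with the uniform distribution
    # over the classes, raising its entropy. Smoothing a one-hot vector is
    # just the special case where the input row has all mass on one class.
    n_classes = probs.shape[1]
    return (1.0 - eps) * probs + eps / n_classes

one_hot = np.array([[1.0, 0.0, 0.0]])
soft = np.array([[0.6, 0.4, 0.0]])  # annotator uncertainty: 0.6 A / 0.4 B

smoothed_one_hot = smooth_distribution(one_hot, 0.1)  # ~[[0.933, 0.033, 0.033]]
smoothed_soft = smooth_distribution(soft, 0.1)        # rows still sum to 1
```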

danieldk avatar Jun 17 '22 06:06 danieldk

> We discussed this a while back, but I think this PR would be a good opportunity to replace the overloaded CategoricalCrossEntropy class by splitting the non-legacy classes into two variants, SparseCategoricalCrossEntropy and CategoricalCrossEntropy (and their sequence counterparts), like some other libraries (e.g. Torch) do. This simplifies the internals a lot, since you don't need a big decision tree for 1D vs. 2D input.
>
> Also, some features can be specific to one variant. I still think one-hot label smoothing for 2D arrays has possible edge cases, and users should just use sparse label representations for one-hot label smoothing [1]. If we have a separate class, we can prevent users from accidentally enabling label smoothing simply by not providing that kwarg.
>
> [1] I think label smoothing might be useful for 2D arrays, but we shouldn't restrict it to one-hot arrays. I can imagine scenarios where the annotations say that with probability 0.6 it is label A and with probability 0.4 it is label B. Then label smoothing would be a method to increase the entropy of the distribution (where the one-hot case is just a specialization of the general method).

I used to be against the idea of adding this to the current PR, but since, as you say, we are already working out the legacy handling and the PR is huge to begin with, I think it does make sense to attempt the Sparse split.

kadarakos avatar Jun 17 '22 07:06 kadarakos

@danieldk I was looking at the new version of the cross-entropy that handles Ragged as input. I think it makes a lot of sense, but I'm wondering if it's a good idea to then support sequence types like Ragged, Padded and List[Floats2d], because I'm confused about why the target would always be Floats2d. Why would the user create one big Floats2d for the labels, but have the data in Ragged, Padded or List[Floats2d]?

For Padded it gets even weirder with Floats2d as the target, because Padded itself is sorted by length, so the Floats2d would have to be in that order as well, I guess. I think it's overcomplicating things a bit: if we allow all these types for the guesses, it's weird not to allow the same for the target, and for List[Floats2d] and Padded the users would need to implement some smart logic to create the Floats2d target.

Update: Discussed with @danieldk in chat to only support Ragged as guesses for now and see if we can get benefits from this in spaCy before moving forward.
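A rough numpy sketch (not thinc's actual API) of why a single Floats2d target lines up naturally with Ragged guesses: a Ragged is essentially one concatenated 2D array plus per-sequence lengths, so the target can simply be row-aligned with that concatenated array, with no padding or length-sorting to undo:

```python
import numpy as np

# List[Floats2d]: two sequences of lengths 3 and 2, four classes each.
seqs = [np.random.rand(3, 4), np.random.rand(2, 4)]

# Ragged-style representation: one flat data array plus the lengths.
data = np.concatenate(seqs)    # shape (5, 4), rows in document order
lengths = np.array([3, 2])

# A single Floats2d target aligns row-for-row with `data`. With Padded
# (length-sorted, padded to the longest sequence) or List[Floats2d], the
# user would first have to flatten and reorder to build a matching target.
truths = np.zeros_like(data)
assert truths.shape[0] == lengths.sum()
```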

kadarakos avatar Sep 01 '22 09:09 kadarakos

Adding this note here so that we don't forget - this PR should be rebased on/merged into v9.

shadeMe avatar Sep 14 '22 15:09 shadeMe