skorch icon indicating copy to clipboard operation
skorch copied to clipboard

Support for multiple criteria?

Open taketwo opened this issue 6 years ago • 23 comments

At the moment skorch does not explicitly support fitting with respect to multiple loss functions. Of course, one can define a composite loss function that computes individual losses internally and outputs a sum, but this does not allow to monitor loss components individually through skorch callbacks and/or history.

Perhaps NeuralNet can be made to accept a list (or a dict) of criteria?

...
criteria={'rec': ReconstructionLoss, 'con': ConcentrationLoss},
...

And the report would look something like this:

  epoch    train_rec_loss    train_con_loss     dur
-------  ----------------    ---------------    ------
      1            1.4439             1.1001    4.1064

Or, alternatively, composite loss functions may output a dict with individual losses and validation/train_step() may take care of summing them up.

I was wondering if you have already considered adding such a feature? Do you see any major obstacles? If you can provide me with some guidance, I'd be interested to implement this.

taketwo avatar Jun 05 '18 15:06 taketwo

Yes, we thought about that but haven't reached a conclusion yet. You could post the gist of your current solution, maybe we can find a way to simplify it.

In general, we proceeded along the line of "simple things should be simple, complex things should be possible". With regard to your problem, one could argue that it is not be a "simple thing" anymore, so we have to ask whether we want to support it directly.

Coming to the "complex things should be possible" part, it is our philosophy to have many small but well documented methods on NeuralNet so that subclassing it should not be too hard. If, however, you find that there is room for improvement -- and I see that your use case is not trivial to implement -- we welcome PRs or change requests.

With that out of the way, if I had to implement the feature of supporting multiple criteria, this is what I would do:

  • allow a list of (name, criterion) tuples to be passed to the criterion parameter (analogous to sklearn pipeline steps and our callbacks)
  • it should work with all PyTorch criteria
  • it must be possible to set_params on individual criteria
  • it must be easy to assign specific module parameters to specific criteria
  • there must be a good solution for aggregating the criteria (weighting?)
  • bonus points if it's easy to switch on or off monitoring specific criteria in the history
  • everything should be backward compatible

If all of that is possible without unduly increasing code complexity, I could see us including such a feature, but it's certainly no easy task. The advantage would be that, if well implemented, it is easier for the user and more flexible, but I'm afraid that the code could get very complex.

benjamin-work avatar Jun 06 '18 08:06 benjamin-work

Thanks for taking time to consider this. Your list of requirements makes a lot of sense and indeed suggests that the implementation might get very complex.

Thinking more about this, I realized that as complex as it is already, this "declarative" approach to specify a composite loss function is not flexible enough (at least for my current use case). I have a network with multiple outputs, and each individual loss function is concerned with a subset of them. Thus some additional logic is needed to dissect the network output tuple and feed it to the individual losses as appropriate. Doing this in a declarative way would make things even more complicated, and modifying individual losses to be able to accept entire network output tuple and only use what's needed would kill the purpose of having this declarative specification in the first place. Further, adding support to specify aggregation method and it's parameters (weights) will add even more complexity.

With this in mind, I think that the declarative approach is not good here. Besides, it contributes little to address my original "feature request" (which was to enable monitoring the individual losses, not to simplify loss function composition). Thus I think the alternative approach that I proposed has more potential.

Specifically, a user should implement a custom criterion class that incapsulates the individual losses and the aggregation logic and can deal with multi-output network (if needed). In addition to returning a single loss number it would output a dict of loss components. The functions on the NeuralNet side would recognize this case and record these components to the history. Also the callbacks will recognize when history has complex loss and output its components separately.

This requires considerably less modifications to NeuralNet. What do you think?

taketwo avatar Jun 06 '18 13:06 taketwo

The naïve approach would be to define two criteria and overwrite get_loss to log the individual loss values:

    def get_loss(self, y_pred, y_true, **kwargs):
        y_pred_a, y_true_a = y_pred[:, 0], y_true[:, 0]
        y_pred_b, y_true_b = y_pred[:, 1], y_true[:, 1]

        loss_a = self.criterion_a_(y_pred_a, y_true_a)
        loss_b = self.criterion_b_(y_pred_b, y_true_b)

        self.history.record_batch('loss_a', loss_a)
        self.history.record_batch('loss_b', loss_b)

        return loss_a + loss_b

This has the drawback that you'd have to overwrite initialize_criterion and introduce an additional criterion parameter. But as you already said one could also write a custom criterion that returns a dictionary with multiple losses:

    def get_loss(self, y_pred, y_true, **kwargs):
        losses = self.criterion_(y_pred, y_true)

        self.history.record_batch('loss_a', losses['a'])
        self.history.record_batch('loss_b', losses['b'])

        return losses['a'] + losses['b']

The only remaining problem with such an approach is that you need an additional scoring callback that logs the additional loss terms, e.g.

def loss_a_score(net, X=None, y=None):
    return net.history[-1, 'batches', -1, 'loss_a']
def loss_b_score(net, X=None, y=None):
    return net.history[-1, 'batches', -1, 'loss_b']

Net(callbacks=[
    BatchScoring(loss_a_socre, target_extractor=skorch.utils.noop),
    BatchScoring(loss_b_socre, target_extractor=skorch.utils.noop),
])

I think it would be helpful if we'd provide a convenience function to easily create such a loss scoring callback, e.g. skorch.utils.create_loss_scoring('train_loss', on_train=True).

ottonemo avatar Jun 06 '18 14:06 ottonemo

For me it does not feel right that get_loss() touches the history. This function is called from different steps (train, validation), depending on which the record name should be different. Further, the final summed loss is recorded in the fitting loop, would be strange if its components are recorded elsewhere.

I rather see something like this inside the train_step():

...
losses = self.get_loss(y_pred, yi, X=Xi, training=True)
loss = sum(losses.values()) if isinstance(losses, dict) else losses
loss.backward()
...
return {
    'loss': losses,
    'y_pred': y_pred,
}

The loss (now a dict) will be recorded as usual inside the fitting loop under train_loss and valid_loss respectively. The only remaining thing is to modify the default callbacks to be able to handle history entries that are dicts.

taketwo avatar Jun 06 '18 15:06 taketwo

In general, ottonemo's 2nd suggestion looks good enough to me for prototyping. For bringing your model to production, I see why you would want a more robust solution.

For me it does not feel right that get_loss() touches the history

That's a good thing, because it means skorch allows you to build an intuition about what should and what shouldn't be done :D

This function is called from different steps (train, validation)

This can be handled with the training argument.

The loss (now a dict) will be recorded as usual inside the fitting loop under train_loss and valid_loss respectively. The only remaining thing is to modify the default callbacks to be able to handle history entries that are dicts.

I don't like this too much because everybody now has to deal with the possibility that train_loss/valid_loss are dicts, not only the default callbacks.

I have another suggestion, though, that I'd like to discuss. Perhaps we could change history such that if I update it with a dict, it unpacks the dict. Roughly:

history  # [{'train_loss': 0.5, 'valid_loss': 0.6}]
history.record({'score_a': 1, 'score_b': 2}]
history  # [{'train_loss': 0.5, 'valid_loss': 0.6, 'score_a': 1, 'score_b': 2}]

This would also require history to be able to extract torch values automatically (see this line) since we would no longer be able to call step['loss'].data.item().

The good thing about this would be that the consumers of history down the line don't need to adjust and we don't need to litter fit_loop with if ... else. The ugly thing about this solution is that it would require loss to always be a dict, and there would be no clear behavior for this case:

history.record('score', {'score_a': 1, 'score_b': 2}]

benjamin-work avatar Jun 07 '18 08:06 benjamin-work

@taketwo

This function is called from different steps (train, validation), depending on which the record name should be different.

This is certainly doable since get_loss has a training parameter which is set to True when called from train_step and False otherwise:

    def get_loss(self, y_pred, y_true, training=True, **kwargs):
        losses = self.criterion_(y_pred, y_true)

        name_suffix = '_train' if training else '_valid'
        self.history.record_batch('loss_a' + name_suffix, losses['a'])
        self.history.record_batch('loss_b' + name_suffix, losses['b'])

        return losses['a'] + losses['b']

@benjamin-work

I have another suggestion, though, that I'd like to discuss. Perhaps we could change history such that if I update it with a dict, it unpacks the dict. Roughly:

This does not regard train/validation though, the two history entries would share their name. We would need to add a suffix to the names depending on the step we are in, e.g.:

# in train_step:
history.record({(k + '_train'): v for k, v in losses})
# in validation_step:
history.record({(k + '_valid'): v for k, v in losses})

Since we'd have to iterate over the values anyway we could call .item() there as well.

ottonemo avatar Jun 07 '18 09:06 ottonemo

Since we'd have to iterate over the values anyway we could call .item() there as well.

'Tis true

benjamin-work avatar Jun 07 '18 11:06 benjamin-work

Thanks for the inputs. I see two discussion points here: a) how to record b) in which function to record. Let's deal with them separately.

How to record

I like Benjamin's idea of dictionary unpacking with automatic data extraction from tensors.

There would be no clear behavior for this case: history.record('score', {'score_a': 1, 'score_b': 2}]

I think it would be logical to adopt the following rule: if record() is called with a single value which is a dict, then this dict is merged into the history. If it is called with both attribute and a value which is a dict, then each of the dict keys is prefixed with the attribute before merging into the history. Example:

history  # [{'train_loss': 0.5, 'valid_loss': 0.6}]
history.record('train_loss', {'a': 1, 'b': 2})
history  # [{'train_loss': 0.5, 'valid_loss': 0.6, 'train_loss_a': 1, 'train_loss_b': 2}]

This fits nicely with @ottonemo's suggestion, simplifying history.record({(k + '_train'): v for k, v in losses}) to history.record('train_loss', losses).

The ugly thing about this solution is that it would require loss to always be a dict

Could you elaborate why?

In which function to record

This is certainly doable since get_loss has a training parameter

Agree. But I still don't see the benefit of manipulating history inside this function. A function named get_XYZ() should only compute XYZ and have no side-effects, i.e. be a pure function. What are your arguments against recording further down the road, in *_step() functions, or even further in fitting loop, as I suggested?

taketwo avatar Jun 07 '18 14:06 taketwo

I think it would be logical to adopt the following rule: ...

This looks promising. There might be a little problem, though, with the naming. Say my loss is {'loss': 1.23, 'other_loss': 4.56}, then the history would record it as {'train_loss_loss': 1.23, 'train_loss_other_loss': 4.56}. There would be no way to have my aggregate loss be called 'train_loss'.

The ugly thing about this solution is that it would require loss to always be a dict

Could you elaborate why?

This was for my first proposal, with the handling you proposed, it wouldn't be necessary.

But I still don't see the benefit of manipulating history inside this function

I don't think it would be a best practice. As mentioned earlier, this suggestion is rather meant for fast prototyping. We're not against having a cleaner solution down the line ;)

benjamin-work avatar Jun 07 '18 14:06 benjamin-work

There might be a little problem, though, with the naming. Say my loss is {'loss': 1.23, 'other_loss': 4.56}

I think the convention should be that the criterion class outputs either a single loss tensor, or a dictionary of individual losses (the keys should not have "loss" in the name). Then at the point where the history is recorded, one or two record() calls are made:

losses = self.get_loss(y_pred, yi, X=Xi, training=True)
net.history.record('train_loss', losses) 
if isinstance(losses, dict):
    aggregate_loss = sum(losses.values())
    net.history.record('train_loss', aggregate_loss)

taketwo avatar Jun 07 '18 14:06 taketwo

Okay, you suggest to move recording the losses to train_step/valid_step. I'm not quite sure if I like this. When overriding train_step, there would be even more things to think about. All other callback calls (except for on_grad_computed) are also in the fit_loop.

Then you propose to just sum the loss values. This could be too restrictive. I was thinking more along the lines of the criterion returning the aggregate loss using a specific key, and the individual loss components. That way, it is easier to control how the losses are aggregated.

Finally, I would really hope that we could get this line to work with both loss being a dictionary and a scalar:

self.history.record_batch('train_loss', step['loss'])

But right now I don't see how we could achieve that.

benjamin-work avatar Jun 08 '18 08:06 benjamin-work

Regarding the discussion where to place the loss logging I think we have to first make clear what the purpose of get_loss is. We refactored the computation of the loss to this method since it is used by both step functions. This was done to make the individual step functions shorter and easier to read. Therefore get_loss should be seen as an inline function, it exists to avoid code duplication and to be easily overwritten, not to provide a public API. It should be entirely possible to remove get_loss completely. Therefore I don't think the purity argument is particularly strong in this case (even though I generally agree with that argument).

I'm also in favor of keeping the housekeeping in the step functions at a minimum to make them easy to comprehend and overwrite, everything that is placed there should be meaningful and local. With locality I mean that when introducing code into the step functions we cannot rely on its side effects. Imagine we move the logging to the step functions, this would mean that we now rely on the logging code being present there. However since the step functions are meant to be overwritten this might have been done already in some code base somewhere which now, after upgrading, does not log losses anymore. Therefore adding side-effects here is problematic (on_gradients_computed is a good example for this).

Bottom line: logging additional things is OK since we don't (by default) rely on it later on but logging crucial things is not OK.

ottonemo avatar Jun 08 '18 10:06 ottonemo

Taking all your comments into account, here is my current proposal. I'll describe what changes are needed where, and then explain how this proposal addresses the points you've made.

  1. Criterion classes

By convention, the output of criterion classes should be either of:

  • tensor : loss (same as it is now)
  • tuple(tensor, dict) : aggregate loss and a dictionary with named components
  1. get_loss() function

Supports both types of output from criterion classes. Homogenizes it to the tuple(tensor, dict), where dict can be empty for simple loss functions.

  1. step() functions

The only change is that loss.backward() is replaced with loss[0].backward() because it is a tuple now.

  1. fitting loop

Small simplification, history is recorded with self.history.record_batch('train_loss', step['loss']) call, as Benjamin hoped.

  1. History's record()

If the value is a dict, prefix with the attribute name (as per my previous proposal). If the dict is empty, ignore. If the value is a tuple, recursively call record() with it's components.

Discussion

I'm also in favor of keeping the housekeeping in the step functions at a minimum

No housekeeping in step functions.

Therefore get_loss should be seen as an inline function, it exists to avoid code duplication and to be easily overwritten

This makes it a good place to perform the tensor -> tuple(tensor, dict) transformation I described.

Then you propose to just sum the loss values. This could be too restrictive.

Agree. Criterion is now responsible for aggregation.

Finally, I would really hope that we could get this line to work with both loss being a dictionary and a scalar: self.history.record_batch('train_loss', step['loss'])

... and even being a tuple of a scalar and a dict.

taketwo avatar Jun 08 '18 19:06 taketwo

Overall, I like your suggestions, especially since it only requires minimal changes in train_step and fit_loop.

What I would discuss is whether we want to have criterion_ return either a scalar or a scalar + dict, since I'm not too keen on functions that return a variable amount of results. As an alternative, I would suggest to return either a scalar or a dict, with a special key in the dict indicating the total loss (maybe 'loss'). This key could then be popped and the rest stays as is.

If the value is a tuple, recursively call record() with it's components.

Probably the same behavior for tuples and lists.

All in all, I'm also a little bit afraid that History could become very complicated, but if that only applies to cases that users typically don't touch, it would be fine.

BenjaminBossan avatar Jun 10 '18 09:06 BenjaminBossan

Regarding get_loss

In terms of get_loss I see two proposals:

    def get_loss_A(self, y_true, y_pred, X, training):
        loss = self.criterion_(y_true, y_pred)
        return loss if isinstance(loss, tuple) else (loss, {})

    def get_loss_B(self, y_true, y_pred, X, training):
        loss = self.criterion_(y_true, y_pred)
        return loss if isinstance(loss, dict) else {'loss': loss}

I like that get_loss_A separates optional from necessary (the dict is purely for logging purposes, not functional). But I agree with @benjamin-work that it makes handling more complicated, especially when recording the values to the history (we would need to support tuple unpacking as well as dict unpacking). With get_loss_B we only need to support dict in the history recording methods.

Of course this results in the problem that

self.history.record_batch('train_loss', step['loss'])

would then log train_loss_loss and possibly other terms such as train_loss_aux_loss_1 to the history, right? We could call

self.history.record_batch('train', step['loss'])

instead but this looks intransparent because it assumes the user knows that a) step['loss'] is a dict and b) that dicts are special for the history. This feels a bit awkward. So in the end, get_loss_A might be the better solution.

Overall

I'm trying to evaluate the benefit of this change relative to the effort. In the end the main feature we discuss here is that get_loss returns a dictionary with additional loss terms that should be logged. We are essentially providing a canonical way to write:

name_suffix = '_train' if training else '_valid'
self.history.record_batch('loss_a' + name_suffix, losses['a'])

I see that this change helps when you have multiple criteria that all depend on y. But as soon as you have a loss term that depends, for example, on X you'd have to overwrite get_loss anyway and then you can also record the loss values yourself. Sure, you can then use the same facility to return a dict with the additional values but I'm really not sure if it is really worth the additional complexity.

ottonemo avatar Jun 11 '18 12:06 ottonemo

I also agree with Benjamin's point regarding functions with variable number of return values. However in this case benefits outweigh deficits I believe. Also, we call criterion only inside get_loss, so handling variable number of return variables has to be implemented only once. So I'm certainly in favor of get_loss_A.

But as soon as you have a loss term that depends, for example, on X you'd have to overwrite get_loss anyway

Some of my losses already depend on X and I solve this by making X one of the network's outputs. My criterion is a custom class anyway, so it knows how to dispatch outputs to individual losses (depending on whether they need X or not).

With our current solution pretty much all the added complexity goes to the history recording internals. Of course this does require effort to implement and test, but I believe from the user perspective it does not complicate anything, only adds value.

taketwo avatar Jun 11 '18 15:06 taketwo

OK, let's recap.

  • We support multiple criteria since we can define our own criterion which, in turn, may return an aggregation of multiple criteria
  • A criterion has no access to the history so we need to return the individual losses so they can be written to the history
  • We do this by returning the aggregated loss alongside a dictionary with the names and values of the individual losses
  • To reduce overhead in *_step functions and other functions that follow get_loss we move the complexity of handling this to history.record
  • To easily add loss parts to the print log we provide a utility function skorch.utils.create_loss_scoring('train_loss_unsupervised', on_train=True)

What needs to be done

  • Change get_loss to consistently return (loss, dict)
  • Add support for tuple and dict in history.record
  • Implement skorch.utils.create_loss_scoring('train_loss', on_train=True)

The currently proposed solution sketches like this:

def get_loss(...): # -> (float, dict)
    loss = self.criterion_(y_true, y_pred)
    return loss if isinstance(loss, tuple) else (loss, {})

def train_step(...):
    # ...
    loss, loss_parts = self.get_loss(...)
    loss.backward()
    # ...
    return (loss, loss_parts)

# history.record
def record(k, v):
    if isinstance(v, tuple):
        [record(k, n) for n in v]
    elif isinstance(k, dict):
        [record(k + '_' + vk for vk, vv in v.items()]
    else:
        # normal record

Did I get everything right?

Edit: I want to note that adding this functionality to the history will prevent us from writing such data structures to the history. Not sure if that is good.

ottonemo avatar Jun 12 '18 12:06 ottonemo

Great summary!

The currently proposed solution sketches like this

Benjamin proposed to allow unpacking both tuples and lists in record().

# history.record
def record(k, v):
    if isinstance(v, tuple) or isinstance(v, list):
        [record(k, n) for n in v]
    elif isinstance(v, dict):
        [record(k + '_' + vk for vk, vv in v.items()]
    else:
        # normal record

I want to note that adding this functionality to the history will prevent us from writing such data structures to the history. Not sure if that is good.

That's a valid concern. When I proposed to add an "events" list to the history, Benjamin's answer was

I could see this getting confusing when there are many events during an epoch, and would prefer one column per item. This also makes the history easier to parse.

So I got an impression that his idea is to keep history simple and flat. But if you have a different vision, then this last point should be seriously considered.

taketwo avatar Jun 12 '18 13:06 taketwo

I tried implementing this and I stumbled upon the following issues:

  1. Whatever train_step returns as a loss is used by step_fn closure of train_step which means that if the loss is a composite value it needs to be unpacked there as well (e.g. return step['loss'][0]) which is OK but feels in-transparent

  2. Intransparency is a general issue: loss[0].backward() is unexpected

  3. Having (loss, dict) in step["loss"] leaves the conversion from torch to python open, this would need to be done before passing the values history.record or otherwise we would have to implement type checking in history as well (similar to multi indexing), which I would like to avoid. This makes the call to history.record quite bloated:

history.record('train_loss', (step['loss'][0].item(), {k: v.item() for k, v in step['loss'][1].items()}))
# alternatively
history.record('train_loss', step['loss'][0].item())
history.record('train_loss', {k: v.item() for k, v in step['loss'][1].items()}))

Maybe something like this but I'm still not too happy about this:


def train_step(...):
    loss, loss_parts = self.get_loss(y_pred, yi, X=Xi, training=True)
    return {
        'loss': loss,
        'parts': loss_parts,
        'y_pred': y_pred,
        }

def fit_loop(...):
    # ...
                self.history.record_batch('train_loss', step['loss'].item())
                self.history.record_batch('train_loss', step.get('parts', {}))
    # ...

loss_parts could then be already converted to numpy. However if we'd introduce a feature like this I think the loss parts should be usable for back-propagation as well (which would not be possible if we expect the values to be numpy floats).

We could also think about wrapping the values in a transparent container object which is recognized by the history but this can lead to very bizarre errors (also, the loss wouldn't be a tensor anymore).

ottonemo avatar Jun 12 '18 16:06 ottonemo

We would have to implement type checking in history as well (similar to multi indexing), which I would like to avoid.

Why not? Nobody should ever write tensors to the history. If somebody does so, then it's a mistake. (GPU memory gets blocked, eventually leading to out-of-memory error.) If type checking/conversion is built-in to the history, that's an additional safety net, isn't it a good thing?

:+1: for your last proposal. The last hunk can be reduced to:

    self.history.record_batch('train_loss', step['loss'], step['parts'])

assuming that conversion is implemented, empty dicts are ignored, and record_batch() can accept multiple values.

taketwo avatar Jun 13 '18 14:06 taketwo

Any news on this? That would be extremely helpful! Great library btw, thx :)

YannDubs avatar May 22 '19 10:05 YannDubs

No news on this front unfortunately. What exactly is the problem you need to solve?

As mentioned in an earlier comment, a quick and dirty solution is to override get_loss (which you probably need to do anyway) and record individual loss terms to the history right there (using the training parameter to differentiate between train and inference).

BenjaminBossan avatar May 22 '19 21:05 BenjaminBossan

Hi!

I solved the problem by subclassing nn.Module and used three differently weighted cross entropy losses, that are calculated and summed in the forward method.

However, I struggle with BatchScoring with a metric such as accuracy. skorch breaks with the error:

Traceback (most recent call last):
  File "train.py", line 150, in <module>
    main()
  File "train.py", line 146, in main
    train(args.data_folder, args.out_model)
  File "train.py", line 94, in train
    net.fit(train_dataset)
  File "/home/kacper/anaconda3/envs/usg/lib/python3.6/site-packages/skorch/net.py", line 842, in fit
    self.partial_fit(X, y, **fit_params)
  File "/home/kacper/anaconda3/envs/usg/lib/python3.6/site-packages/skorch/net.py", line 801, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "/home/kacper/anaconda3/envs/usg/lib/python3.6/site-packages/skorch/net.py", line 742, in fit_loop
    self.notify('on_batch_end', X=Xi, y=yi_res, training=True, **step)
  File "/home/kacper/anaconda3/envs/usg/lib/python3.6/site-packages/skorch/net.py", line 287, in notify
    getattr(cb, method_name)(self, **cb_kwargs)
  File "/home/kacper/anaconda3/envs/usg/lib/python3.6/site-packages/skorch/callbacks/scoring.py", line 197, in on_batch_end
    y = None if y is None else self.target_extractor(y)
  File "/home/kacper/anaconda3/envs/usg/lib/python3.6/site-packages/skorch/utils.py", line 110, in to_numpy
    raise TypeError("Cannot convert this data type to a numpy array.")

The reason behind is that skorch tries to convert torch.Tensor to numpy matrix. However, my network returns a tuple of outputs from three different branches. Each output is of different dimensions, so I cannot torch.stack to create a single output and omit the problem in this way. Do you have an idea of how it can be solved?

Edit: I just passed lambda x: x to target_extractor in BatchScoring and now handle it internally in the metric definition.

kacper1095 avatar May 29 '19 12:05 kacper1095