Automatically record auxiliary variables

Open adbrebs opened this issue 9 years ago • 22 comments

Another question that arose from the tutorial. The norms of the parameters of a network seem to be recorded by default as auxiliary variables. As a low-level research library, we should probably avoid doing things under the hood and let the user decide precisely which auxiliary variables he/she wants to record, no? I know that in Pylearn2 many auxiliary variables are automatically recorded, but most of the time the user doesn't need them.

adbrebs avatar Mar 01 '15 14:03 adbrebs

Completely agree :) Which is why we don't monitor those auxiliary variables automatically, we just create them. Only the values of variables explicitly passed to e.g. DataStreamMonitoring are recorded. Does that make sense?

bartvm avatar Mar 01 '15 16:03 bartvm

Almost :). If they are not used anywhere, why create them by default?

adbrebs avatar Mar 01 '15 16:03 adbrebs

I agree with @adbrebs that these auxiliary variables are redundant, because fetching parameters with Selector or ComputationGraph and taking l2_norm is a piece of cake.

rizar avatar Mar 01 '15 16:03 rizar

@adbrebs Because the user might want to use them.

@rizar I disagree. Right now they may seem redundant because there is only one, which is easy to create yourself. Eventually we could have a large range of auxiliary variables. Pylearn2, for example, has the min, max and mean of the row- and column-wise entries as well as norms. Now let's say I suspect something weird is happening with some of my matrices; then I want to be able to pass linear.auxiliary_variables to the DataStreamMonitoring and get every possible piece of information I can about it. I don't want to be forced to write a lengthy [linear.W.mean(axis=1), linear.W.mean(axis=0), linear.W.max(axis=1), linear.W.min(axis=1), linear.W.max(axis=0), linear.W.min(axis=0), tensor.sqrt(tensor.sqr(linear.W).sum(axis=0)), tensor.sqrt(tensor.sqr(linear.W).sum(axis=1))]. Now imagine that I'm doing some optimization research and I just want to monitor as much as I can to analyze afterwards. With the current setup that would be a simple cg.auxiliary_variables. Removing this would mean writing custom code that repeats that lengthy line for each brick...
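
For concreteness, here is a rough sketch of the two styles being contrasted (a hypothetical toy model; only standard Blocks/Theano calls are used):

from theano import tensor
from blocks.bricks import Linear
from blocks.graph import ComputationGraph
from blocks.initialization import Constant, IsotropicGaussian

x = tensor.matrix('x')
linear = Linear(input_dim=10, output_dim=5, name='linear',
                weights_init=IsotropicGaussian(0.01), biases_init=Constant(0))
linear.initialize()
cg = ComputationGraph(linear.apply(x))

# Hand-written statistics for a single weight matrix ...
W = linear.W
manual = [W.mean(axis=1), W.mean(axis=0), W.max(axis=1), W.min(axis=1),
          W.max(axis=0), W.min(axis=0),
          tensor.sqrt(tensor.sqr(W).sum(axis=0)),
          tensor.sqrt(tensor.sqr(W).sum(axis=1))]

# ... versus simply picking up whatever the bricks already annotated.
everything = cg.auxiliary_variables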

bartvm avatar Mar 01 '15 16:03 bartvm

I understand that it can be more convenient in some cases but, as a matter of principle, I just don't like the idea of creating objects/variables under the hood that, most of the time, won't be used by the user.

adbrebs avatar Mar 01 '15 17:03 adbrebs

@bartvm , you can just as well factor out this logic into a separate file bricks/debug.py:

def parameter_statistics(parameter):
    # Return a list so it concatenates cleanly with other monitored variables.
    return [parameter.mean(), parameter.max(), parameter.min()]
...
DataStreamMonitoring([cost] + parameter_statistics(linear.W), ...)

This makes more sense IMHO rather than hard-coding an arbitrary set of observables right in the Blocks core code.

rizar avatar Mar 01 '15 17:03 rizar

Although I see your point, you could also factor that out by calling parameter_statistics directly in the application call. Otherwise it's still a mess if you have 25 linear bricks spread throughout your model. It becomes something like [cost] + list(chain(*[parameter_statistics(brick) for brick in model.bricks if isinstance(brick, Linear)])) + cg.auxiliary_variables.

Also, are you proposing not saving any monitoring variables anymore? Because there's also the case where auxiliary variables have aggregation schemes attached to them that rely on other variables which aren't easy to access (e.g. the mean activation per example).

bartvm avatar Mar 01 '15 18:03 bartvm

No, calling parameter_statistics in the application call has the same shortcoming: you add one and only one set of observables every time. Of course you can add various roles to these statistics and then select only those with the roles MIN, MAX, etc., but I think this is overkill.

For the frequent case of monitoring the parameters of linear bricks, you can just as well have a linear_weights_statistics function in blocks/debug.py. This function can accept keyword arguments, for instance verbosity levels, or switches for certain groups of observables. You can call it as follows

DataStreamMonitoring(
    [cost] + linear_weights_statistics(model.bricks), ...)

which I think is concise enough.
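
A minimal sketch of what such a helper might look like (hypothetical; only the Theano variable methods and the Linear brick are real API, and the verbosity switch is just an example of the kind of keyword argument meant above):

from blocks.bricks import Linear

def linear_weights_statistics(bricks, verbosity=1):
    """Hypothetical helper: build monitorable statistics of Linear weights."""
    variables = []
    for brick in bricks:
        if not isinstance(brick, Linear):
            continue
        W = brick.W
        variables.append(W.norm(2).copy(name=brick.name + '_W_norm'))
        if verbosity > 1:  # more detail on request
            variables += [W.mean().copy(name=brick.name + '_W_mean'),
                          W.max().copy(name=brick.name + '_W_max'),
                          W.min().copy(name=brick.name + '_W_min')]
    return variables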

In general there seem to be two alternative ways of building quantities for monitoring: in the brick code or by external routines. I think that when possible, the second way should be preferred, as it is more modular. Highly specific variables can be added in the bricks; it seems to me that bricks from papers and blocks-contrib are the more likely consumers of such features. For instance, Jan doing his research on a custom attention mechanism could indeed benefit a lot from computing the alignment penalty and the like right in the code of the application method and attaching it to the computation graph with the right aggregation scheme.

If you mean activation-per-example monitoring for, let's say, the MLP brick, I am afraid it is impossible to find a universal solution for this seemingly simple problem, because we do not know which axis corresponds to the number of examples :( For FF networks it is the first one, for recurrent networks it is typically the second one.

rizar avatar Mar 01 '15 19:03 rizar

I guess you could have an activation_statistics(application_call, batch_axis=0) function to let people easily create this?
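
Something along these lines, as a rough sketch (a hypothetical helper; here it takes a list of output variables, e.g. the result of a VariableFilter, rather than an application call):

def activation_statistics(outputs, batch_axis=0):
    """Hypothetical helper: per-example activation statistics."""
    variables = []
    for output in outputs:
        # Average over all non-batch axes to get one value per example,
        # then summarize that over the batch.
        axes = [a for a in range(output.ndim) if a != batch_axis]
        per_example = output.mean(axis=axes) if axes else output
        name = output.name or 'activation'
        variables += [per_example.mean().copy(name=name + '_mean'),
                      per_example.max().copy(name=name + '_max'),
                      per_example.min().copy(name=name + '_min')]
    return variables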

Although I agree in principle, I am still slightly worried that monitoring will become a hassle. There are use cases in which we care about only a handful of variables, but there are also cases where people could actually prefer Pylearn2's behavior of just being able to say "monitor everything that could be useful for me, throw it in a log, and I'll do the analysis afterwards". That's a reasonable workflow as opposed to "oh, this and this isn't working, let me add monitors for this and that variable and re-run"; 2 hours later, "oh, maybe I should have monitored this and that too"; waits 2 more hours.

So I can live with factoring this stuff out and making it more explicit (explicit is good!) as long as we make very sure that there are short, easily accessible convenience functions that make it easy to monitor all the reasonable quantities.

One last thing: debug.py is a really bad name, and sounds like debugging programming errors. It's also not always about debugging a model's training progress; sometimes you might actually just be studying these quantities. Something like stats.py, monitor_stats.py, monitoring_expressions.py, etc., maybe?

bartvm avatar Mar 01 '15 20:03 bartvm

but there are also cases where people could actually prefer Pylearn2's behavior of just being able to say "monitor everything that could be useful for me, throw it in a log, and I'll do the analysis afterwards".

I fully agree with that. But in Blocks this is implementable via an extension, say DefaultModelMonitoring, that can do its best to mimic Pylearn2 behaviour. We did a really good job of supporting model introspection in Blocks, which makes it possible to factor out a lot of logic from model code.

By the way, I guess such an extension would be quite popular, if its author does a good job of deciding what the things are that a person typically needs to understand why his training job failed. I would be happy to use it!

activation_statistics sounds good to me. I can imagine using it like activation_statistics(VariableFilter(application=mlp.apply)(cg)).

rizar avatar Mar 01 '15 20:03 rizar

Sorry, forgot to say that I find both monitor_stats.py and monitor_expressions.py acceptable, though the latter is rather long.

rizar avatar Mar 01 '15 20:03 rizar

Ah, I like the idea of a DefaultMonitor extension that just walks the graph and creates monitoring variables for everything it can find; it's a very good trade-off: usability without having monitoring values littered throughout the Bricks code.
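
Roughly, the variable-collection step of such an extension could look like this (DefaultMonitor itself is hypothetical; ComputationGraph, VariableFilter and the PARAMETER role are real Blocks API):

from blocks.filter import VariableFilter
from blocks.graph import ComputationGraph
from blocks.roles import PARAMETER

def default_monitoring_variables(cost):
    """Walk the graph below `cost` and gather everything worth monitoring."""
    cg = ComputationGraph(cost)
    parameters = VariableFilter(roles=[PARAMETER])(cg.variables)
    norms = [p.norm(2).copy(name=p.name + '_norm') for p in parameters]
    # Auxiliary variables created by the bricks come along for free.
    return cg.auxiliary_variables + norms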

bartvm avatar Mar 01 '15 20:03 bartvm

That sounds like a good idea.

In the same vein: I started a "DetailedStepMonitor" that could be part of a CompositeStep and would create max/median/mean variables for each parameter receiving step updates. Very straightforward. The only issue I encountered was that variables typically do not have a "fully qualified name", which made it hard to identify them.

jbornschein avatar Mar 02 '15 04:03 jbornschein

part of a CompositeStep

I am not sure we should mix algorithms and their monitoring. Maybe if you share your situation with us we can come up with a better solution.

rizar avatar Mar 02 '15 07:03 rizar

Correction: ... that would create max/median/mean variables for each parameter update per StepRule.

I had unstable learning dynamics and it was very useful to see the magnitude of individual parameter updates at various stages -- e.g. before RMSProp. In other words: the same as total_gradient_norm, just not total, and at a self-selected stage during step processing.

It was important to see individual parameters because some of them would parameterize different standard deviations; seeing which one spiraled out of control first was much more useful than just looking at the total norm.

jbornschein avatar Mar 02 '15 12:03 jbornschein

Okay, I see your point. You want to monitor intermediate results of CompositeStep.compute_steps, which is a perfectly legitimate desire.

One way we could support it is to save these intermediate steps in an attribute of the CompositeStep object:

step_rule = CompositeStep(...)
step_rule.intermediate_steps
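
An extension (or the user) could then build monitors from those, for example (intermediate_steps is the hypothetical attribute proposed above; the rest is standard Theano/Blocks API):

from blocks.extensions.monitoring import TrainingDataMonitoring

step_monitors = [step.norm(2).copy(name='step_norm_{}'.format(i))
                 for i, step in enumerate(step_rule.intermediate_steps)]
extension = TrainingDataMonitoring(step_monitors, after_epoch=True)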

rizar avatar Mar 02 '15 13:03 rizar

That would be an option.

I see two aspects that might make it desirable to explicitly activate this detailed monitoring:

  1. these are a lot of variables: per step-rule, per variable, per {max,mean,median}. It is probably a common case that the user only wants to select a subset (a slice) of these.
  2. these variables would usually only be monitored in TrainingDataMonitoring. Of course the user might choose to monitor them on a test set -- but that would trigger gradient computations. So maybe they should not be tagged with the AUXILIARY role?

jbornschein avatar Mar 02 '15 15:03 jbornschein

Sure, such monitoring will have to be activated explicitly. Just to make it clear: I only proposed a way to save the variables that should be monitored; all the rest will be done by the extension.

I am not sure I understand your logic in (2). It seems like you expect that all auxiliary variables will be evaluated on the validation dataset, but this is not true.

However, reading (2) I came up with an idea: the step computation is in fact also a computation graph and can be managed by the ComputationGraph object! It seems like @dwf at some point proposed something like this. The intermediate results of the step computation can be given roles, and the step rules can serve as annotations: we can use very similar machinery. This does introduce a dependency on graph.py for blocks/algorithms. But it seems like a very good way to avoid having weird heterogeneous interfaces for the step rules. What do you think, @bartvm?
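
In code, the idea is roughly this (a sketch; `gradients` stands for the usual dict mapping parameters to their gradient expressions):

from blocks.graph import ComputationGraph

steps, updates = step_rule.compute_steps(gradients)
step_cg = ComputationGraph(list(steps.values()))
# step_cg.variables now exposes every intermediate result of the step
# computation, ready to receive roles and be selected with a VariableFilter.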

rizar avatar Mar 02 '15 18:03 rizar

I'll have to read through the discussion in a bit more detail, but if what you're proposing is adding roles to variables created by the step rules so that an extension can easily find them and monitor them, that sounds okay to me.

There will be a dependency on graph.py, as you said, which is too bad, but it's "optional" in the sense that in order to create a new algorithm you don't need to use it, and if you want to use the algorithm outside of Blocks you just remove the relevant imports/lines and everything should still work. It's not as copy-paste friendly as the ideal, but considering the reduced code complexity (and the interfaces that could break), I'm okay with that.

bartvm avatar Mar 02 '15 19:03 bartvm

Nice that you agree!

@jbornschein , let us know if you have any questions.

rizar avatar Mar 02 '15 19:03 rizar

@rizar Yes, you are right, having the AUXILIARY role does not imply they will be monitored on the test/validation dataset. I just wanted to point out that maybe we should indeed have another role for step-monitor variables, because it is probably annoying if there are a lot of them and you can't select them easily.

I'm still under the impression that having a DetailedStepMonitor that can be placed somewhere between the StepRules and that adds monitoring variables is a very natural user interface. Yes, it mixes monitoring with update rules, but a StepRule receives exactly the information we are interested in. And to me it feels quite natural to place a 'monitoring probe' in the StepRule pipeline.

I don't think I fully understand the interface/API you proposed: would you prefer a function/object that takes the complete computation graph as input, plus a specification of which StepRules the monitoring should be applied to? It would then filter the ComputationGraph, look for StepRule results that match the specification, and add monitoring variables to them?

jbornschein avatar Mar 02 '15 20:03 jbornschein

Right, we can and we should have various subroles for AUXILIARY, for instance DIAGNOSTIC, STEP_DIAGNOSTIC, etc. Expanding the role tree in depth is not an issue.

Do you want the DetailedStepMonitor to add monitoring variables to the step computation graph?

I think you understood me right: step rules should produce a nicely annotated step computation graph, with nice names (both name and tag.name), roles (e.g. HYPERPARAMETER, STEP, ACCUMULATOR), and annotations (step rules can inherit from Annotation and be added to tag.annotations). That allows us to access whatever we want with a VariableFilter. Furthermore, one can write functions which, given a specification of which step rules should be monitored, do their best to build a set of variables useful for monitoring, just as you suggest. Or, in case you have a custom learning rule, you can add variables with the STEP_DIAGNOSTIC role right to the step computation graph in compute_steps.
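
A small sketch of the role part (AuxiliaryRole is real Blocks API; the STEP_DIAGNOSTIC role itself is only proposed here):

from blocks.roles import AuxiliaryRole

class StepDiagnosticRole(AuxiliaryRole):
    pass

# Proposed sub-role of AUXILIARY for step-monitoring variables.
STEP_DIAGNOSTIC = StepDiagnosticRole()

# A custom compute_steps could then tag its intermediate results with
# blocks.roles.add_role(scaled_step, STEP_DIAGNOSTIC), and an extension
# would collect them via VariableFilter(roles=[STEP_DIAGNOSTIC]).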

rizar avatar Mar 03 '15 14:03 rizar