Refactor `RemoveNotFinite` to Remember If There Were NaNs
There should be a way to avoid updating a parameter when its gradient contains NaNs.
See discussion in #343.
http://deeplearning.net/software/pylearn2/library/devtools.html#module-pylearn2.devtools.nan_guard
And if you want to avoid updating instead of stopping, the new plugin by @memimo does that, no?
I want to avoid updating but raise an exception or warning. The new step rule just hides the NaNs.
Relevant: http://deeplearning.net/software/pylearn2/library/devtools.html#pylearn2.devtools.nan_guard.NanGuardMode
The new step rule avoids updating if you set the scaling to 1, so that plus NanGuardMode might be enough?
I don't like that by default blocks overwrites the model file with parameters corrupted by NaNs, forcing me to retrain the model from scratch. So throwing an exception seems reasonable in some cases.
Would it be better to throw an exception in the Dump class when trying to serialize NaNs?
You will probably want to be very careful about that to make sure you avoid partially serializing something before you throw that exception.
@dwf Yes, sure. It can be solved with a temp file.
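To make the temp-file idea concrete, here is a minimal sketch (plain NumPy and `pickle`, not the actual blocks `Dump` class; the function name and dict-of-arrays format are assumptions for illustration). It checks for non-finite values *before* writing anything, and writes through a temporary file so a failure can never leave a partially serialized model on disk:

```python
import os
import pickle
import tempfile

import numpy as np


def safe_dump(parameters, path):
    """Serialize a dict of name -> ndarray to ``path``.

    Raises ValueError before touching the disk if any array contains
    NaN or inf; otherwise writes to a temp file in the same directory
    and atomically renames it into place, so an interrupted or failed
    dump never corrupts an existing model file.
    """
    for name, value in parameters.items():
        if not np.all(np.isfinite(value)):
            raise ValueError("non-finite values in parameter %r" % name)

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(parameters, f)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        os.remove(tmp_path)
        raise
```

Checking first and renaming last means a previously saved good model survives both a NaN and a mid-write crash.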
While it is true that a temp file would do the job, it feels much more natural to me to handle these things in the algorithm. That way, when you get NaNs, an exception is raised and the main loop state from just before the NaN occurred gets saved to a separate file.
Is the NanGuardMode as fast as fast_run?
Theoretically, no solution is ever going to be as fast as not checking, right? Even if you could optimize the hell out of it, at the very least you still need to make an element-wise comparison for each parameter which you otherwise wouldn't have to.
You won't want to handle this for each possible algorithm, but you can probably add it to the MainLoop. If the error occurs, catch it, print an informative warning, and go straight to after_training extensions (instead of on_error ones).
I did not mean "exactly as fast" as fast_run; I meant that I would not like to break optimizations or have fewer optimizations applied than with fast_run.
I think we'll have to ask @nouiz about the details, but as far as I can tell from the code, Pylearn2's NaNGuard compiles with the same mode, so optimizations should still be applied. My best guess is that the NaN checks occur in between the ops of the final, optimized graph. That's just an informed guess though :)
Anyway, I think I know a super easy solution! We should simply add a shared variable to RemoveNotFinite that is set to 1 if there were non-finite elements in the step and 0 if there weren't. The user can do whatever they want with it. This is very Blocks-ish, I think.
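A sketch of the proposed semantics, in plain NumPy rather than Theano (the class name, the `not_finite` attribute standing in for the shared variable, and the per-parameter replacement are all illustrative assumptions, not blocks' actual API):

```python
import numpy as np


class RemoveNotFiniteWithFlag:
    """Apply RemoveNotFinite-style rescaling per parameter and record
    in ``self.not_finite`` (a stand-in for the proposed Theano shared
    variable) whether the last batch of steps contained NaN/inf."""

    def __init__(self, scaler=0.1):
        self.scaler = scaler
        self.not_finite = 0

    def apply(self, parameters, steps):
        self.not_finite = 0
        new_parameters = []
        for param, step in zip(parameters, steps):
            if np.all(np.isfinite(step)):
                new_parameters.append(param - step)
            else:
                self.not_finite = 1
                # Replace the broken step with scaler * param, so the
                # updated parameter becomes (1 - scaler) * param.
                new_parameters.append(param - self.scaler * param)
        return new_parameters
```

An extension running after each batch could then inspect the flag and raise, warn, or trigger a checkpoint, exactly as discussed above.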
Sounds good to me, but just to clarify: this does mean that the user won't be able to retrieve the state of the parameters from right before the NaN occurred, right? First the rescaling is applied to those shared variables whose updates contained NaN, and only after that does the user get a chance, in the after_batch extensions, to check whether a scaled update occurred because of a NaN.
My guess is that this is fine. In practice, though, NaNs might occur in the same cases where other shared variables are updated with finite but extremely large steps, meaning that you would still have a model in a bad state. Still, at least it won't contain NaNs :)
By choosing the right scaling constant, the parameters that would contain NaNs will not be corrupted. You make a good point that other parameters might be damaged. I think we should also add an option to this step rule to operate globally on all parameters: if any parameter's step has NaN, all the parameters are scaled instead of being updated. It seems to me that this is the solution, right?
Yeah, that sounds easy enough, just do something like tensor.any([tensor.any(tensor.or_(tensor.isnan(step), tensor.isinf(step))) for step in steps]) -> scale all parameters (potentially with a scaler of 1.0)
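The global variant could look something like this NumPy sketch (function name and signature are made up for illustration; the real thing would be a Theano graph inside the step rule):

```python
import numpy as np


def global_remove_not_finite(parameters, steps, scaler=0.1):
    """If the step for *any* parameter contains NaN/inf, discard all
    steps and shrink every parameter by (1 - scaler) instead;
    otherwise apply all steps normally."""
    any_bad = any(not np.all(np.isfinite(step)) for step in steps)
    if any_bad:
        return [param - scaler * param for param in parameters]
    return [param - step for param, step in zip(parameters, steps)]
```

With scaler=1.0 a bad batch zeroes the parameters; with a small scaler it just nudges them toward zero while dropping the corrupted update.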
Okay, then this ticket becomes a CCW level refactoring of RemoveNotFinite.
If you need my input, come see me.
By the way, the documentation of RemoveNotFinite seems misleading to me. It claims that "parameters are replaced with scaled version", which is ultimately true, but the scaling coefficient is 1 - self.scaler, not self.scaler!
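A tiny worked example of that last point, in plain NumPy with a hypothetical scaler of 0.1 (this follows the replacement rule described in this thread, not blocks' exact code):

```python
import numpy as np

scaler = 0.1
param = np.array([2.0, -4.0])
step = np.full_like(param, np.nan)  # a fully corrupted update

# The non-finite step is replaced by scaler * param, so the update
# param - step leaves the parameter multiplied by (1 - scaler),
# not by scaler, as the docstring might lead you to believe:
new_param = param - scaler * param
assert np.allclose(new_param, (1 - scaler) * param)  # 0.9 * param
```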