Refactor `RemoveNotFinite` to Remember If There Were NaNs
There should be a way to avoid updating a parameter when its gradient contains NaNs.
See discussion in #343.
http://deeplearning.net/software/pylearn2/library/devtools.html#module-pylearn2.devtools.nan_guard
And if you want to avoid updating instead of stopping, the new plugin by @memimo does that, no?
I want to avoid updating but raise an exception or warning. The new step rule just hides the NaNs.
Relevant: http://deeplearning.net/software/pylearn2/library/devtools.html#pylearn2.devtools.nan_guard.NanGuardMode
The new step rule avoids updating if you set the scaling to 1, so that plus NanGuardMode might be enough?
I don't like that by default blocks overwrites the model file with parameters corrupted by NaNs, forcing me to retrain the model from scratch. So throwing an exception seems reasonable in some cases.
Would it be better to throw an exception in the Dump class when trying to serialize NaNs?
You will probably want to be very careful about that to make sure you avoid partially serializing something before you throw that exception.
@dwf Yes, sure. It can be solved with a temp file.
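To make the temp-file idea concrete, here is a minimal sketch (plain NumPy and `pickle`, not the actual blocks `Dump` class; the function name and dict-of-arrays format are assumptions for illustration). It checks for non-finite values *before* writing anything, and writes through a temporary file so a failure can never leave a partially serialized model on disk:

```python
import os
import pickle
import tempfile

import numpy as np


def safe_dump(parameters, path):
    """Serialize a dict of name -> ndarray to ``path``.

    Raises ValueError before touching the disk if any array contains
    NaN or inf; otherwise writes to a temp file in the same directory
    and atomically renames it into place, so an interrupted or failed
    dump never corrupts an existing model file.
    """
    for name, value in parameters.items():
        if not np.all(np.isfinite(value)):
            raise ValueError("non-finite values in parameter %r" % name)

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(parameters, f)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        os.remove(tmp_path)
        raise
```

Checking first and renaming last means a previously saved good model survives both a NaN and a mid-write crash.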
While it is true that a temp file would do the job, it feels much more natural to me to handle these things in the algorithm. That way, when you get NaNs, an exception is raised and the main loop state from just before the NaN occurred gets saved to a separate file.
Is the NanGuardMode as fast as fast_run?
Theoretically, no solution is ever going to be as fast as not checking, right? Even if you could optimize the hell out of it, at the very least you still need to make an element-wise comparison for each parameter which you otherwise wouldn't have to.
You won't want to handle this for each possible algorithm, but you can probably add it to the MainLoop. If the error occurs, catch it, print an informative warning, and go straight to after_training extensions (instead of on_error ones).
I did not mean "exactly as fast" as fast_run; I meant that I would not like to break optimizations or have fewer optimizations applied than with fast_run.
I think we'll have to ask @nouiz about the details, but as far as I can tell from the code, Pylearn2's NaNGuard compiles with the same mode, so optimizations should still be applied. My best guess is that the NaN checks occur in between the ops of the final, optimized graph. That's just an informed guess though :)
Anyway, I think I know a super easy solution! We should simply add a shared variable to RemoveNotFinite that is set to 1 if there were non-finite elements in the step and 0 if there weren't. The user can do whatever they want with it. This is very Blocks-ish, I think.
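A sketch of the proposed semantics, in plain NumPy rather than Theano (the class name, the `not_finite` attribute standing in for the shared variable, and the per-parameter replacement are all illustrative assumptions, not blocks' actual API):

```python
import numpy as np


class RemoveNotFiniteWithFlag:
    """Apply RemoveNotFinite-style rescaling per parameter and record
    in ``self.not_finite`` (a stand-in for the proposed Theano shared
    variable) whether the last batch of steps contained NaN/inf."""

    def __init__(self, scaler=0.1):
        self.scaler = scaler
        self.not_finite = 0

    def apply(self, parameters, steps):
        self.not_finite = 0
        new_parameters = []
        for param, step in zip(parameters, steps):
            if np.all(np.isfinite(step)):
                new_parameters.append(param - step)
            else:
                self.not_finite = 1
                # Replace the broken step with scaler * param, so the
                # updated parameter becomes (1 - scaler) * param.
                new_parameters.append(param - self.scaler * param)
        return new_parameters
```

An extension running after each batch could then inspect the flag and raise, warn, or trigger a checkpoint, exactly as discussed above.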
Sounds good to me, but just to clarify: this does mean that the user won't be able to retrieve the state of the parameters from right before the NaN occurred, right? First the rescaling is applied to those shared variables whose updates contained NaN, and only after that does the user get a chance, in the after_batch extensions, to check whether a scaled update occurred because of a NaN.
My guess is that this is fine. In practice, though, NaNs might occur in the same cases where other shared variables are updated with finite but extremely large steps, meaning that you would still have a model in a bad state. Still, at least it won't contain NaNs :)
By choosing the right scaling constant, the parameters that would contain NaNs will not be corrupted. You make a good point that other parameters might be damaged. I think we should also add an option to this step rule to operate globally on all parameters: if any parameter's step has NaN, all the parameters are scaled instead of being updated. It seems to me that this is the solution, right?
Yeah, that sounds easy enough, just do something like tensor.any([tensor.any(tensor.or_(tensor.isnan(step), tensor.isinf(step))) for step in steps]) -> scale all parameters (potentially with a scaler of 1.0)
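The global variant could look something like this NumPy sketch (function name and signature are made up for illustration; the real thing would be a Theano graph inside the step rule):

```python
import numpy as np


def global_remove_not_finite(parameters, steps, scaler=0.1):
    """If the step for *any* parameter contains NaN/inf, discard all
    steps and shrink every parameter by (1 - scaler) instead;
    otherwise apply all steps normally."""
    any_bad = any(not np.all(np.isfinite(step)) for step in steps)
    if any_bad:
        return [param - scaler * param for param in parameters]
    return [param - step for param, step in zip(parameters, steps)]
```

With scaler=1.0 a bad batch zeroes the parameters; with a small scaler it just nudges them toward zero while dropping the corrupted update.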
Okay, then this ticket becomes a CCW level refactoring of RemoveNotFinite.
If you need my input, come see me.
By the way, the documentation of RemoveNotFinite seems misleading to me. It claims that "parameters are replaced with scaled version", which is ultimately true, but the scaling coefficient is 1 - self.scaler, not self.scaler!
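A tiny worked example of that last point, in plain NumPy with a hypothetical scaler of 0.1 (this follows the replacement rule described in this thread, not blocks' exact code):

```python
import numpy as np

scaler = 0.1
param = np.array([2.0, -4.0])
step = np.full_like(param, np.nan)  # a fully corrupted update

# The non-finite step is replaced by scaler * param, so the update
# param - step leaves the parameter multiplied by (1 - scaler),
# not by scaler, as the docstring might lead you to believe:
new_param = param - scaler * param
assert np.allclose(new_param, (1 - scaler) * param)  # 0.9 * param
```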