
Better output when convergence check succeeds

Open rcurtin opened this issue 5 years ago • 20 comments

Issue description

When using callbacks like ens::ProgressBar or similar, and the optimization terminates partway through an epoch due to a convergence check, the output can be confusing:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75

but then output stops.

See the discussion at https://github.com/mlpack/mlpack/issues/2073 for many more details and how to reproduce.

Expected behavior

We should see if we can modify the ens::ProgressBar callback (and perhaps others) to give better output in these situations. Perhaps something like this would be an improvement:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75
Optimization converged! (value 1.33e-6, tolerance 1e-5)

Actual behavior

Instead the output just terminates:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75

This would be a good issue if you are not familiar with how ensmallen's callbacks work and would like to be. I don't have a direct route to a solution here, so my suggestion would be to investigate the current callback code, reproduce the issue, then think about the cleanest way to print convergence information in the ens::ProgressBar callback. Once that's done, see if similar changes might be useful for some of the other callbacks that print information (see http://ensmallen.org/docs.html#callback-documentation for more information on the callbacks that are available).
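
As a rough sketch of the kind of callback hook that might be involved, here is a minimal custom callback using the EndOptimization() hook listed in the callback documentation linked above; the class name and message are hypothetical, not a proposed implementation for ens::ProgressBar itself:

#include <ensmallen.hpp>
#include <iostream>

class PrintWhenDone
{
 public:
  // Called once, when the optimization loop finishes.
  template<typename OptimizerType, typename FunctionType, typename MatType>
  void EndOptimization(OptimizerType& /* optimizer */,
                       FunctionType& /* function */,
                       MatType& /* coordinates */)
  {
    std::cout << std::endl << "Optimization finished." << std::endl;
  }
};

// Usage, given some differentiable separable function f and starting point x:
//   ens::StandardSGD sgd;
//   sgd.Optimize(f, x, ens::ProgressBar(), PrintWhenDone());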

rcurtin avatar Dec 24 '19 02:12 rcurtin

I would like to take up this issue.

vikram2000b avatar Jan 10 '20 07:01 vikram2000b

Hi @vikram2000b, you're more than welcome to. When you have a PR ready, someone will review it when they're able to. :+1:

rcurtin avatar Jan 10 '20 14:01 rcurtin

Can I take up this issue as there is no PR yet? I am a beginner and I want to explore how this works.

dB2510 avatar Jan 26 '20 05:01 dB2510

@dB2510 I think that is reasonable, @vikram2000b are you fine with that?

zoq avatar Jan 26 '20 19:01 zoq

@zoq Any idea how I can reproduce the issue mentioned here https://github.com/mlpack/mlpack/issues/2073 ?

dB2510 avatar Jan 28 '20 17:01 dB2510

@dB2510 there is some code in that issue that you can directly compile and use in a program to reproduce the issue. You will probably have to find some datasets to use to reproduce the issue, but that shouldn't be hard---you can probably use some of the datasets in src/mlpack/tests/data/ or something like that.

rcurtin avatar Jan 30 '20 02:01 rcurtin

Hey @rcurtin, I wanted to know whether this has been solved; if not, can I move forward with it?

gaurav-singh1998 avatar Feb 23 '20 16:02 gaurav-singh1998

@gaurav-singh1998 I have yet to see any PR related to the issue, so yes, if you can solve the issue, please do. :+1:

rcurtin avatar Feb 23 '20 20:02 rcurtin

Hi @rcurtin @zoq, while I was going through the callback codebase trying to understand it for solving this issue, I came across the piece of code given below, which iterates over the callbacks and invokes the BeginOptimization() callback if it exists:

bool result = false;
(void) std::initializer_list<bool>{ result =
    result || Callback::BeginOptimizationFunction(callbacks, optimizer,
        function, coordinates)... };
return result;

I am having a hard time understanding how this works. Please help me understand the underlying concepts behind it. Thanks.

gaurav-singh1998 avatar Mar 08 '20 15:03 gaurav-singh1998

@rcurtin @zoq one more thing; sorry to bother you, but your help would be very useful for moving forward with this issue. According to the discussion that happened here, the optimization terminates partway either because the overallObjective value reaches nan or inf, because std::abs(lastObjective - overallObjective) < tolerance, or because of EarlyStopAtMinLoss. This information has to be incorporated into ens::ProgressBar. Am I right? Once again, sorry to disturb you with these trivial doubts. Thanks.

gaurav-singh1998 avatar Mar 08 '20 17:03 gaurav-singh1998

Hey @gaurav-singh1998, the bit of code in BeginOptimization() terminates the optimization after a callback returns false. We actually just submitted a paper to arXiv today that describes the details; you can find the sources in https://www.github.com/mlpack/ensmallen_tr/, and it will be up on arXiv soon.

I would suggest finding a way to make it so that ProgressBar can realize when optimization has terminated without actually adding all of the convergence checks to the internals of ProgressBar.
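
For reference, here is a minimal standalone sketch of the same pack-expansion idiom, with made-up callback types rather than the actual ensmallen classes:

#include <initializer_list>
#include <iostream>

struct PrintCallback
{
  bool BeginOptimization() { std::cout << "print callback" << std::endl; return false; }
};

struct StopCallback
{
  bool BeginOptimization() { std::cout << "stop callback" << std::endl; return true; }
};

// Expands the parameter pack: the pattern before the '...' is instantiated
// once per callback, so each element of the initializer_list calls
// BeginOptimization() on one callback and ORs its return value into 'result'.
// Because of short-circuit evaluation, callbacks after the first one whose
// result makes 'result' true are skipped.
template<typename... CallbackTypes>
bool BeginOptimizationAll(CallbackTypes&... callbacks)
{
  bool result = false;
  (void) std::initializer_list<bool>{ result =
      result || callbacks.BeginOptimization()... };
  return result;
}

int main()
{
  PrintCallback p;
  StopCallback s;
  std::cout << "result: " << BeginOptimizationAll(p, s) << std::endl;
}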

rcurtin avatar Mar 09 '20 13:03 rcurtin

Thanks @rcurtin for the clarification; I will try to solve this issue in the next few days.

gaurav-singh1998 avatar Mar 09 '20 16:03 gaurav-singh1998

Hi @rcurtin, while I was going through the ProgressBar code I found that the number of epochs is defined as epochs = optimizer.MaxIterations() / epochSize, but shouldn't it be epochs = optimizer.MaxIterations() / function.NumFunctions()? As an example, suppose there are 120000 data points in a dataset and the optimizer's MaxIterations() parameter is set to 240000, i.e. double the number of data points in the dataset. Then, according to the equation I stated, epochs would be 2, which seems correct to me, because an epoch is a single pass through the full training set and here we are supposed to take 2 passes through the dataset. Correct me if I am wrong. Thanks.
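
For concreteness, here is the arithmetic of that example in a small sketch (the numbers are hypothetical values, not taken from the actual ProgressBar code):

#include <cstddef>
#include <iostream>

int main()
{
  const size_t numFunctions  = 120000;  // data points, i.e. function.NumFunctions()
  const size_t maxIterations = 240000;  // optimizer.MaxIterations()

  // An epoch is a single pass over the full training set, so with twice as
  // many iterations as data points we expect two epochs:
  const size_t epochs = maxIterations / numFunctions;
  std::cout << "epochs: " << epochs << std::endl;  // prints 2
}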

gaurav-singh1998 avatar Mar 10 '20 19:03 gaurav-singh1998

epochSize is defined as function.NumFunctions() * optimizer.BatchSize(). Thus effectively epochSize is equivalent to the number of times the StepTaken() callback is called.

It does seem to me like epochs should then be defined to account for the value of optimizer.BatchSize(), so perhaps there is an issue there.

However, I'm not familiar enough with the code to say for sure. Before making any changes, I would suggest first confirming that this is an issue by creating an example where ProgressBar does not behave as expected, and then open a separate issue in this repository to handle it.

rcurtin avatar Mar 11 '20 12:03 rcurtin

Hi @rcurtin @zoq @shrit, following the discussion that happened here, I recreated the issue; the code for it can be found here. The dimensions of the dataset I used were [120000, 23] (considering a row-major matrix). After compiling and executing it, the output I got was:

Epoch 1/54
1563/1875 [===================================================================================>................] 83% - ETA: 0s - loss: 249.826

As can be seen, not a single epoch was completed by the training process, and I believe this is happening only because the entire dataset is not being read by the optimizer. I am saying this because:

  • After inspecting the source code of ens::EarlyStopAtMinLoss() I can say that it can only abort the training procedure by checking whether the bestObjective has been achieved at the end of each epoch, and here not a single epoch is completed.
  • Similarly, the checks std::isnan(overallObjective) || std::isinf(overallObjective) and std::abs(lastObjective - overallObjective) < tolerance (i.e. the divergence and tolerance checks) also cannot abort the training procedure: if we inspect the source code of sgd_impl.hpp, these conditions are only evaluated when (currentFunction % numFunctions) == 0 is true, and that only happens once an epoch is completed, which is not the case here (see the sketch after this list).
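
To make the second point concrete, here is a simplified sketch of the kind of end-of-epoch check described above; this is not the actual sgd_impl.hpp code, just an illustration of why the checks cannot fire when MaxIterations() is smaller than the number of data points:

#include <cmath>
#include <cstddef>
#include <iostream>

int main()
{
  const size_t numFunctions  = 120000;  // data points in the dataset
  const size_t maxIterations = 100000;  // fewer than numFunctions
  const double tolerance = 1e-5;

  double overallObjective = 0.0;
  double lastObjective = 1e308;  // effectively "no previous objective yet"

  for (size_t i = 0, currentFunction = 0; i < maxIterations;
       ++i, ++currentFunction)
  {
    // ... take an optimization step and accumulate overallObjective ...

    // The divergence and tolerance checks only run at an epoch boundary:
    if (currentFunction > 0 && (currentFunction % numFunctions) == 0)
    {
      if (std::isnan(overallObjective) || std::isinf(overallObjective))
      {
        std::cout << "objective diverged; terminating" << std::endl;
        break;
      }
      if (std::abs(lastObjective - overallObjective) < tolerance)
      {
        std::cout << "minimized within tolerance; terminating" << std::endl;
        break;
      }
      lastObjective = overallObjective;
      overallObjective = 0.0;
    }
  }

  // Since maxIterations < numFunctions, the branch above is never entered, so
  // the loop simply runs out of iterations and the output stops mid-epoch.
  std::cout << "ran out of iterations" << std::endl;
}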

So, I don't think that the optimizer reaching convergence is what causes this issue; it is only happening because the optimizer is not processing the entire dataset. I have tried to solve this issue here (kindly review it).

That said, I think there are some issues in progress_bar.hpp that need attention. One of them is the number of epochs not being set correctly, as I mentioned above, and another is the ETA value being stuck at zero during the training process. I will open a separate issue regarding these. Kindly correct me if I am wrong somewhere. Thanks.

gaurav-singh1998 avatar Mar 11 '20 18:03 gaurav-singh1998

@gaurav-singh1998 I do not think that the optimizer has converged in your case, but that is only because you have set optimizer.Tolerance() = -1, so the optimizer will never converge. I will try to clarify things as I understand them. The objective of this issue is to give better output specifically in the case when the optimizer converges, so it might be a good idea to increase optimizer.Tolerance() and wait until the optimizer converges. That output would allow us to understand why the training is terminating partway; since we do not have any output here, we cannot understand the reason behind the stopping. Since the number of iterations in your code is less than the number of data points, it is normal for the optimization to stop before the end of the epoch. (I am not sure if you have used your pull request https://github.com/mlpack/mlpack/pull/2238 in this case, but surely it should print the warning message.)
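
Just to illustrate the configuration being discussed, a minimal sketch (assuming the usual StandardSGD constructor order of step size, batch size, maximum iterations, tolerance):

#include <ensmallen.hpp>

int main()
{
  // A negative tolerance means |lastObjective - overallObjective| < tolerance
  // can never be satisfied, so the optimizer will never report convergence.
  ens::StandardSGD sgd(0.01 /* step size */, 64 /* batch size */,
                       240000 /* max iterations */, 1e-5 /* tolerance */);

  // The tolerance can also be changed after construction:
  sgd.Tolerance() = 1e-3;
}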

I am not familiar with progress_bar.hpp, so I do not know whether the number of epochs is calculated correctly. I have also noticed ETA: 0 several times while the bar is progressing, which might be an issue as well. Thank you for bringing it up here! :+1:

shrit avatar Mar 11 '20 20:03 shrit

Hi @shrit, thanks for commenting. As you suggested, I changed the optimizer.Tolerance() value from -1 to 120596 and let the optimizer run for three epochs until it converged; this was the output:

Epoch 1/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
208.982
1875/1875 [====================================================================================================] 100% - 5s 34ms/step - loss: 208.982
Epoch 2/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1.47048
1875/1875 [====================================================================================================] 100% - 7s 36ms/step - loss: 1.47048
Epoch 3/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
0.523849
1875/1875 [====================================================================================================] 100% - 4s 34ms/step - loss: 0.523849

As can be seen, there is no message stating that the optimizer has converged, which should have been printed according to this; but when I changed Info to std::cout I got the expected result, i.e.

Epoch 1/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
208.982
1875/1875 [====================================================================================================] 100% - 6s 35ms/step - loss: 208.982
Epoch 2/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1.47048
1875/1875 [====================================================================================================] 100% - 5s 34ms/step - loss: 1.47048
Epoch 3/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
0.523849
1875/1875 [====================================================================================================] 100% - 6s 35ms/step - loss: 0.523849
SGD: minimized within tolerance 120596; terminating optimization.

This is weird, because I think the error lies in ens::Info. @rcurtin @zoq, what do you think about this?

(I am not sure if you have used your pull request mlpack/mlpack#2238 in this case, but surely it should print the warning message)

Yes I used it and the warning message gets printed as expected. Kindly let me know if I am wrong somewhere. Thanks.

gaurav-singh1998 avatar Mar 11 '20 22:03 gaurav-singh1998

@gaurav-singh1998 I do not remember whether the ens::Info macros are ON by default in CMake. If not, maybe you need to add option(ENS_PRINT_INFO ON) in your CMakeLists.txt file.
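
Alternatively (a sketch, assuming the macro only needs to be visible before the header is included), you can define it directly in the source or pass it as a compiler flag:

// Either compile with -DENS_PRINT_INFO, or define the macro before the include:
#define ENS_PRINT_INFO
#include <ensmallen.hpp>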

shrit avatar Mar 11 '20 22:03 shrit

Hi @shrit, sorry for the late response, but when I made the change on this line to define ENS_PRINT_INFO, as suggested by @zoq on Gitter, I got a richer result, i.e.

Maximum number of iterations not defined (no limit), no progress bar shown.
Epoch 1/0
208.98275 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 208.982
SGD: iteration 120000, objective 2.50779e+07.
Epoch 2/0
1.4704875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 1.47048
SGD: iteration 240000, objective 176458.
Epoch 3/0
0.5238495 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 0.523849
SGD: iteration 360000, objective 62861.9.
SGD: minimized within tolerance 120596; terminating optimization.

In my opinion the first line, Maximum number of iterations not defined (no limit), no progress bar shown., shouldn't be printed, but other than that everything looks fine to me. What do you think about this?

gaurav-singh1998 avatar Mar 13 '20 19:03 gaurav-singh1998

Hi @shrit, kindly have a look at #180. Thanks.

gaurav-singh1998 avatar Mar 15 '20 18:03 gaurav-singh1998