ensmallen
Better output when convergence check succeeds
Issue description
When using callbacks like ens::ProgressBar or similar, and the optimization terminates partway through an epoch due to a convergence check, the output can be confusing:
Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75
but then output stops.
See the discussion at https://github.com/mlpack/mlpack/issues/2073 for many more details and how to reproduce.
Expected behavior
We should see if we can modify the ens::ProgressBar callback (and perhaps others) to give better output in these situations. Perhaps something like this would be an improvement:
Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75
Optimization converged! (value 1.33e-6, tolerance 1e-5)
Actual behavior
Instead the output just terminates:
Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75
This would be a good issue if you are not familiar with how ensmallen's callbacks work and would like to be. I don't have a direct route to a solution here, so my suggestion would be to investigate the current callback code, reproduce the issue, and then think about the cleanest way to print convergence information in the ens::ProgressBar callback. Once that's done, see if similar changes might be useful for some of the other callbacks that print information (see http://ensmallen.org/docs.html#callback-documentation for more information on the callbacks that are available).
I would like to take up this issue.
Hi @vikram2000b, you're more than welcome to. When you have a PR ready, someone will review it when they're able to. :+1:
Can I take up this issue as there is no PR yet? I am a beginner and I want to explore how this works.
@dB2510 I think that is reasonable, @vikram2000b are you fine with that?
@zoq Any idea how I can reproduce the issue mentioned here: https://github.com/mlpack/mlpack/issues/2073?
@dB2510 there is some code in that issue that you can directly compile and use in a program to reproduce the issue. You will probably have to find some datasets to use to reproduce it, but that shouldn't be hard; you can probably use some of the datasets in src/mlpack/tests/data/ or something like that.
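For anyone who wants a self-contained starting point, here is a minimal sketch of such a reproduction using only ensmallen (the dataset, function, and all parameter values are hypothetical stand-ins; the authoritative repro code is in mlpack#2073). The idea is to run SGD with ens::ProgressBar and a deliberately loose tolerance so that the optimization terminates while the bar is mid-epoch:

```cpp
#include <ensmallen.hpp>

// A differentiable separable function: f_i(x) = (a_i^T x - b_i)^2.
class LeastSquaresFunction
{
 public:
  LeastSquaresFunction(const arma::mat& a, const arma::rowvec& b) :
      a(a), b(b) { }

  size_t NumFunctions() const { return a.n_cols; }
  void Shuffle() { /* no-op; omitted for brevity */ }

  double Evaluate(const arma::mat& x, const size_t begin,
                  const size_t batchSize)
  {
    double objective = 0.0;
    for (size_t i = begin; i < begin + batchSize; ++i)
    {
      const double err = arma::dot(a.col(i), x) - b[i];
      objective += err * err;
    }
    return objective;
  }

  void Gradient(const arma::mat& x, const size_t begin,
                arma::mat& gradient, const size_t batchSize)
  {
    gradient.zeros(x.n_rows, x.n_cols);
    for (size_t i = begin; i < begin + batchSize; ++i)
      gradient += 2.0 * (arma::dot(a.col(i), x) - b[i]) * a.col(i);
  }

 private:
  const arma::mat& a;
  const arma::rowvec& b;
};

int main()
{
  // Hypothetical random dataset with 120000 points of dimension 10.
  arma::mat a(10, 120000, arma::fill::randu);
  arma::rowvec b(120000, arma::fill::randu);
  LeastSquaresFunction f(a, b);

  // Deliberately loose tolerance so the convergence check fires early.
  ens::SGD<> optimizer(0.0001 /* step size */, 64 /* batch size */,
      240000 /* max iterations */, 1.0 /* tolerance */);
  arma::mat coordinates(10, 1, arma::fill::randu);
  optimizer.Optimize(f, coordinates, ens::ProgressBar());
}
```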
Hey @rcurtin, I wanted to know if this has been solved; if not, can I move forward with it?
@gaurav-singh1998 I have yet to see any PR related to the issue, so yes, if you can solve the issue, please do. :+1:
Hi, @rcurtin @zoq, while I was going through the callback codebase trying to understand it for solving this issue, I came across the piece of code given below, which iterates over the callbacks and invokes the BeginOptimization() callback if it exists:
```cpp
bool result = false;
// Pack expansion: one `result = result || ...` expression per callback in
// the `callbacks` pack, evaluated left to right; the initializer_list is
// only a vehicle for the expansion and is immediately discarded.
(void) std::initializer_list<bool>{ result =
    result || Callback::BeginOptimizationFunction(callbacks, optimizer,
    function, coordinates)... };
return result;
```
I am having a hard time understanding how this works. Could you please help me understand the underlying concepts? Thanks.
@rcurtin @zoq, one more thing, and sorry if I am bothering you, but your help would be very useful for moving forward with this issue. According to the discussion that happened here, the optimization terminates partway either because the overallObjective value reaches nan or inf, because std::abs(lastObjective - overallObjective) < tolerance, or because of EarlyStopAtMinLoss. This information has to be incorporated into ens::ProgressBar. Am I right? Once again, sorry to disturb you with these trivial doubts. Thanks.
Hey @gaurav-singh1998, the bit of code in BeginOptimization() terminates the optimization after a callback returns false. We actually just submitted a paper to arXiv today that describes the details; you can find the sources in https://www.github.com/mlpack/ensmallen_tr/, and it will be up on arXiv soon.
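To illustrate the idiom, here is a minimal standalone sketch (hypothetical callback functions, not the actual ensmallen source) of how the std::initializer_list trick expands a parameter pack into one call per callback:

```cpp
#include <initializer_list>
#include <iostream>

// Stand-ins for callbacks: each is simply a callable returning bool.
bool SayA() { std::cout << "A" << std::endl; return false; }
bool SayB() { std::cout << "B" << std::endl; return true;  }

template<typename... CallbackTypes>
bool AnyReturnedTrue(CallbackTypes... callbacks)
{
  bool result = false;
  // The `...` expands the braced list into one element per callback,
  // evaluated left to right; the initializer_list itself is discarded
  // (hence the (void) cast), and only the side effect on `result`
  // matters. Because of `||` short-circuiting, once one callback returns
  // true the remaining ones are skipped.
  (void) std::initializer_list<bool>{
      (result = result || callbacks())... };
  return result;
}

int main()
{
  std::cout << AnyReturnedTrue(SayA, SayB) << std::endl; // prints A, B, 1
}
```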
I would suggest finding a way to make it so that ProgressBar can realize when optimization has terminated, without actually adding all of the convergence checks to the internals of ProgressBar.
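One possible direction, sketched under the assumption that a callback implementing only an EndOptimization() hook behaves as in the callback documentation linked above (everything else here is hypothetical):

```cpp
#include <iostream>

// A tiny extra callback whose EndOptimization() hook prints a closing
// line, so that a progress bar interrupted mid-epoch is at least followed
// by an explicit notice that the optimizer stopped.
class PrintTerminationNotice
{
 public:
  template<typename OptimizerType, typename FunctionType, typename MatType>
  void EndOptimization(OptimizerType& /* optimizer */,
                       FunctionType& /* function */,
                       MatType& /* coordinates */)
  {
    std::cout << std::endl << "Optimization terminated." << std::endl;
  }
};

// Hypothetical usage:
//   optimizer.Optimize(f, coordinates, ens::ProgressBar(),
//       PrintTerminationNotice());
```

This alone cannot say *why* optimization stopped (that is the hard part mentioned above), but it shows where such output could be hooked in without duplicating the convergence checks.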
Thanks, @rcurtin for the clarification I will try to solve this issue in the next few days.
Hi, @rcurtin, while I was going through the ProgressBar code I found that the number of epochs is defined as epochs = optimizer.MaxIterations() / epochSize, but shouldn't it be epochs = optimizer.MaxIterations() / function.NumFunctions()? Consider an example: suppose there are 120000 data points in a dataset and the optimizer's MaxIterations() parameter is set to 240000, i.e. double the number of data points. Then, by the equation I stated, epochs turns out to be 2, which seems correct to me, because an epoch is a single pass through the full training set and here we are supposed to take 2 passes through the dataset. Correct me if I am wrong. Thanks.
epochSize is defined as function.NumFunctions() * optimizer.BatchSize(). Thus, effectively, epochSize is equivalent to the number of times the StepTaken() callback is called.

It does seem to me like epochs should then be defined to account for the value of optimizer.BatchSize(), so perhaps there is an issue there.

However, I'm not familiar enough with the code to say for sure. Before making any changes, I would suggest first confirming that this is an issue by creating an example where ProgressBar does not behave as expected, and then opening a separate issue in this repository to handle it.
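Taking the quoted definitions at face value, a small worked example (hypothetical numbers from the discussion above) shows the discrepancy:

```cpp
#include <cstddef>
#include <iostream>

int main()
{
  const size_t numFunctions  = 120000; // data points
  const size_t batchSize     = 64;
  const size_t maxIterations = 240000; // two passes over the data

  // With epochSize = numFunctions * batchSize as quoted above,
  // epochs = maxIterations / epochSize truncates to 0.
  std::cout << "epochs (quoted definition): "
            << maxIterations / (numFunctions * batchSize) << std::endl;

  // One epoch == one full pass over the data, as suggested in the
  // question, gives the expected 2.
  std::cout << "epochs (maxIterations / numFunctions): "
            << maxIterations / numFunctions << std::endl;
}
```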
Hi, @rcurtin @zoq @shrit, following the discussion that happened here, I recreated the issue; the code for it can be found here. The dimensions of the dataset I used were [120000, 23] (considering a row-major matrix). After compiling and executing it, the output I got was:
Epoch 1/54
1563/1875 [===================================================================================>................] 83% - ETA: 0s - loss: 249.826
As can be seen, not a single epoch was completed by the training process, and in my view this is happening only because the entire dataset is not being read by the optimizer. I am saying this because:
- After inspecting the source code of ens::EarlyStopAtMinLoss() I can say that it can only abort the training procedure after checking, at the end of every epoch, whether the bestObjective has been achieved, and here not a single epoch is being completed.
- Similarly, std::isnan(overallObjective) || std::isinf(overallObjective) and std::abs(lastObjective - overallObjective) < tolerance, i.e. the overallObjective availability and tolerance checks, also cannot abort the training procedure: if we inspect the source code of sgd_impl.hpp, these conditions only get checked when (currentFunction % numFunctions) == 0 is true, and that can only be true when an epoch completes, which is not the case here (see the schematic sketch after this list).
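To make the control flow in the second bullet concrete, here is a schematic sketch (hypothetical variable names, not the actual sgd_impl.hpp source) of an SGD-style loop in which the divergence and tolerance checks only run at epoch boundaries:

```cpp
#include <cmath>
#include <cstddef>

// Returns true if the loop terminates via the tolerance check.
bool RunSgdLikeLoop(const size_t numFunctions, const size_t batchSize,
                    const size_t maxIterations, const double tolerance)
{
  double overallObjective = 0.0, lastObjective = 1e30;
  size_t currentFunction = 0;
  for (size_t i = 0; i < maxIterations; i += batchSize)
  {
    // ... take one mini-batch step and add its loss to overallObjective ...
    currentFunction = (currentFunction + batchSize) % numFunctions;
    if (currentFunction == 0) // only true once a full pass completes
    {
      if (std::isnan(overallObjective) || std::isinf(overallObjective))
        return false; // diverged
      if (std::abs(lastObjective - overallObjective) < tolerance)
        return true;  // converged within tolerance
      lastObjective = overallObjective;
      overallObjective = 0.0;
    }
  }
  return false; // hit maxIterations without either check firing
}
```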
So, I don't think that the optimizer is reaching convergence; this issue is only happening because the optimizer is not parsing the entire dataset. I have tried to solve this here (kindly review it).
That being said, I think there are some issues in progress_bar.hpp which need attention. One of them is the number of epochs not being set correctly, as I mentioned above, and another is the ETA value being stuck at zero during the training process. I will open a different issue regarding these. Kindly correct me if I am wrong somewhere. Thanks.
@gaurav-singh1998 I do not think that the optimizer has converged in your case, but that is only because you have set optimizer.Tolerance() = -1; the optimizer will never converge in this case.

I will try to clarify things as I understand them. Actually, the objective of this issue is to give us better output only in the case when the optimizer converges. Therefore, it might be a good idea to increase optimizer.Tolerance() and wait until the optimizer converges.
This output will allow us to understand why the training is terminating partway; since we do not have any output here, we cannot understand the reason behind the stopping.
Since the number of iterations in your code is less than the number of data points, it is normal for the optimization to stop before the end of the epoch. (I am not sure if you have used your pull request https://github.com/mlpack/mlpack/pull/2238 in this case, but it should surely print the warning message.)
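To make the Tolerance() point concrete, a minimal sketch (hypothetical value):

```cpp
#include <ensmallen.hpp>

int main()
{
  ens::SGD<> optimizer;
  // With a negative tolerance, std::abs(lastObjective - overallObjective)
  // can never fall below it, so tolerance-based termination is disabled
  // and the optimizer only stops at MaxIterations(). A positive value
  // re-enables the check:
  optimizer.Tolerance() = 1e-5;
}
```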
I am not familiar with progress_bar.hpp, so I do not know if the number of epochs is calculated correctly. Also, I have noticed ETA=0 several times during the progress of the bar; this might be an issue as well. Thank you for bringing it up here! :+1:
Hi, @shrit, thanks for commenting. As you suggested, I changed the optimizer.Tolerance() value from -1 to 120596 and let the optimizer run for three epochs until it converged. This was the output:
Epoch 1/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
208.982
1875/1875 [====================================================================================================] 100% - 5s 34ms/step - loss: 208.982
Epoch 2/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1.47048
1875/1875 [====================================================================================================] 100% - 7s 36ms/step - loss: 1.47048
Epoch 3/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
0.523849
1875/1875 [====================================================================================================] 100% - 4s 34ms/step - loss: 0.523849
As can be seen, there is no message stating that the optimizer has converged, which should have been printed according to this. But when I changed Info to std::cout, I got the expected result, i.e.:
Epoch 1/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
208.982
1875/1875 [====================================================================================================] 100% - 6s 35ms/step - loss: 208.982
Epoch 2/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1.47048
1875/1875 [====================================================================================================] 100% - 5s 34ms/step - loss: 1.47048
Epoch 3/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
0.523849
1875/1875 [====================================================================================================] 100% - 6s 35ms/step - loss: 0.523849
SGD: minimized within tolerance 120596; terminating optimization.
This is weird, because it makes me think that the error lies in ens::Info. @rcurtin @zoq, what do you think about this?
> (I am not sure if you have used your pull request mlpack/mlpack#2238 in this case, but surely it should print the warning message)

Yes, I used it, and the warning message gets printed as expected. Kindly let me know if I am wrong somewhere. Thanks.
@gaurav-singh1998 I do not remember if the ens::Info macros are ON by default in CMake. If not, maybe you need to add option(ENS_PRINT_INFO ON) in your CMakeLists file.
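For reference, since ensmallen is header-only, another way (assuming the ENS_PRINT_INFO / ENS_PRINT_WARN macros gate ens::Info and ens::Warn, as described in the ensmallen documentation) is to define the macros before the include, or to pass -DENS_PRINT_INFO -DENS_PRINT_WARN on the compiler command line:

```cpp
// Must come before the ensmallen include so the logging macros see them.
#define ENS_PRINT_INFO
#define ENS_PRINT_WARN
#include <ensmallen.hpp>
```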
Hi, @shrit, sorry for the late response, but when I made the changes in this line to define ENS_PRINT_INFO, as suggested by @zoq on Gitter, I got a richer result, i.e.:
Maximum number of iterations not defined (no limit), no progress bar shown.
Epoch 1/0
208.98275 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 208.982
SGD: iteration 120000, objective 2.50779e+07.
Epoch 2/0
1.4704875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 1.47048
SGD: iteration 240000, objective 176458.
Epoch 3/0
0.5238495 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 0.523849
SGD: iteration 360000, objective 62861.9.
SGD: minimized within tolerance 120596; terminating optimization.
Although, in my opinion, the first line, Maximum number of iterations not defined (no limit), no progress bar shown., shouldn't be printed, everything else looks fine to me. What do you think about this?
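If that first line comes from MaxIterations() being left at 0 (which ensmallen's optimizers treat as "no limit", leaving ProgressBar with no total to count toward), then setting an explicit bound should avoid it; a minimal sketch with a hypothetical value:

```cpp
#include <ensmallen.hpp>

int main()
{
  ens::SGD<> optimizer;
  // Hypothetical bound: three full passes over 120000 data points.
  optimizer.MaxIterations() = 3 * 120000;
}
```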
Hi, @shrit kindly have a look at #180. Thanks.