Latent/reconstruction loss function might have an error
I have recalculated the latent loss function from the M-ELBO under the assumptions made in the original DONUT paper (https://arxiv.org/abs/1802.03903) and got different results from what is used in donut.py. Performance might improve when using the correct latent loss. In the PDF you can find my derivations and detailed explanations.
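For reference, with a standard Normal prior and a diagonal Gaussian posterior, the KL term of the ELBO has a well-known closed form. A minimal NumPy sketch (function and argument names are illustrative, not loudml's actual API):

```python
import numpy as np

def latent_kl(z_mean, z_log_var):
    """KL divergence KL(q(z|x) || p(z)) between a diagonal Gaussian
    posterior N(z_mean, exp(z_log_var)) and a standard Normal prior,
    summed over the latent dimensions."""
    z_mean = np.asarray(z_mean, dtype=float)
    z_log_var = np.asarray(z_log_var, dtype=float)
    return -0.5 * np.sum(
        1.0 + z_log_var - z_mean ** 2 - np.exp(z_log_var), axis=-1
    )
```

When the posterior equals the prior (zero mean, unit variance) the term vanishes, which is a quick sanity check for any reimplementation.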
Furthermore, you do not parametrize the variance in the decoder, as is done in the original DONUT model, which leads to a different reconstruction loss. You can see this by writing out the log-likelihood of the multivariate Normal distribution, which then additionally depends on the variance. This is fine in general, but the advantage of modeling the variance is a slightly more expressive model and the ability to express that the model is more uncertain about some dimensions than others. The disadvantage of parametrizing the variance is a higher vulnerability to overfitting on a small dataset: the model can memorize the mean perfectly and let the variance converge to zero, driving the loss to negative infinity.
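To illustrate the dependence on the variance, here is a minimal sketch of a Gaussian negative log-likelihood with a learned log-variance (names are illustrative, not loudml's actual API):

```python
import numpy as np

def gaussian_nll(x, x_mean, x_log_var):
    """Negative log-likelihood of x under N(x_mean, exp(x_log_var)),
    summed over dimensions. With a fixed unit variance this reduces
    (up to a constant) to the squared error; with a learned variance
    the extra x_log_var term also rewards shrinking the variance
    wherever the mean fits well."""
    x, x_mean, x_log_var = map(np.asarray, (x, x_mean, x_log_var))
    return 0.5 * np.sum(
        np.log(2.0 * np.pi) + x_log_var
        + (x - x_mean) ** 2 / np.exp(x_log_var),
        axis=-1,
    )
```

Note that with a perfectly fitted mean, driving the log-variance toward minus infinity drives this loss toward minus infinity, which is exactly the overfitting failure mode described above.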
Although there are many more things to say concerning the architecture and type of the autoencoder, I want to bring the following to your attention: using the logarithmic variance as output of the encoder allows the encoder to capture variances of different scales more efficiently and is common practice. However, notice that the logarithm is (negatively) unbounded for values approaching zero. This causes the ReLU activation functions to saturate, and the weights feeding into these neurons will no longer be updated during gradient descent (or a modified version thereof). This problem can be mitigated by using the approach in the original DONUT paper (section 3.1), where an epsilon trick is used.
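A minimal sketch of the softplus-plus-epsilon construction from section 3.1 (the epsilon value here is illustrative):

```python
import numpy as np

def softplus_std(raw, epsilon=1e-4):
    """Map an unconstrained network output to a strictly positive
    standard deviation: softplus keeps it positive and smooth, and
    the epsilon floor keeps log(sigma) bounded from below so the
    gradients do not vanish when the network pushes sigma toward zero."""
    raw = np.asarray(raw, dtype=float)
    # log(1 + exp(raw)) computed stably for large |raw|
    softplus = np.logaddexp(0.0, raw)
    return softplus + epsilon
```

Even for very negative raw outputs the standard deviation is floored at epsilon, so log(sigma) stays finite.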
Thanks a lot for the detailed feedback! I will definitely double-check this.
This problem can be mitigated by using the approach in the original DONUT paper (section 3.1) where an epsilon trick is used.
Would you like to submit a pull-request?
We definitely welcome contributions (fixes, new model types, documentation) so that all users can benefit from the updates and enhancements. The project is free and open source under the MIT license.
Would you like to submit a pull-request?
I am having trouble running all the unit tests. At least one test in test_donut.py seems to be faulty, since it does not pass even for the unmodified loudml itself. Furthermore, the unit tests are highly non-transparent. Normally, you would aim for testing about 5 different datasets and output the loss for the respective test sets along with the results of (classical) model evaluation. I am therefore unable to evaluate whether improvements were made.
May I ask why your autoencoder differs so much from the original DONUT? When I was using the original one, the performance of loudml was easily surpassed. Loudml does not seem to work well with any real data. Why are you, for example, using one layer less in the encoder and decoder, whilst setting the L2 regularization coefficient 10 times as large as in DONUT? This severely decreases the model complexity. Could you briefly elaborate on the changes made to the architecture and why you made them?
Would you like to submit a pull-request?
I am having trouble running all the unit tests. At least one test in test_donut.py seems to be faulty, since it does not pass even for the unmodified loudml itself.
There is variance due to RNG seed values. I'm pushing commits on master to fix this in the coming days.
Furthermore, the unit tests are highly non-transparent. Normally, you would aim for testing about 5 different datasets and output the loss for the respective test sets along with the results of (classical) model evaluation.
The main intent of unit tests is to avoid functional regressions when the software changes: upgrading a package dependency, making changes to a function, etc. They are a quality gate executed in the CI pipeline.
Evaluating against classical models is useful but probably outside the scope of unit tests as intended. But let's do it. It's easy to roll out a VAE baseline model and compare.
May I ask why your autoencoder differs so much from the original DONUT? When I was using the original one, the performance of loudml was easily surpassed. Loudml does not seem to work well with any real data. Why are you, for example, using one layer less in the encoder and decoder, whilst setting the L2 regularization coefficient 10 times as large as in DONUT? This severely decreases the model complexity. Could you briefly elaborate on the changes made to the architecture and why you made them?
Using variance, the softplus layers and the epsilon trick are on the todo list. Time is the resource we all need. Help much appreciated, and happy to receive feedback.
There is variance due to RNG seed values.
The weight initialization in Keras is done with the Glorot uniform initializer by default. The optimizer is quite robust with respect to this initialization, but it might still be the cause. In compute_bucket_scores you use the midvalue as an estimator of the mean, which is not robust, so maybe this is the source of the problem, as it is exactly the "y_low_high" test that fails.
The main intent of unit tests is to avoid functional regressions when the software changes: upgrading a package dependency, making changes to a function, etc.
Then I will try to run donut.py standalone. I just thought it would make sense to have a test for model performance, since it might depend on the version, too.
But let's do it. It's easy to roll out a VAE baseline model and compare.
I did this: I chose a dataset in which the first differences were approximately normally distributed (most of the ones I have used so far share this property) and simply applied a Grubbs outlier test (level of significance = 0.005) to the differences. Note that this approach is very naive, and hence serves quite well as a baseline. However, it outperformed loudml on all non-test datasets.
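A sketch of such a Grubbs-based baseline on the differenced series, assuming SciPy is available (function names are illustrative; the critical value is the standard two-sided Grubbs threshold):

```python
import numpy as np
from scipy import stats

def grubbs_outliers(x, alpha=0.005):
    """Iteratively apply the two-sided Grubbs test to x and return the
    indices flagged as outliers. Each pass removes the single most
    extreme point if its Grubbs statistic exceeds the critical value."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(len(x))
    outliers = []
    while len(x) > 2:
        n = len(x)
        mean, std = x.mean(), x.std(ddof=1)
        if std == 0:
            break
        g = np.abs(x - mean) / std
        i = int(np.argmax(g))
        # critical value from the Student t distribution at level alpha
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t ** 2 / (n - 2 + t ** 2))
        if g[i] <= g_crit:
            break
        outliers.append(int(idx[i]))
        x = np.delete(x, i)
        idx = np.delete(idx, i)
    return outliers
```

Applied to np.diff of a series, any flagged index marks a jump between consecutive points that is unlikely under the normality assumption.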
Using variance, the softplus layers and epsilon trick is on the todo list.
This will not take long to implement, once I get the tests running!
Help much appreciated, and happy to receive feedback.
I am very much inclined to help, and this is why I call so many things into question. I simply suspect that the architecture of the autoencoder is suboptimal. So is there a particular reason you changed the network architecture (did it perform better or equally well on some data)? I assume you are aware that you work with about half the layers of DONUT, correct? Lastly, have you checked the issue mentioned in my first post (latent loss function)? If not, just link the paper or article you used to compute the loss and I will check!
Just pushed fresh changes on master and upgraded to TF 1.13.2 at the same time. So if you have a virtualenv, you can run pip install -r base/vendor/requirements.txt to upgrade.
Changes:
- Added the missing layers and set L2 regularization, clip norm, etc. to the empirical values defined in the original publication
- Used beta as suggested in latent loss calculation
Tip: to set the random seed and get reproducible output during development, you can run:
export PYTHONHASHSEED=0 ; export RANDOM_SEED=10 ; nosetests -v -s tests/test_donut.py
If 10 is not working for you, you can pick another seed value. I agree some unit tests are not solid and changing the seed value can easily cause some tests to fail.
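A sketch of how the RANDOM_SEED environment variable from the tip above could be applied inside Python; the exact integration with loudml is an assumption on my part:

```python
import os
import random

import numpy as np

# Read the seed from the environment, defaulting to 10 as in the tip.
seed = int(os.environ.get("RANDOM_SEED", "10"))
random.seed(seed)
np.random.seed(seed)
# With TF 1.x one would additionally call:
#   import tensorflow as tf
#   tf.set_random_seed(seed)
```

PYTHONHASHSEED has to be set before the interpreter starts (hence the export in the shell command), while the calls above can run at import time.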
Update: the master branch now packages the server code. You can find the CLI client on PyPI under the name loudml-python.
The missing Donut layers have been added, as well as the default hyperparameters from the original publication. Softplus is still missing.
Is more help needed to make progress on this issue? Let me know.