Dustin Tran
For some idea of what the visuals look like, I quite like the stuff in http://arxiv.org/abs/1206.1106, for example (nothing in particular, just an arbitrary paper I chose), i.e. high-resolution fonts,...
Picture of current progress.

Bugs/things to continue working on:
- sgd gives nonsensical prediction results (could be a result of bad learning...
Progress:

1. It was definitely just a problem of setting the hyperparameter `alpha` in Xu's learning rate. This also still needs to...
I generalized the `d`-dimensional learning rate `D_n` to have hyperparameters `α` and `c`:
```
I_hat = α*I_hat + diag(I_hat_new)
D_n = 1/(I_hat)^c
```
The observed Fisher information `I_hat` is the...
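For concreteness, here is a minimal R sketch of that update (placeholder names, not the package implementation); it assumes the per-step diagonal `diag(I_hat_new)` is approximated by the squared stochastic gradient:

```R
# Sketch of the generalized d-dimensional learning rate: accumulate a running
# diagonal estimate of the observed Fisher information, then scale the
# gradient elementwise by its -c power.
sgd_diag_lr <- function(grad_fn, theta0, n_iter, alpha = 1, c = 1/2, eps = 1e-8) {
  theta <- theta0
  I_hat <- rep(0, length(theta0))      # running diagonal Fisher estimate
  for (n in seq_len(n_iter)) {
    g <- grad_fn(theta, n)             # stochastic gradient at step n
    I_hat <- alpha * I_hat + g^2       # diag(I_hat_new) approximated by g^2 (assumption)
    D_n <- 1 / (I_hat + eps)^c         # per-coordinate learning rate D_n
    theta <- theta - D_n * g           # SGD step with elementwise scaling
  }
  theta
}
```

With `alpha = 1` and `c = 1/2` this reduces to AdaGrad's diagonal scaling.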
Yup.
```R
library(sgd)
# Dimensions
N
```
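For reference, a self-contained sketch of the kind of setup that snippet starts; the dimensions, data-generating process, and the `model = "lm"` call are my assumptions rather than the original code:

```R
library(sgd)

# Illustrative dimensions (assumed, not the original values)
N <- 1e4
d <- 10

# Simulate a linear model and fit it with sgd(); the formula/data/model
# interface is assumed here.
X <- matrix(rnorm(N * d), ncol = d)
theta <- rep(5, d + 1)
y <- cbind(1, X) %*% theta + rnorm(N)
dat <- data.frame(y = y, x = X)
fit <- sgd(y ~ ., data = dat, model = "lm")
```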
I've been trying to dig into the theory and am thoroughly perplexed. The paper looks at minimizing the regret function using the Mahalanobis norm, which generalizes L2. That is, we...
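For context, the objects in question, written in my own notation (not necessarily the paper's):

```latex
% Mahalanobis norm induced by a positive definite matrix A; A = I recovers L2.
\|x\|_A = \sqrt{x^\top A x}

% Online regret of the iterates \theta_1, \dots, \theta_T against the best fixed point.
R(T) = \sum_{t=1}^{T} f_t(\theta_t) \;-\; \min_{\theta} \sum_{t=1}^{T} f_t(\theta)
```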
Yup, would definitely be interesting to see. That is, we'd check the variance of the two estimates as `n -> infty` through a plot.
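A hedged sketch of what that check could look like in R; the two learning-rate schedules below are stand-ins for whichever estimators are actually being compared:

```R
# Compare the empirical variance of two SGD-style estimators of a scalar mean
# as n grows, by repeating each run and taking the variance of the final iterate.
set.seed(42)
est_var <- function(n, lr_fn, reps = 200) {
  finals <- replicate(reps, {
    theta <- 0
    x <- rnorm(n, mean = 1)                       # data with true mean 1
    for (i in seq_len(n)) {
      theta <- theta + lr_fn(i) * (x[i] - theta)  # stochastic update toward the mean
    }
    theta
  })
  var(finals)
}

ns <- c(100, 500, 1000, 5000, 10000)
v1 <- sapply(ns, est_var, lr_fn = function(i) 1 / i)        # placeholder schedule 1
v2 <- sapply(ns, est_var, lr_fn = function(i) 1 / sqrt(i))  # placeholder schedule 2

plot(ns, v1, type = "b", log = "xy", xlab = "n", ylab = "empirical variance")
lines(ns, v2, type = "b", col = "red")
legend("topright", legend = c("1/n rate", "1/sqrt(n) rate"),
       col = c("black", "red"), lty = 1)
```

Plotting both variance curves on a log-log scale makes the convergence rates easy to compare.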
As a reminder (to self), this was looked at and briefly mentioned in the current draft for the NIPS submission. The intuition behind why AdaGrad leads to better empirical performance...
You can look at the method of moments example in the repo. It implements a gradient function which is passed into SGD. This can be useful for simple prototyping, bu...
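A generic sketch of that prototyping pattern; this is not the package's interface, and the function and argument names here are made up for illustration:

```R
# Prototype an estimator by writing only its gradient function and handing it
# to a bare-bones SGD loop that visits one observation at a time.
sgd_prototype <- function(grad_fn, theta0, data, n_pass = 1, lr = 1e-2) {
  theta <- theta0
  for (pass in seq_len(n_pass)) {
    for (i in sample(nrow(data))) {
      theta <- theta - lr * grad_fn(theta, data[i, , drop = FALSE])  # one-observation update
    }
  }
  theta
}

# Example: a method-of-moments-style gradient for a normal mean (illustrative only).
grad_mean <- function(theta, row) 2 * (theta - row$y)
dat <- data.frame(y = rnorm(500, mean = 3))
sgd_prototype(grad_mean, theta0 = 0, data = dat)
```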
The current implementation for the Cox model uses it. It's not worth the effort yet to code up general classes of models where this IRLS+SGD idea would work—at least not...