mlcourse
Tie Estimation Error to Variance
Given a training sample, we get an estimator in the hypothesis space. The performance gap between the estimator and the best function in the space is the estimation error. The estimator is a random function, so if we repeat the procedure on a new training set, we end up with a new estimator. We can show a different point for each new batch of data, clustering around the optimum. If we take a larger training set, the variance of those points should decrease. I don't know of a precise measure of this "variance", but if I draw it this way, I need to point out that this is just a cartoon, in which points in the space correspond to prediction functions, and closer points correspond to prediction functions with more similar predictions (say, in L2 norm for score functions, or probability of disagreement for classifiers).
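To make the cartoon slightly more concrete, here's a hypothetical sketch (my own toy setup, not from the course materials): fit the same estimator on many independent training sets and measure the spread of the fitted prediction functions by their L2 distances on a grid.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_training_set(n):
    # Hypothetical data-generating process: y = sin(2*pi*x) + Gaussian noise.
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n)
    return x.reshape(-1, 1), y

grid = np.linspace(0, 1, 200).reshape(-1, 1)  # points where we compare predictions

def fitted_predictions(n, n_repeats=50):
    # Fit one estimator per fresh training set; each row is one prediction
    # function evaluated on the grid -- one "point" in the cartoon.
    preds = []
    for _ in range(n_repeats):
        X, y = sample_training_set(n)
        model = DecisionTreeRegressor(max_depth=4).fit(X, y)
        preds.append(model.predict(grid))
    return np.array(preds)

def spread(preds):
    # Empirical L2 "variance": mean squared distance of each prediction
    # function from the average prediction function, approximated on the grid.
    mean_pred = preds.mean(axis=0)
    return np.mean(np.sum((preds - mean_pred) ** 2, axis=1) / preds.shape[1])

for n in [20, 80, 320]:
    print(n, spread(fitted_predictions(n)))  # spread should shrink as n grows
```

The printed spread should shrink as n grows, matching the tightening cluster in the cartoon.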
Probably relevant here is Pedro Domingos's paper on generalizing bias-variance decompositions beyond the square loss: http://homes.cs.washington.edu/~pedrod/bvd.pdf
I thought about this back when I watched the videos. For parametric estimators, you can talk about your uncertainty in the parameter values (I made a concept-check question about the covariance matrix of \hat{w} for least squares linear regression and for ridge regression). In general, I think L2 methods are the way to go, but I don't have a reference.
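For reference, the standard closed forms (assuming a fixed design X and y = Xw^* + \varepsilon with \varepsilon \sim N(0, \sigma^2 I); these are textbook results, not the concept-check solution itself):

```latex
\hat{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top y,
  \qquad
  \operatorname{Cov}(\hat{w}_{\text{OLS}}) = \sigma^2 (X^\top X)^{-1},

\hat{w}_{\lambda} = (X^\top X + \lambda I)^{-1} X^\top y,
  \qquad
  \operatorname{Cov}(\hat{w}_{\lambda})
    = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X \, (X^\top X + \lambda I)^{-1}.
```

The ridge covariance shrinks toward zero as \lambda grows, which matches the intuition that regularization reduces the variance of the estimator.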
Hi David,
What do you think about the following visualizations for the excess risk decomposition?
- Decision tree -- expand on the classification problem from the slides: (a) 2D plots similar to pp. 30-31 of your Excess Risk Decomposition slides, with a few different sample sizes: we can plot the depth of the tree on the x-axis and the error on the y-axis, decomposed into estimation, approximation, and optimization errors by colored bar (a rough sketch of this appears after the list). (b) Also 3D plots showing the depth on the x-axis, the sample size on the y-axis, and the error on the z-axis, decomposed into three surfaces representing estimation, approximation, and optimization errors.
- Linear model -- \hat{y}(x) = a + b x_1 + c x_2 where x = (x_1, x_2). We sample from y(x) = w_0 + w_1 x_1 + w_2 x_2 + \epsilon where \epsilon \sim N(0, 2^2). We plot the clustering of \hat{w} = (a, b, c), illustrating the estimation error for different sample sizes.
- Ridge regression -- we can plot the error vs. complexity. Do you have a particular distribution to sample from in mind?
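For (a), here's a rough, hypothetical sketch of how the decomposition could be computed (my own toy distribution; the "best in class" is approximated by a tree of the same depth trained on a huge sample, so the greedy optimization error is folded into that proxy):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def sample(n):
    # Hypothetical 2D classification problem with 10% label noise.
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)
    flip = rng.random(n) < 0.1
    return X, np.where(flip, 1 - y, y)

X_test, y_test = sample(200_000)  # large test set approximates the true risk
bayes_risk = 0.1                  # known here because we chose the noise level

def risk(model):
    return np.mean(model.predict(X_test) != y_test)

for depth in [1, 2, 4, 8, 16]:
    # Proxy for the best-in-class risk: same depth, trained on a huge sample.
    # (Greedy splitting means some optimization error is folded in here.)
    X_big, y_big = sample(100_000)
    best_in_class = risk(DecisionTreeClassifier(max_depth=depth).fit(X_big, y_big))
    X_tr, y_tr = sample(500)
    fitted = risk(DecisionTreeClassifier(max_depth=depth).fit(X_tr, y_tr))
    approx = best_in_class - bayes_risk   # approximation error
    estim = fitted - best_in_class        # estimation (+ optimization) error
    print(f"depth={depth:2d}  approx={approx:.3f}  estim={estim:.3f}")
```

Stacking the two printed components as colored bars over depth would give the plot in (a); repeating over sample sizes gives the surfaces in (b).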
Thank you.
Best, Vlad
Good evening David,
- I posted a 2D animation for GD with a fixed step at https://github.com/davidrosenberg/mlcourse-homework/blob/master/in-prep/recitations/gd_fixed_step_2d.ipynb Please let me know if this is what you had in mind. I will overlay the other gradient descent methods we discussed tomorrow (Friday).
- For the demo of the distribution of minibatch SGD directions, are you OK if we use a ridge regression model and sample from a linear model with additive Gaussian noise (a rough sketch appears after this list)? Also, did you have a particular step size in mind, e.g., 1/n?
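To make the intended demo concrete, a minimal sketch under those assumptions (a toy linear model with additive Gaussian noise; the setup and numbers are illustrative, not the final notebook). It compares minibatch gradient directions to the full-batch gradient at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: linear model with additive Gaussian noise, as discussed.
n, d = 1000, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true + rng.normal(0, 1.0, size=n)

w = np.zeros(d)  # evaluate gradients at one fixed point

def grad(idx):
    # Gradient of the average squared loss over the rows in idx.
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

full = grad(np.arange(n))

for batch_size in [1, 10, 100]:
    cos = []
    for _ in range(500):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = grad(idx)
        cos.append(g @ full / (np.linalg.norm(g) * np.linalg.norm(full)))
    # Larger minibatches concentrate around the full-batch direction.
    print(batch_size, np.mean(cos), np.std(cos))
```

Plotting the sampled directions as arrows (or the cosine similarities as histograms) per batch size would show the distribution tightening around the full-batch gradient.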
Thank you very much.
Best, Vlad
Hi Vlad -- the 2D animation looks good. For minibatch SGD, what about just linear regression (no ridge penalty)? A linear model with additive Gaussian noise sounds fine. Let's start with a fixed step size (i.e., a fixed multiplier of the minibatch gradient).
David, thank you! I think we need the step size \eta_t to converge to zero for the minibatch method to converge. When I run the minibatch code with a fixed step size, it doesn't converge; it does converge when I try \eta_t = 1/t. Are you OK with this, or perhaps I am misunderstanding something?
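For concreteness, a toy comparison of the two schedules (my own illustrative setup, not the recitation code): with a fixed step, minibatch SGD typically stalls in a noise ball around the empirical risk minimizer, while \eta_t = 1/t satisfies the Robbins-Monro conditions \sum_t \eta_t = \infty and \sum_t \eta_t^2 < \infty and converges, if slowly.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 1000, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true + rng.normal(0, 1.0, size=n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # empirical risk minimizer

def run_sgd(step, T=20_000, batch_size=10):
    w = np.zeros(d)
    for t in range(1, T + 1):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w -= step(t) * g
    return np.linalg.norm(w - w_star)

print("fixed eta=0.1:", run_sgd(lambda t: 0.1))      # stalls in a noise ball
print("eta_t = 1/t  :", run_sgd(lambda t: 1.0 / t))  # converges toward w_star
```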
I think it's fine.