
When it comes to data, size matters.

Open Benjamin-Lee opened this issue 5 years ago • 5 comments

Benjamin-Lee commented Oct 24 '18

Also the strength of the structure in the data, quality of the labels, noise in the outcome that is due to stochasticity in the biology, etc. Can we just say "data matters a lot" or something and then dive into details there?

cgreene commented Nov 01 '18

I'm not sure how to articulate this, but dataset size is relative to the complexity of the problem. We've had problems where the input-output mapping is straightforward enough to learn with 100s of instances. In other cases, ~1 million instances aren't sufficient because the domain is so complex.

That's essentially a rephrasing of Casey's "strength of the structure in the data" point.
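A toy illustration of this point (mine, not from the thread): hold the model and the sample size fixed and vary only how rapidly the target function varies. A simple 1-nearest-neighbor regressor, with the same 100 training points, does far better on a smooth target than on a wiggly one, so the data required scales with the complexity of the input-output mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_test_error(target, n_train=100, n_test=500):
    """Test MSE of 1-nearest-neighbor regression on a 1-D target function."""
    x_train = rng.uniform(0, 1, n_train)
    y_train = target(x_train)
    x_test = rng.uniform(0, 1, n_test)
    # 1-NN prediction: copy the label of the nearest training point.
    nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return float(np.mean((target(x_test) - y_train[nearest]) ** 2))

# Same model, same n; only the "complexity" of the target changes.
err_simple = one_nn_test_error(lambda x: np.sin(2 * np.pi * x))
err_complex = one_nn_test_error(lambda x: np.sin(30 * np.pi * x))
print(err_simple < err_complex)  # the rapidly varying target needs far more data
```

The gap closes only by increasing `n_train` for the complex target, which is the proposed rule in miniature.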

agitter commented Nov 01 '18

Maybe something along the lines of "The more complex the domain, the more data are required" as the title of the rule?

Benjamin-Lee commented Nov 01 '18

I frequently get comments from investigators along the lines of "that's true for statistical power... which is why we want to use an AI technique." So, for example, I'll suggest a single-layer perceptron with softmax activation and, if I get a positive response, note that this is mathematically identical to logistic regression, for which we can easily perform a power analysis. The reaction is, as you might expect, not favorable; the investigators are seeking magic AI fairy dust. Ben Recht, who knows a thing or two about both AI and regularization, has a wonderful piece where he argues that optimizing a neural-net fit is trivial compared to regularizing it (i.e., fitting your data is easy, but having that model fit someone else's data can be very hard).

http://www.argmin.net/2016/04/18/bottoming-out/

Fitting a nonlinear model with an NN (a universal function approximator, given sufficient depth) is not hard; keeping the model from overfitting is horrendously hard. Ben Recht has made this point before, and I'm looking for a succinct way to communicate it, once and for all. If the only thing this paper accomplished was to establish that (sample) size matters, in AI as elsewhere, that would be enough.
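The fit-vs-generalize gap shows up even in the smallest toy (illustrative numbers, not from the thread): give a model as many parameters as training points and it will drive training error to essentially zero, while held-out error stays dominated by the noise it memorized.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)  # true signal

# Small noisy sample: 10 points, noise sd 0.3.
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = f(x_train) + rng.normal(0, 0.3, 10)

# Degree-9 polynomial: 10 parameters for 10 points, so it interpolates the noise.
coefs = np.polyfit(x_train, y_train, deg=9)
train_mse = float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))

# Held-out points drawn from the same range.
x_test = rng.uniform(0.05, 0.95, 200)
test_mse = float(np.mean((np.polyval(coefs, x_test) - f(x_test)) ** 2))

print(train_mse < 1e-4, test_mse > train_mse)  # fitting is easy; generalizing is not
```

Getting `train_mse` to zero required no cleverness at all; keeping `test_mse` down is the entire problem, and with 10 samples no amount of optimization fixes it.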

I would have thought that the immortal "more data beats better models" would do the trick, but somehow there are investigators who doubt that (e.g.) Google knows what they're doing, and therefore the magical AI fairy will work on their tiny dataset. Maybe. But a general demonstration of where this is true (if indeed there are any good examples) and where it is demonstrably not (e.g. Google Flu?) would go a long way.

ttriche commented Nov 01 '18

Should this be connected to #3? I'm thinking in terms of "make sure your training set is big enough" and "DL will typically need larger training sets, given that more parameters need to be fitted, although there are methods that perform implicit or explicit regularization, which may help a bit."

SiminaB commented Nov 01 '18