
Question about convergence on repetitive data

Open mrocklin opened this issue 8 years ago • 6 comments

This is a question for @moody-marlin and @mcg1969

I expect that many large datasets will be highly repetitive. That is, I expect a sample of the data to be decently representative of the full dataset. Given this structure, it feels inefficient for our iterative algorithms to go over the entire dataset before updating parameters.

Are there situations in which it is advantageous to cycle through different subsets of the full dataset?

As a naive (probably wrong) example, perhaps we would partition the data into ten randomly selected subsets and then perform a single round of gradient descent on each in turn to obtain a new gradient direction.

mrocklin avatar Feb 25 '17 14:02 mrocklin

This is the insight that randomized algorithms like stochastic gradient descent take advantage of; in theory, you could compute the gradient on a randomly selected chunk and use it to update the coefficients at each iteration. I'm not convinced this would work well with only 10 chunks, but we could also take random subsets of data from a given chunk to increase the randomness. I do think there is an appetite for randomized algorithms, and with the current code structure plus how easy dask makes it to get a list of chunks, it would be fairly easy to write up a simple implementation.
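A rough sketch of that chunk-wise SGD idea, using a toy 1-D least-squares model in place of a real GLM (the data, chunk layout, learning rate, and iteration count are all made up for illustration; a real implementation would operate on dask.array chunks rather than Python lists):

```python
import random

random.seed(0)

# Full "dataset": 1000 points on the line y = 3x, pre-split into 10 chunks.
xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = [3.0 * x for x in xs]
chunks = [(xs[i::10], ys[i::10]) for i in range(10)]

w = 0.0   # initial coefficient
lr = 0.1  # step size

for step in range(200):
    cx, cy = random.choice(chunks)  # gradient on one random chunk only
    grad = sum(2 * (w * x - y) * x for x, y in zip(cx, cy)) / len(cx)
    w -= lr * grad

print(w)  # should be close to the true coefficient 3.0
```

Each update touches a tenth of the data, so ten such updates cost about one full pass but adjust the coefficients ten times along the way.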

Related, what are the costs of rechunking large datasets every 10-20 iterations?

cicdw avatar Mar 01 '17 15:03 cicdw

If you want to do a full shuffle, then the answer is "somewhat expensive". If you want to split per-chunk, then the answer is "pretty cheap".

We can trivially cycle through random chunks, though in the common case this probably violates i.i.d. assumptions. We can easily produce ten 1/10th-size datasets by splitting each chunk into ten pieces and treating those splits as ten dask.arrays. This increases the partition count by 10x, which is usually OK (though maybe not if we're running into scheduling overheads, which we are). Full shuffles are fairly expensive, though I might be overly pessimistic here.
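A sketch of that per-chunk split, with plain Python lists standing in for dask.array chunks (the chunk sizes and counts here are made up): each original chunk is sliced into ten strided pieces, and piece i from every chunk is grouped into subset i.

```python
n_subsets = 10
# Four hypothetical chunks of 100 elements each.
original_chunks = [list(range(c * 100, (c + 1) * 100)) for c in range(4)]

subsets = [
    [chunk[i::n_subsets] for chunk in original_chunks]  # one piece per chunk
    for i in range(n_subsets)
]

# Partition count grows from 4 chunks to 4 * 10 = 40 pieces overall.
total_pieces = sum(len(s) for s in subsets)
print(total_pieces)  # 40

# Together the ten subsets still cover every element exactly once.
flat = sorted(x for s in subsets for piece in s for x in piece)
print(flat == list(range(400)))  # True
```

This is the cheap option: no data moves between chunks, at the cost of the 10x partition-count increase mentioned above.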

mrocklin avatar Mar 01 '17 15:03 mrocklin

@moody-marlin for admm, does it matter if data is sorted or randomized?

hussainsultan avatar Mar 14 '17 02:03 hussainsultan

I suspect that randomization matters more if you consider only batches or samples at a time.

mrocklin avatar Mar 14 '17 02:03 mrocklin

@hussainsultan yea, randomization is best -- sorted is probably not a good idea (example: imagine one of the local updates tries to fit a logistic regression on data that has no observed 1 event).
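A small sketch of that failure mode (labels and chunk boundaries made up for illustration): if the data is sorted by label before chunking, a local update can see only one class, and for an intercept-only logistic model the log-likelihood gradient then has the same sign for every finite intercept, so the local fit has no finite optimum.

```python
import math

# Labels sorted before chunking, so the first chunk contains only 0s.
labels = sorted([0] * 50 + [1] * 50)
chunk = labels[:25]  # the data one local update would see
print(set(chunk))    # {0}: no observed 1 events

# Gradient of the log-likelihood w.r.t. the intercept b is sum(y - sigmoid(b)).
# With all-zero labels, every term is -sigmoid(b) < 0, so the gradient is
# negative for every finite b and the intercept is pushed toward -infinity.
def grad(b, ys):
    p = 1.0 / (1.0 + math.exp(-b))
    return sum(y - p for y in ys)

print(all(grad(b, chunk) < 0 for b in (-10.0, 0.0, 10.0)))  # True
```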

cicdw avatar Mar 14 '17 14:03 cicdw

ADMM will still converge if the data is not randomized, but it may take a long time; is that what you meant by not a good idea?

hussainsultan avatar Mar 14 '17 17:03 hussainsultan