
Question about convergence on repetitive data

Open mrocklin opened this issue 8 years ago • 6 comments

This is a question for @moody-marlin and @mcg1969

I expect that many large datasets will be highly repetitive. That is, I expect a sample of the data to be decently representative of the full dataset. Given this structure, it feels inefficient for our iterative algorithms to go over the entire dataset before updating parameters.

Are there situations in which it is advantageous to cycle through different subsets of the full dataset?

As a naive (probably wrong) example, perhaps we would partition the data into ten randomly selected subsets and then perform a single round of gradient descent on each in turn to obtain a new gradient direction.

mrocklin avatar Feb 25 '17 14:02 mrocklin

This is the insight that randomized algorithms like stochastic gradient descent take advantage of; in theory, you could compute the gradient on a randomly selected chunk and use it to update the coefficients at each iteration. I'm not convinced this would work well with only 10 chunks, but we could also take random subsets of data from a given chunk to increase the randomness. I do think there is an appetite for randomized algorithms, and with the current code structure plus how easy dask makes it to get a list of chunks, it would be fairly easy to write up a simple implementation.
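A rough sketch of that chunk-wise SGD idea, using a toy 1-D least-squares model in place of a real GLM (the data, chunk layout, learning rate, and iteration count are all made up for illustration; a real implementation would operate on dask.array chunks rather than Python lists):

```python
import random

random.seed(0)

# Full "dataset": 1000 points on the line y = 3x, pre-split into 10 chunks.
xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = [3.0 * x for x in xs]
chunks = [(xs[i::10], ys[i::10]) for i in range(10)]

w = 0.0   # initial coefficient
lr = 0.1  # step size

for step in range(200):
    cx, cy = random.choice(chunks)  # gradient on one random chunk only
    grad = sum(2 * (w * x - y) * x for x, y in zip(cx, cy)) / len(cx)
    w -= lr * grad

print(w)  # should be close to the true coefficient 3.0
```

Each update touches a tenth of the data, so ten such updates cost about one full pass but adjust the coefficients ten times along the way.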

Related, what are the costs of rechunking large datasets every 10-20 iterations?

cicdw avatar Mar 01 '17 15:03 cicdw

If you want to do a full shuffle, then the answer is "somewhat expensive". If you want to split per-chunk, then the answer is "pretty cheap".

We can trivially cycle through random chunks, though in the common case this probably violates i.i.d. assumptions. We can easily produce ten 1/10th-size datasets by splitting each chunk into ten pieces and treating those splits as ten dask.arrays. This increases the partition count by 10x, which is usually OK (though maybe not if we're running into scheduling overheads, which we are). Full shuffles are fairly expensive, though I might be overly pessimistic here.
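A sketch of that per-chunk split, with plain Python lists standing in for dask.array chunks (the chunk sizes and counts here are made up): each original chunk is sliced into ten strided pieces, and piece i from every chunk is grouped into subset i.

```python
n_subsets = 10
# Four hypothetical chunks of 100 elements each.
original_chunks = [list(range(c * 100, (c + 1) * 100)) for c in range(4)]

subsets = [
    [chunk[i::n_subsets] for chunk in original_chunks]  # one piece per chunk
    for i in range(n_subsets)
]

# Partition count grows from 4 chunks to 4 * 10 = 40 pieces overall.
total_pieces = sum(len(s) for s in subsets)
print(total_pieces)  # 40

# Together the ten subsets still cover every element exactly once.
flat = sorted(x for s in subsets for piece in s for x in piece)
print(flat == list(range(400)))  # True
```

This is the cheap option: no data moves between chunks, at the cost of the 10x partition-count increase mentioned above.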

mrocklin avatar Mar 01 '17 15:03 mrocklin

@moody-marlin for admm, does it matter if data is sorted or randomized?

hussainsultan avatar Mar 14 '17 02:03 hussainsultan

I suspect that randomization matters more if you consider only batches or samples at a time.

mrocklin avatar Mar 14 '17 02:03 mrocklin

@hussainsultan yea, randomization is best -- sorted is probably not a good idea (example: imagine one of the local updates tries to fit a logistic regression on data that has no observed 1 event).
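A small sketch of that failure mode (labels and chunk boundaries made up for illustration): if the data is sorted by label before chunking, a local update can see only one class, and for an intercept-only logistic model the log-likelihood gradient then has the same sign for every finite intercept, so the local fit has no finite optimum.

```python
import math

# Labels sorted before chunking, so the first chunk contains only 0s.
labels = sorted([0] * 50 + [1] * 50)
chunk = labels[:25]  # the data one local update would see
print(set(chunk))    # {0}: no observed 1 events

# Gradient of the log-likelihood w.r.t. the intercept b is sum(y - sigmoid(b)).
# With all-zero labels, every term is -sigmoid(b) < 0, so the gradient is
# negative for every finite b and the intercept is pushed toward -infinity.
def grad(b, ys):
    p = 1.0 / (1.0 + math.exp(-b))
    return sum(y - p for y in ys)

print(all(grad(b, chunk) < 0 for b in (-10.0, 0.0, 10.0)))  # True
```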

cicdw avatar Mar 14 '17 14:03 cicdw

ADMM will still converge if the data is not randomized, but it may take a long time; is that what you meant by not a good idea?

hussainsultan avatar Mar 14 '17 17:03 hussainsultan