dask-glm
WIP: ENH: implement gradient approximation
Optimization can take less time if the gradient is approximated using a subset of the examples. The approach is detailed in "Hybrid deterministic-stochastic methods for data fitting", but more examples are needed to approximate the gradient as the solution is approached. This PR is also motivated by "Gradient diversity empowers distributed machine learning".
This approximates the hybrid approach described above in section 4 (with the batch size growing as in section 5). In section 4, they describe an Armijo line search with a scaled search direction from L-BFGS; this implementation uses the line search but takes plain gradient-descent steps. I implemented this in the function gradient_descent (which they describe). In the paper, they apply their scaling to L-BFGS and Newton (see section 4.1 and equation 1.6).
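A rough sketch of the loop (names like initial_frac and growth are placeholders here, not the actual dask-glm parameters): estimate the gradient on a random subset of rows, choose the step size with an Armijo backtracking line search on that same subset, and grow the subset each iteration.

```python
import numpy as np

def approx_gradient_descent(X, y, grad, loss, max_iter=100,
                            initial_frac=0.01, growth=1.5, c1=1e-4):
    # Sketch only: gradient descent where the gradient is estimated on a
    # growing random batch of rows (cf. sections 4-5 of the paper).
    n, d = X.shape
    beta = np.zeros(d)
    batch_size = max(1, int(initial_frac * n))
    for k in range(max_iter):
        # estimate the gradient from a random subset of the examples
        batch = np.random.permutation(n)[:batch_size]
        Xb, yb = X[batch], y[batch]
        g = grad(Xb, yb, beta)
        # Armijo backtracking line search on the sub-sampled objective
        step, f0 = 1.0, loss(Xb, yb, beta)
        while loss(Xb, yb, beta - step * g) > f0 - c1 * step * g @ g:
            step /= 2
        beta = beta - step * g
        # grow the batch so the gradient approximation tightens near the solution
        batch_size = min(n, int(growth * batch_size))
    return beta
```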
Timing results:
Pointwise loss | Relative error
---|---
*(figure omitted)* | *(figure omitted)*
Time is shown in seconds. Pointwise loss is the objective function, and relative error is LA.norm(beta_hat - beta_star) / LA.norm(beta_star), which is perhaps a truer measure of convergence.
Work to do:
- [ ] Look into approximating gradient in other optimization algs
- [ ] Profile and optimize
- [ ] tune defaults; how should the batch size grow, and what should the initial batch size be?
Gist of timing notebook: https://gist.github.com/stsievert/31f34ee76d7c31bd68acb96ebe8f11ef
I believe re-chunking X is an issue because the batch size grows linearly, as suggested by the graph below:

*(figure omitted)*
I believe the code responsible for this is:
i = np.random.permutation(n).astype(int)   # random ordering of all n examples
batch = list(i[:int(batch_size)])          # first batch_size indices form the batch
X, y = keep['X'][batch], keep['y'][batch]  # fancy-index the stored arrays
Is there any way to optimize this?
My first guess was that we were building onto the same graph over and over again, such as would be the case if we kept modifying X in place
for ...:
    X = X[...]
But we don't seem to be doing this.
The next thing I would try would be profiling using the single-threaded scheduler and a tool like snakeviz. Learning what is taking up all of the time would probably be pretty informative. If this isn't familiar to you then you might want to look at @jcrist's SciPy 2017 talk on profiling in Dask.
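For concreteness, a minimal recipe along those lines (fit.py and run_solver are hypothetical stand-ins for whatever drives the solver): force the single-threaded scheduler so the profile reflects the slicing code rather than the thread pool, dump a cProfile trace, and open it with snakeviz. With a recent dask, dask.config.set(scheduler='synchronous') selects that scheduler.

```python
# hypothetical driver script, not part of dask-glm
import cProfile
import dask
from fit import run_solver  # stand-in for whatever calls gradient_descent

with dask.config.set(scheduler='synchronous'):   # single-threaded scheduler
    profiler = cProfile.Profile()
    profiler.enable()
    run_solver()
    profiler.disable()
    profiler.dump_stats('fit.prof')

# then, from the shell:  snakeviz fit.prof
```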
I can probably take a closer look at this early next week. I'm excited to see work in this direction.
Turns out Array.__getitem__ is the bottleneck, taking about 75% of the time when the batch size is large (at least if I'm interpreting snakeviz right). It remains significant whether I index with NumPy arrays or dask arrays, and most of the time is spent at slicing.py#L58.
Indexing is required to form the batch randomly each iteration. This method relies on X[batch] where batch is a set of random indices, or on something that produces the same result.
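A quick way to see the gap between fancy indexing and a contiguous slice of the same size (the array sizes below are made up, and the timings include graph construction, which is where slicing.py shows up):

```python
import time
import numpy as np
import dask.array as da

X = da.random.random((100_000, 10), chunks=(10_000, 10))
batch = np.random.permutation(100_000)[:5_000]   # random row indices

start = time.time()
X[batch].compute()                               # fancy indexing
print('fancy indexing:', time.time() - start)

start = time.time()
X[:5_000].compute()                              # contiguous slice
print('contiguous slice:', time.time() - start)
```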
It might be faster to index with slices after randomly reshuffling. That is, something like
for k in iterations:
    if k % reshuffle_rate == 0:
        X_keep = shuffle(X_keep)  # likewise for y
    i = np.random.choice(n - batch_size)
    X = X_keep[i:i + batch_size]  # likewise for y
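Filling that in a bit (the sizes and the names n, batch_size, and reshuffle_rate below are made up for illustration): pay for fancy indexing only every reshuffle_rate iterations, and take cheap contiguous slices in between.

```python
import numpy as np
import dask.array as da

n, d = 1000, 5
batch_size, reshuffle_rate, n_iter = 64, 10, 50   # made-up values
X_keep = da.random.random((n, d), chunks=(100, d))
y_keep = da.random.random(n, chunks=100)

for k in range(n_iter):
    if k % reshuffle_rate == 0:
        perm = np.random.permutation(n)            # expensive fancy indexing, done rarely
        X_keep, y_keep = X_keep[perm], y_keep[perm]
    i = np.random.randint(n - batch_size)          # contiguous slices are cheap
    X_batch = X_keep[i:i + batch_size]
    y_batch = y_keep[i:i + batch_size]
    # ... compute the approximate gradient on (X_batch, y_batch) ...
```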
If reshuffle_rate > 1, this does violate an assumption used to prove the convergence of mini-batch SGD (the samples are no longer drawn i.i.d.). I don't think this would be an issue in practice, though. I'll try to implement this in the next couple of days.
I also tried using masked arrays from https://github.com/dask/dask/pull/2301, but they're not the right tool: we will still have to index with some boolean array to get the batch with X[~X.mask].
> look at @jcrist's SciPy 2017 talk on profiling in Dask
D'oh – I was even at that talk!
I think that ideally we would resolve this with something like the following:
In [1]: import dask.array as da
In [2]: X = da.ones((100, 5), chunks=(10, 5))
In [3]: y = da.ones((100,), chunks=(10,))
In [4]: index = da.random.random(size=y.shape, chunks=y.chunks) > 0.9
In [5]: X2 = X[index, :]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-5-9e9ddb5053ea> in <module>()
----> 1 X2 = X[index, :]
/home/mrocklin/workspace/dask/dask/array/core.py in __getitem__(self, index)
1205
1206 if any(isinstance(i, Array) for i in index):
-> 1207 raise NotImplementedError("Indexing with a dask Array")
1208
1209 from .slicing import normalize_index
NotImplementedError: Indexing with a dask Array
Unfortunately, as you can see, this isn't implemented in dask.array (but this could be changed easily).
I think that this will appropriately avoid the costs that you're seeing and will also be somewhat intuitive.
I'll come up with a temporary workaround
We can bypass some of this by just operating on each chunk independently. This example might be of use.
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: X = da.ones((100, 5), chunks=(10, 5))
In [4]: y = da.ones((100,), chunks=(10,))
In [5]: def downsample(chunk, frac):
...: ind = np.random.random(size=len(chunk)) < frac
...: return chunk[ind, ...]
...:
In [6]: X2 = X.map_blocks(downsample, frac=0.1, chunks=(np.nan, X.shape[1]), dtype=X.dtype)
In [7]: y2 = y.map_blocks(downsample, frac=0.1, chunks=(np.nan,), dtype=y.dtype)
In [8]: y2.compute()
Out[8]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Looking again at your implementation with np.random.permutation, though, I now realize that I may not fully understand what your objectives are. Are we downsampling here, or is your desired transformation more complex?
X.map_blocks looks perfect. One small issue is that the batch size ends up random (not deterministic), but this shouldn't be a huge blocker.
> Are we downsampling here or is your desired transformation more complex?
We are downsampling the array (not with some probability but to a known size). The implementation is old; I've pushed some more commits. These implement the idea in https://github.com/dask/dask-glm/pull/60#issuecomment-326743830
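Since the goal is a batch of known size, the downsample helper above could instead take a fixed number of rows from each chunk; take_fixed and n_keep below are hypothetical names, not dask or dask-glm API.

```python
import numpy as np
import dask.array as da

def take_fixed(chunk, n_keep):
    # keep exactly n_keep randomly chosen rows from this chunk
    ind = np.random.choice(len(chunk), size=n_keep, replace=False)
    return chunk[ind, ...]

X = da.ones((100, 5), chunks=(10, 5))
y = da.ones((100,), chunks=(10,))

# 10 chunks x 3 rows each = a batch of exactly 30 rows
X2 = X.map_blocks(take_fixed, n_keep=3, chunks=(3, X.shape[1]), dtype=X.dtype)
y2 = y.map_blocks(take_fixed, n_keep=3, chunks=(3,), dtype=y.dtype)
```

As with the frac-based example, X and y would need the same random indices per chunk (e.g. via a shared seed) to stay aligned.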
I am running into an issue with the line search; I've had cases where it stops too early. I'm going to investigate this more. In Numerical Optimization, they describe this same backtracking line search and say:
> Inequality (3.4) [the same condition used in dask-glm.compute_stepsize] is sometimes called the Armijo condition ... In practice, c1 [aka armijoMult] is chosen to be quite small, say c1 = 10^-4.
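Spelled out in the book's notation, inequality (3.4) requires sufficient decrease along the search direction $p_k$:

$$ f(x_k + \alpha p_k) \le f(x_k) + c_1\,\alpha\,\nabla f(x_k)^T p_k $$

For the gradient-descent direction $p_k = -\nabla f(x_k)$ this becomes $f(x_k - \alpha \nabla f(x_k)) \le f(x_k) - c_1 \alpha \|\nabla f(x_k)\|^2$, so a smaller $c_1$ (i.e. a smaller armijoMult) makes the condition easier to satisfy and larger steps get accepted.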
Currently armijoMult == 0.1, and setting armijoMult == 1e-4 didn't affect speed in my test case (though I'm not sure how much I trust my test). It looks like @moody-marlin originally implemented this in bfcf708cb29ed050570088901d92fa762f605ca7 -- could you detail your choice of armijoMult == 0.1?
@stsievert -- so I don't remember that choice being particularly special. I also started with that book and used 1e-4, but don't remember a specific reason for changing it. Did you experiment with both Logistic and Normal models? Could be that the effect is larger for one vs the other.