dask-glm
WIP: ENH: implement gradient approximation
Optimization can take less time if the gradient is approximated using a subset of the examples. The approach is detailed in "Hybrid deterministic-stochastic methods for data fitting", but more examples are needed to approximate the gradient as the solution is approached. This PR is also motivated by "Gradient diversity empowers distributed machine learning".
This approximates the hybrid approach described above in section 4 (with the batch size growing as in section 5). In section 4, they describe an Armijo line search with a scaled search direction from L-BFGS; this implementation uses the line search but takes plain gradient-descent steps. I implemented this in the function gradient_descent (which they describe). In the paper, they apply their scaling to L-BFGS and Newton (see section 4.1 and equation 1.6).
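A rough sketch of the loop (names like initial_frac and growth are placeholders here, not the actual dask-glm parameters): estimate the gradient on a random subset of rows, choose the step size with an Armijo backtracking line search on that same subset, and grow the subset each iteration.

```python
import numpy as np

def approx_gradient_descent(X, y, grad, loss, max_iter=100,
                            initial_frac=0.01, growth=1.5, c1=1e-4):
    # Sketch only: gradient descent where the gradient is estimated on a
    # growing random batch of rows (cf. sections 4-5 of the paper).
    n, d = X.shape
    beta = np.zeros(d)
    batch_size = max(1, int(initial_frac * n))
    for k in range(max_iter):
        # estimate the gradient from a random subset of the examples
        batch = np.random.permutation(n)[:batch_size]
        Xb, yb = X[batch], y[batch]
        g = grad(Xb, yb, beta)
        # Armijo backtracking line search on the sub-sampled objective
        step, f0 = 1.0, loss(Xb, yb, beta)
        while loss(Xb, yb, beta - step * g) > f0 - c1 * step * g @ g:
            step /= 2
        beta = beta - step * g
        # grow the batch so the gradient approximation tightens near the solution
        batch_size = min(n, int(growth * batch_size))
    return beta
```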
Timing results:
Pointwise loss | Relative error
---|---
*(figure omitted)* | *(figure omitted)*
Time is shown in seconds. Pointwise loss is the objective function, and relative error is LA.norm(beta_hat - beta_star) / LA.norm(beta_star), which is perhaps a truer measure of convergence.
Work to do:
- [ ] Look into approximating gradient in other optimization algs
- [ ] Profile and optimize
- [ ] tune defaults; how should the batch size grow, and what should the initial batch size be?
Gist of timing notebook: https://gist.github.com/stsievert/31f34ee76d7c31bd68acb96ebe8f11ef
I believe re-chunking X is an issue because the batch size grows linearly, as suggested by the graph below:

*(figure omitted)*
I believe the code responsible for this is:
i = np.random.permutation(n).astype(int)   # random ordering of all n examples
batch = list(i[:int(batch_size)])          # first batch_size indices form the batch
X, y = keep['X'][batch], keep['y'][batch]  # fancy-index the stored arrays
Is there any way to optimize this?
My first guess was that we were building onto the same graph over and over again, such as would be the case if we kept modifying X in place
for ...:
    X = X[...]
But we don't seem to be doing this.
The next thing I would try would be profiling using the single-threaded scheduler and a tool like snakeviz. Learning what is taking up all of the time would probably be pretty informative. If this isn't familiar to you then you might want to look at @jcrist's SciPy 2017 talk on profiling in Dask.
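For concreteness, a minimal recipe along those lines (fit.py and run_solver are hypothetical stand-ins for whatever drives the solver): force the single-threaded scheduler so the profile reflects the slicing code rather than the thread pool, dump a cProfile trace, and open it with snakeviz. With a recent dask, dask.config.set(scheduler='synchronous') selects that scheduler.

```python
# hypothetical driver script, not part of dask-glm
import cProfile
import dask
from fit import run_solver  # stand-in for whatever calls gradient_descent

with dask.config.set(scheduler='synchronous'):   # single-threaded scheduler
    profiler = cProfile.Profile()
    profiler.enable()
    run_solver()
    profiler.disable()
    profiler.dump_stats('fit.prof')

# then, from the shell:  snakeviz fit.prof
```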
I can probably take a closer look at this early next week. I'm excited to see work in this direction.
Turns out Array.__getitem__ is the bottleneck, taking about 75% of the time when the batch size is large (at least if I'm interpreting snakeviz right). It remains significant whether I index with NumPy arrays or dask arrays, and most of the time is spent at slicing.py#L58.
Indexing is required to form the batch randomly each iteration. This method relies on X[batch] where batch is a set of random indices, or on something that produces the same result.
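A quick way to see the gap between fancy indexing and a contiguous slice of the same size (the array sizes below are made up, and the timings include graph construction, which is where slicing.py shows up):

```python
import time
import numpy as np
import dask.array as da

X = da.random.random((100_000, 10), chunks=(10_000, 10))
batch = np.random.permutation(100_000)[:5_000]   # random row indices

start = time.time()
X[batch].compute()                               # fancy indexing
print('fancy indexing:', time.time() - start)

start = time.time()
X[:5_000].compute()                              # contiguous slice
print('contiguous slice:', time.time() - start)
```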
It might be faster to index with slices after randomly reshuffling. That is, something like
for k in iterations:
    if k % reshuffle_rate == 0:
        X_keep = shuffle(X_keep)  # likewise for y
    i = np.random.choice(n - batch_size)
    X = X_keep[i:i + batch_size]  # likewise for y
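Filling that in a bit (the sizes and the names n, batch_size, and reshuffle_rate below are made up for illustration): pay for fancy indexing only every reshuffle_rate iterations, and take cheap contiguous slices in between.

```python
import numpy as np
import dask.array as da

n, d = 1000, 5
batch_size, reshuffle_rate, n_iter = 64, 10, 50   # made-up values
X_keep = da.random.random((n, d), chunks=(100, d))
y_keep = da.random.random(n, chunks=100)

for k in range(n_iter):
    if k % reshuffle_rate == 0:
        perm = np.random.permutation(n)            # expensive fancy indexing, done rarely
        X_keep, y_keep = X_keep[perm], y_keep[perm]
    i = np.random.randint(n - batch_size)          # contiguous slices are cheap
    X_batch = X_keep[i:i + batch_size]
    y_batch = y_keep[i:i + batch_size]
    # ... compute the approximate gradient on (X_batch, y_batch) ...
```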
If reshuffle_rate > 1, this does violate an assumption used to prove the convergence of mini-batch SGD (the samples are no longer drawn i.i.d.). I don't think this would be an issue in practice, though. I'll try to implement this in the next couple of days.
I also tried using masked arrays from https://github.com/dask/dask/pull/2301, but they're not the right tool: we will still have to index with some boolean array to get the batch with X[~X.mask].
> look at @jcrist's SciPy 2017 talk on profiling in Dask
D'oh – I was even at that talk!
I think that ideally we would resolve this with something like the following:
In [1]: import dask.array as da
In [2]: X = da.ones((100, 5), chunks=(10, 5))
In [3]: y = da.ones((100,), chunks=(10,))
In [4]: index = da.random.random(size=y.shape, chunks=y.chunks) > 0.9
In [5]: X2 = X[index, :]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-5-9e9ddb5053ea> in <module>()
----> 1 X2 = X[index, :]
/home/mrocklin/workspace/dask/dask/array/core.py in __getitem__(self, index)
1205
1206 if any(isinstance(i, Array) for i in index):
-> 1207 raise NotImplementedError("Indexing with a dask Array")
1208
1209 from .slicing import normalize_index
NotImplementedError: Indexing with a dask Array
Unfortunately, as you can see, this isn't implemented in dask.array (but this could be changed easily).
I think that this will appropriately avoid the costs that you're seeing and will also be somewhat intuitive.
I'll come up with a temporary workaround
We can bypass some of this by just operating on each chunk independently. This example might be of use.
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: X = da.ones((100, 5), chunks=(10, 5))
In [4]: y = da.ones((100,), chunks=(10,))
In [5]: def downsample(chunk, frac):
...: ind = np.random.random(size=len(chunk)) < frac
...: return chunk[ind, ...]
...:
In [6]: X2 = X.map_blocks(downsample, frac=0.1, chunks=(np.nan, X.shape[1]), dtype=X.dtype)
In [7]: y2 = y.map_blocks(downsample, frac=0.1, chunks=(np.nan,), dtype=y.dtype)
In [8]: y2.compute()
Out[8]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Looking again at your implementation with np.random.permutation, though, I now realize that I may not fully understand what your objectives are. Are we downsampling here, or is your desired transformation more complex?
X.map_blocks looks perfect. One small issue is that the batch size ends up random (not deterministic), but this shouldn't be a huge blocker.
> Are we downsampling here or is your desired transformation more complex?
We are downsampling the array (not with some probability but to a known size). The implementation is old; I've pushed some more commits. These implement the idea in https://github.com/dask/dask-glm/pull/60#issuecomment-326743830
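Since the goal is a batch of known size, the downsample helper above could instead take a fixed number of rows from each chunk; take_fixed and n_keep below are hypothetical names, not dask or dask-glm API.

```python
import numpy as np
import dask.array as da

def take_fixed(chunk, n_keep):
    # keep exactly n_keep randomly chosen rows from this chunk
    ind = np.random.choice(len(chunk), size=n_keep, replace=False)
    return chunk[ind, ...]

X = da.ones((100, 5), chunks=(10, 5))
y = da.ones((100,), chunks=(10,))

# 10 chunks x 3 rows each = a batch of exactly 30 rows
X2 = X.map_blocks(take_fixed, n_keep=3, chunks=(3, X.shape[1]), dtype=X.dtype)
y2 = y.map_blocks(take_fixed, n_keep=3, chunks=(3,), dtype=y.dtype)
```

As with the frac-based example, X and y would need the same random indices per chunk (e.g. via a shared seed) to stay aligned.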
I am running into an issue with the line search; I've had cases where it stops too early. I'm going to investigate this more. In Numerical Optimization, they describe this same backtracking line search and say:
> Inequality (3.4) [the same condition used in dask-glm.compute_stepsize] is sometimes called the Armijo condition ... In practice, c1 [aka armijoMult] is chosen to be quite small, say c1 = 10^-4.
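Spelled out in the book's notation, inequality (3.4) requires sufficient decrease along the search direction $p_k$:

$$ f(x_k + \alpha p_k) \le f(x_k) + c_1\,\alpha\,\nabla f(x_k)^T p_k $$

For the gradient-descent direction $p_k = -\nabla f(x_k)$ this becomes $f(x_k - \alpha \nabla f(x_k)) \le f(x_k) - c_1 \alpha \|\nabla f(x_k)\|^2$, so a smaller $c_1$ (i.e. a smaller armijoMult) makes the condition easier to satisfy and larger steps get accepted.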
Currently armijoMult == 0.1, and setting armijoMult == 1e-4 didn't affect speed in my test case (though I'm not sure how much I trust my test). It looks like @moody-marlin originally implemented this in bfcf708cb29ed050570088901d92fa762f605ca7 -- could you detail your choice of armijoMult == 0.1?
@stsievert -- so I don't remember that choice being particularly special. I also started with that book and used 1e-4, but don't remember a specific reason for changing it. Did you experiment with both Logistic and Normal models? Could be that the effect is larger for one vs the other.