stumpy
stumpy copied to clipboard
Speeding up `scrump`
I have an idea that might speed up scrump when its parameter pre_scrump is set to True. (After some thought, I concluded that this reduction in computing time would be probably insignificant. However, I am going to share my idea here anyway)
When pre_scrump=True, we calculate an approximate matrix profile by going through some samples (indices):
https://github.com/TDAmeritrade/stumpy/blob/c432db42267f3f88698c4dd4cee9128bc01d563f/stumpy/scrump.py#L346
(note: l = n - m + 1)
And, for each sample i:
- we calculate its distance profile.
- we also use QT to calculate the distance between
S_(i + j)andS_(nn_i + j)forj in range(1, s) - we also use QT to calculate the distance between
S_(i - j)andS_(nn_i - j)forj in range(1, s)
So, of Combination(l, 2) cells of distance matrix (in self-join), we roughly calculate:
- $\frac{l}{s} \times l$ (regarding distance profile)
- $\frac{l}{s} \times s$ (regarding move forward)
- $\frac{l}{s} \times s$ (regarding move backward)
So, at the end of prescrump process, we already calculated roughly $\frac{1}{s} l^{2} + 2l$ elements of distance matrix. I think if we assume l is large, we can then say:
- we need to calculate $\frac{1}{2} l^{2}$ cells to obtain the full distance matrix (in self-join)
- At the end of
prescrump, we already calculated $\frac{1}{s} l^{2}$ cells of them.
So, if we avoid re-calculating $\frac{1}{s} l^{2}$ number of distances throughtout .update() process, then:
$$ \frac{\frac{1}{s} l^{2}}{\frac{1}{2} l^{2}} = \frac{2}{s}$$
So, if l = 100_000 and s = 100 (0.001 of l), we are reducing the computation time by 2% (?!). This is not just for a single update but rather for several calls of update that complete the distance matrix. So, I think the improvement is probably insignificant for a single call of update method.
I am not sure about my rough calculation though :)
I think keeping track of what pair-wise distances have been calculated can also help us avoid the duplicate issue that might appear in the top-k matrix profiles as discussed in PR #595.