stumpy icon indicating copy to clipboard operation
stumpy copied to clipboard

Speeding up `scrump`

Open NimaSarajpoor opened this issue 3 years ago • 0 comments

I have an idea that might speed up scrump when its parameter pre_scrump is set to True. (After some thought, I concluded that this reduction in computing time would be probably insignificant. However, I am going to share my idea here anyway)

When pre_scrump=True, we calculate an approximate matrix profile by going through some samples (indices): https://github.com/TDAmeritrade/stumpy/blob/c432db42267f3f88698c4dd4cee9128bc01d563f/stumpy/scrump.py#L346

(note: l = n - m + 1)

And, for each sample i:

  • we calculate its distance profile.
  • we also use QT to calculate the distance between S_(i + j) and S_(nn_i + j) for j in range(1, s)
  • we also use QT to calculate the distance between S_(i - j) and S_(nn_i - j) for j in range(1, s)

So, of Combination(l, 2) cells of distance matrix (in self-join), we roughly calculate:

  • $\frac{l}{s} \times l$ (regarding distance profile)
  • $\frac{l}{s} \times s$ (regarding move forward)
  • $\frac{l}{s} \times s$ (regarding move backward)

So, at the end of prescrump process, we already calculated roughly $\frac{1}{s} l^{2} + 2l$ elements of distance matrix. I think if we assume l is large, we can then say:

  • we need to calculate $\frac{1}{2} l^{2}$ cells to obtain the full distance matrix (in self-join)
  • At the end of prescrump, we already calculated $\frac{1}{s} l^{2}$ cells of them.

So, if we avoid re-calculating $\frac{1}{s} l^{2}$ number of distances throughtout .update() process, then:

$$ \frac{\frac{1}{s} l^{2}}{\frac{1}{2} l^{2}} = \frac{2}{s}$$

So, if l = 100_000 and s = 100 (0.001 of l), we are reducing the computation time by 2% (?!). This is not just for a single update but rather for several calls of update that complete the distance matrix. So, I think the improvement is probably insignificant for a single call of update method.

I am not sure about my rough calculation though :)


I think keeping track of what pair-wise distances have been calculated can also help us avoid the duplicate issue that might appear in the top-k matrix profiles as discussed in PR #595.

NimaSarajpoor avatar Jul 15 '22 21:07 NimaSarajpoor