Fix Precision in Unit Test
This unit test causes a failure:
import numpy as np
import numpy.testing as npt
import pandas as pd

import naive
from stumpy import stump, config


def test_stump_identical_subsequence_self_join():
    seed = 27343
    np.random.seed(seed)
    identical = np.random.rand(8)
    T_A = np.random.rand(20)
    T_A[1 : 1 + identical.shape[0]] = identical
    T_A[11 : 11 + identical.shape[0]] = identical

    m = 3
    zone = int(np.ceil(m / 4))
    ref_mp = naive.stump(T_A, m, exclusion_zone=zone, row_wise=True)
    comp_mp = stump(T_A, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)
    npt.assert_almost_equal(
        ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
    )  # ignore indices

    comp_mp = stump(pd.Series(T_A), m, ignore_trivial=True)
    naive.replace_inf(comp_mp)
    npt.assert_almost_equal(
        ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
    )  # ignore indices
@seanlaw
I think I found the source of the error. According to my investigation, the problem is the precision of $\rho$. The discrepancy is at indices 2 and 12. In fact, S1 = T[2 : 2 + m] and S2 = T[12 : 12 + m] are exactly the same, so the $\rho$ between them must be 1.
However, according to the calculation in stumpy.stump, the highest value in $\rho[:, 2, 0]$ (if you print it at that line of code) is 0.9999999999483034.
idx = 2  # the start index of the subsequence of interest
arg = np.argmax(ρ[:, idx, 0])

>>> I[arg, 2, 0]
12  # this is the NN of the subsequence at index idx=2
>>> ρ[arg, idx, 0]
0.9999999999483034
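For context, converting that $\rho$ back to a z-normalized distance via d = sqrt(2 * m * (1 - ρ)) shows why a value that falls just short of 1 is enough to trip the test above (a small sketch of my own; it assumes config.STUMPY_TEST_PRECISION is 5):

```python
import numpy as np

m = 3
rho = 0.9999999999483034  # the ρ[arg, idx, 0] value reported above

# A perfect match (ρ = 1) would give d = 0, which is what naive.stump reports
d = np.sqrt(2 * m * (1.0 - rho))
print(d)  # ~1.76e-05, just above the ~1.5e-05 tolerance implied by decimal=5
```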
A potential solution:
~We may prevent this by creating config.STUMPY_RHO_PRECISION and using it in the line of code I mentioned above before comparing the $\rho$ values, calculated in each thread, with each other (and then setting the value to 1 if it is too close to 1 (?!))~
@seanlaw
I do not know whether this discrepancy (due to numerical error) can happen in other cases as well, e.g. when the true (exact) $\rho$ is not 1 (the subsequences are not identical) but we still get a difference between stumpy.stump and naive.py/stump due to numerical error.
How do we handle the identical case without using something like config.STUMPY_RHO_PRECISION? I mean, if we add STUMPY_RHO_PRECISION to config.py and use it to replace close-to-1 $\rho$ values with 1, it may wrongly impact other cases.
So, how about just handling identical cases? The following solution works for me, BUT it is not efficient for massive data, say 10M data points. I think larger data has a higher chance of hash collisions and would thus increase the computing time (?)
# in core.py
def preprocess_diagonal(T, m):
    ...
    # add a new part to find duplicates
    identical = {}
    for i in range(T.shape[0] - m + 1):
        identical[tuple(T[i : i + m])] = []
    for i in range(T.shape[0] - m + 1):
        identical[tuple(T[i : i + m])].append(i)

    # O(n) time complexity
    T_subseq_IdenticalCode = np.full(T.shape[0] - m + 1, -1, np.int64)
    for i, (key, item) in enumerate(identical.items()):
        for idx in item:
            T_subseq_IdenticalCode[idx] = i

    # return T_subseq_IdenticalCode in addition to the other outputs
Then, in stumpy._compute_diagonal, we check whether two subsequences are identical:

# in stumpy._compute_diagonal
if T_A_subseq_IdenticalCode[i + k] == T_A_subseq_IdenticalCode[i]:
    pearson = 1.0
# or we can do it at the end to correct the values of the matrix profile
An alternative is to let collisions happen, to make the process faster, and check them later:

# I think this is O(n) time complexity
l = T.shape[0] - m + 1
subsequence_hash = np.empty(l, dtype=np.int64)
for i in range(l):
    subsequence_hash[i] = hash(tuple(T[i : i + m]))
# there is a chance of hash collisions, but that is okay as we will check for them later
Then, we can correct the matrix profile (i.e., the matrix profile / left matrix profile / right matrix profile) at the end:

# I think this is O(n) time complexity, assuming that the z-normalized distance
# between two subsequences is O(1)
for row in range(l):
    idx = row
    for col in range(3):
        idx_nn = I[row, col]  # use the matrix profile index to find the NN of idx
        if subsequence_hash[idx] == subsequence_hash[idx_nn]:
            # the two subsequences might be identical since their hashes are the same
            if np.all(Tznorm[idx : idx + m] == Tznorm[idx_nn : idx_nn + m]):
                # the two subsequences are actually identical
                P[row, col] = 0
Is it worth it? We are increasing the time spent in the pre-processing and post-processing steps. (I believe both of these steps can be parallelized via prange.)
@NimaSarajpoor I don't think hashing is a good idea, as it is slow, and pushing subsequences into a dict may use a lot of memory as the length of the time series increases. Additionally, we'd need to account for z-normalization in these cases. Just because two subsequences aren't identical doesn't mean that their z-normalized subsequences aren't identical.
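To illustrate that last point with a tiny made-up example (not from the issue): two raw subsequences that differ can still be identical after z-normalization, so a purely raw-value duplicate check would miss them:

```python
import numpy as np

# S2 is just S1 scaled and shifted, so their raw values differ...
S1 = np.array([0.0, 1.0, 2.0, 1.0])
S2 = np.array([10.0, 30.0, 50.0, 30.0])

z1 = (S1 - S1.mean()) / S1.std()
z2 = (S2 - S2.mean()) / S2.std()

print(np.array_equal(S1, S2))  # False
print(np.allclose(z1, z2))     # True: identical after z-normalization
```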
I don't know if it's worth addressing or what the best option is. Setting a config.STUMPY_RHO_PRECISION (or some other name like config.STUMPY_PERFECT_CORRELATION = 0.9999999) might not be the worst idea. At least people can set it to 1.0 if they dislike the precision that we've chosen, and then we'd only need to check if $\rho$ > config.STUMPY_PERFECT_CORRELATION and shift it to 1.0, or something.
Alternatively, I wonder if we can check $\rho$ and, if it is above config.STUMPY_PERFECT_CORRELATION, compute it directly using the dot product instead of using the distance from the previous diagonal.
I was wondering whether what you have in mind is to check each single $\rho$ that is being calculated in stumpy.stump._compute_diagonal. If so, then the alternative approach

> Alternatively, I wonder if we can check $\rho$ and, if it is above config.STUMPY_PERFECT_CORRELATION, compute it directly using the dot product instead of using the distance from the previous diagonal.

might (hugely?) increase the computing time if the data mostly shows periodic behavior (because we would have many identical subsequences).
On the other hand, I think we also want to make sure the values in the final matrix profile are precise. I mean, if we say idx_nn = I[idx], then dist(T[idx : idx + m], T[idx_nn : idx_nn + m]) should be exact.
So, I think we can just re-calculate the distance at the end, after checking $\rho$ > config.STUMPY_PERFECT_CORRELATION.
(I know it would be better to re-calculate it throughout stumpy.stump._compute_diagonal, but I think that is too much for a rare case like this and, as I mentioned earlier, we may increase the computing time when there are many identical subsequences. So, just correcting the matrix profile at the end should be okay.)
I think the only thing we'd need to change is to add a line after L188 in _compute_diagonal:
https://github.com/TDAmeritrade/stumpy/blob/576de379994896063be71d264e48a08ae17476cf/stumpy/stump.py#L183-L188
if T_B_subseq_isfinite[i + k] and T_A_subseq_isfinite[i]:
    # Neither subsequence contains NaNs
    if T_B_subseq_isconstant[i + k] or T_A_subseq_isconstant[i]:
        pearson = 0.5
    else:
        pearson = cov * Σ_T_inverse[i + k] * σ_Q_inverse[i]
        if pearson > config.STUMPY_PERFECT_CORRELATION:
            pearson = 1.0
I think this is the simplest option and should incur a negligible cost, as there is one if comparison (which should be cheap??) and then a value is set (not computed).
This should be simple, easy to maintain, and easy to interpret, and if users don't like it, then config.STUMPY_PERFECT_CORRELATION is easy to understand.
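As a rough back-of-the-envelope check (my own numbers, not from the discussion): with the proposed config.STUMPY_PERFECT_CORRELATION = 0.9999999, the largest z-normalized distance that snapping could silently turn into 0 grows with the window size m as sqrt(2 * m * (1 - 0.9999999)):

```python
import numpy as np

STUMPY_PERFECT_CORRELATION = 0.9999999  # proposed value

for m in (3, 50, 500):
    # largest distance that the ρ-snapping could turn into an exact 0
    print(m, np.sqrt(2 * m * (1 - STUMPY_PERFECT_CORRELATION)))
# 3   -> ~0.00077
# 50  -> ~0.0032
# 500 -> ~0.01
```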
> I think we also want to make sure the values in the final matrix profile are precise

I don't know. Unfortunately, I don't think we'll get much more "precise". It's a trade-off between speed and a slight loss in precision (but not much, actually). In the worst case, if you have a long sine wave (time series) OR each subsequence has a perfect match, you'll end up computing all of the pairwise distances twice (once during _stump and then once during post-processing, if you do it your way!). I believe that my example above should be reasonably efficient, but one should verify.
As I think about it more, if we were to do this as a post-processing step, we would write a separate function from stumpy.stump. I would not add the post-processing step to the existing function. I think my proposal above is the way to go.
> I think that this is the simplest and should be a negligible cost as there is an if comparison (which should be cheap??) and then a value is set (not computed).

Correct. As you said, I think that approach is better and sufficient for such a rare case (you do not have to recalculate distances, and you modify the values throughout the process).
As a cross reference to #655, this related test is also failing:

def test_stumpi_identical_subsequence_self_join_egress():
    m = 3
    seed = np.random.randint(100000)
    seed = 84451
    np.random.seed(seed)
@seanlaw I think you may want to see this! Apparently, it is not just about identical subsequences. So, I created a test where I randomly scale up (× 1000) or scale down (× 0.001) chunks of the time series.
Here is the test function:
def test_stump_imprecision():
    seed = np.random.randint(100000)
    np.random.seed(seed)
    T = np.random.rand(64)

    # randomly choose some chunks of the time series and scale them up or down
    T_mask = np.full(len(T), 0, dtype=bool)
    k = np.random.randint(len(T) // 4, len(T))
    IDX = np.random.choice(np.arange(len(T)), k, replace=False)
    T_mask[IDX] = True
    slices = core._get_mask_slices(T_mask)
    scale = np.random.choice(np.array([0.001, 1000]), len(slices), replace=True)
    for i, (start, stop) in enumerate(slices):
        T[start:stop] = scale[i] * T[start:stop]

    m = 3
    zone = int(np.ceil(m / 4))
    ref_mp = naive.stump(T, m, exclusion_zone=zone)
    comp_mp = stump(T, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)

    ref_P = ref_mp[:, 0].astype(np.float64)
    comp_P = comp_mp[:, 0].astype(np.float64)
    npt.assert_almost_equal(ref_P, comp_P)
And, here is the error:
Mismatched elements: 37 / 62 (59.7%)
E Max absolute difference: 0.22458425
E Max relative difference: 0.78388078
E x: array([2.6611068e-02, 7.4557520e-05, 5.2291006e-02, 3.3167208e-02,
E 1.0766382e-04, 4.5937140e-02, 4.4588591e-02, 3.0992438e-04,
E 1.8665326e-01, 4.5937140e-02, 7.9446847e-02, 1.1810141e-01,...
E y: array([2.6611068e-02, 8.7222889e-05, 7.4846320e-02, 3.3167208e-02,
E 1.3051016e-04, 4.5937140e-02, 7.4861289e-02, 3.0992438e-04,
E 1.8665326e-01, 4.5937140e-02, 7.9446847e-02, 1.1810142e-01,...
Hmm, so what do you think is the issue? I'm assuming that naive.stump is "correct"?
> I'm assuming that naive.stump is "correct"?

Yeah, I think so. I took a look at naive.stump and it just z-normalizes each individual subsequence and then calculates the pairwise distances. So, I assume naive.stump should be correct.
According to my investigation, I think one problem is imprecision in the calculation of the sliding standard deviation Σ_T, particularly when we have large numbers. (Just for the record: even taking the std of each subsequence directly (instead of the rolling std) only reduces the error significantly; we still have error!)
One way to make it a little better is standardization of the whole time series T (so we scale the std of each subsequence by std(T)), and then rescale back, as in the sketch below.
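A minimal sketch of that first idea, assuming we simply divide T by its global std before taking the rolling std and multiply back afterwards (the helper name is mine; core.rolling_window is used only for illustration):

```python
import numpy as np
from stumpy import core

def rolling_std_via_global_scaling(T, m):
    # Scale the whole series by its global std so the per-window values are
    # closer to unit scale, take the rolling std there, then rescale back.
    global_std = np.nanstd(T)
    sigma_scaled = np.nanstd(core.rolling_window(T / global_std, m), axis=1)
    return sigma_scaled * global_std
```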
Another way that I think should work (I tested it to some extent) is to refine the Welford nanvar/nanstd. We can do this by calculating the rolling nanvar/nanstd of the subsequences, each scaled by its min-max range, and then scaling back at the end.
# in stumpy/core.py
@njit
def rolling_minmax_scale(a, m):
    """
    Docstring
    """
    l = len(a) - m + 1
    scale = np.empty(l, dtype=np.float64)
    for i in range(l):
        scale[i] = np.max(a[i : i + m]) - np.min(a[i : i + m])
        if scale[i] == 0:
            scale[i] = 1

    return scale
Then, we use this to modify Welford as follows:
def _welford_nanvar(a, w, a_subseq_isfinite):
    """
    Compute the rolling variance
    """
    all_variances = np.empty(a.shape[0] - w + 1, dtype=np.float64)
    prev_mean = 0.0
    prev_var = 0.0
    scale = rolling_minmax_scale(a, w)  # NEW

    for start_idx in range(a.shape[0] - w + 1):
        prev_start_idx = start_idx - 1
        stop_idx = start_idx + w  # Exclusive index value
        last_idx = start_idx + w - 1  # Last inclusive index value

        if (
            start_idx == 0
            or not a_subseq_isfinite[prev_start_idx]
            or not a_subseq_isfinite[start_idx]
        ):
            curr_mean = np.nanmean(a[start_idx:stop_idx])
            curr_var = np.nanvar(a[start_idx:stop_idx] / scale[start_idx])  # MODIFIED: minmax-scaled
        else:
            curr_mean = prev_mean + (a[last_idx] - a[prev_start_idx]) / w
            curr_var = (
                prev_var * np.square(scale[prev_start_idx] / scale[start_idx])
                + ((a[last_idx] - a[prev_start_idx]) / scale[start_idx])
                * (
                    (a[last_idx] - curr_mean + a[prev_start_idx] - prev_mean)
                    / scale[start_idx]
                )
                / w
            )  # MODIFIED: undo the previous scale and apply the new one, so curr_var is the var of the minmax-scaled subseq

        all_variances[start_idx] = curr_var
        prev_mean = curr_mean
        prev_var = curr_var

    return all_variances  # variance of each subsequence, where each subseq is scaled by its minmax range
And finally, reverse everything back to normal at the end:

def welford_nanstd(a, w=None):
    """
    This is a convenience wrapper around `welford_nanvar`.
    """
    if w is None:
        w = a.shape[0]
    a_subseq_isfinite = rolling_isfinite(a, w)
    scale = rolling_minmax_scale(a, w)  # NEW
    out = np.sqrt(np.clip(welford_nanvar(a, w), a_min=0, a_max=None))
    out[a_subseq_isfinite] = scale[a_subseq_isfinite] * out[a_subseq_isfinite]  # scale back to normal
    return out
For instance, comparing the proposed approach with the existing version:
- the imprecision test mentioned above, test_stump_imprecision(seed=0) --> existing version: max abs diff 0.36033988; proposed Welford approach: max abs diff 0.09287705
- the test function for the identical case (but without scaling random chunks by 1000 or 0.001) --> existing version: resolved by the new config variable; proposed Welford approach: no need to add the config variable. It is all good!
@seanlaw In another test that I designed, the proposed Welford performed worse, so you may ignore the idea (I just did not delete it, for our future reference).
Yeah, I'm pretty confident that Welford is about as precise as we're going to get and it has the best speed to precision tradeoff. 😢
@seanlaw As you mentioned, it seems that Welford is okay. I checked its test function and tried it on the time series from the test functions.
I think I found the issue. Note that in stumpy/stump.py, we do:
https://github.com/TDAmeritrade/stumpy/blob/d57169f3a8c3a1571c12d84f490e8c7e511b224e/stumpy/stump.py#L188
And it seems that high precision in the std does not guarantee high precision in its inverse when the std is small.
Example:
>>> T
array([2.42145823e-01, 2.88656701e-04, 9.34900364e-04, 9.34545813e-04,
       9.36884255e-04, 3.72888654e-04, 3.99196666e-04, 9.97234658e-04,
       2.04350128e-04, 8.63723288e-01, 1.86409747e-01, 2.88656701e+02,
       9.34900364e+02, 9.34545813e+02, 9.36884255e+02, 3.72888654e+02,
       3.99196666e+02, 9.97234658e+02, 2.04350128e+02, 6.26529775e-01])
Let's take a look at the std of this input T, with m = 3, at index 2 (i.e., for the subsequence T[2 : 5]):

# ref refers to the exact std
# comp refers to the Welford std
np.abs(ref_std[2] - comp_std[2]) = 1.2878556459058026e-12
np.abs(ref_std_inv[2] - comp_std_inv[2]) = 1.2162515548989177
And, if we compare the matrix profiles ref_P (from naive.stump) and comp_P (from stumpy.stump), we can see that npt.assert_almost_equal(...) fails at indices 2 and 12.
(Indices 2 and 12 are in fact the start indices of two identical subsequences, where one is scaled down by 0.001 (the subsequence at index 2) and the other is scaled up by 1000 (the subsequence at index 12).)
Ohh, wow! That's REALLY bad but also fun at the same time 😭
Yeah... I also tried many different ways to resolve this. I have not been successful so far.
FYI:
> (Indices 2 and 12 are in fact the start indices of two identical subsequences, where one is scaled down by 0.001 (the subsequence at index 2) and the other is scaled up by 1000 (the subsequence at index 12).)

If we scale up (down) by 10000 (0.0001), then np.abs(ref_std_inv[2] - comp_std_inv[2]) will be around 1000! (Btw, we should also consider the impact of cov in calculating pearson, as shown in the line of code provided in my previous comment.)
The exact distance between subseq_2 and subseq_12 is 1e-16, while stump (which is based on pearson) gives about 0.02.
> np.abs(ref_std[2] - comp_std[2]) = 1.2878556459058026e-12
> np.abs(ref_std_inv[2] - comp_std_inv[2]) = 1.2162515548989177

What happens when you print(1.0 / ref_std[2]) and print(1.0 / comp_std[2])?
Can you also print ref_std[2] and comp_std[2]? Or give me the seed to generate T precisely. I tried to copy/paste your T above but am not seeing the issue.
> What happens when you print(1.0 / ref_std[2]) and print(1.0 / comp_std[2])?

>>> 1.0 / ref_std[2]
971803.3300204428
>>> 1.0 / comp_std[2]
971802.1137688879
In case it matters:
Since we scale the whole identical sequence by 0.001, we are scaling down its standard deviation as well, so std(scaled_seq) is equal to 0.001 * std(seq). So, we are dealing with a smaller std, and hence its inverse becomes large. Let's say the actual value is $a$ and the value with imprecision is $a + \epsilon$; then the imprecision of the inverse is:
$\frac{1}{a} - \frac{1}{a + \epsilon} = \frac{\epsilon}{a(a + \epsilon)}$
So, if $a$ is itself small, the imprecision of the inverse becomes large.
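Plugging the numbers from above into that formula (an illustrative check, taking $a$ as the exact std, i.e. 1 / 971803.33, and $\epsilon$ as the reported std error):

```python
a = 1.0 / 971803.3300204428    # the exact (small) std from the example above
eps = 1.2878556459058026e-12   # absolute error of the Welford std

# error propagated to 1 / std
print(eps / (a * (a + eps)))   # ~1.216, matching np.abs(ref_std_inv[2] - comp_std_inv[2])
```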
Can you tell me what seed and code you are using to generate T?
I think this should give you the T you are looking for.

def identical_cases_with_scale():
    m = 3
    zone = int(np.ceil(m / 4))

    seed = 27343
    np.random.seed(seed)
    identical = np.random.rand(8)
    T_A = np.random.rand(20)
    T_A[1 : 1 + identical.shape[0]] = identical * 0.001
    T_A[11 : 11 + identical.shape[0]] = identical * 1000

    ref_mp = stump(T_A, m, exclusion_zone=zone, row_wise=True)
    comp_mp = stumpy.stump(T_A, m, ignore_trivial=True)
    replace_inf(ref_mp)
    replace_inf(comp_mp)

    return T_A, ref_mp, comp_mp
Hmm, I wonder, in compute_mean_std, whether AFTER we compute Σ_T (stddev) it would be worth it to:
- Check which stddev values are below some threshold
- Correct those stddevs

The reason why we use Welford is its speed and reasonable precision. As you identified, for the most part, what Welford produces is fine. However, for really small numbers, it's insufficient. And so maybe we add a correction where we use np.std for those specific windows?
Any thoughts?
So, after line #869 in core.compute_mean_std:
https://github.com/TDAmeritrade/stumpy/blob/67b822c9a6ec4a5128ad10c5bff287f0524109b5/stumpy/core.py#L860-L876
we can add something like this:

idx = np.where(Σ_T < config.STUMPY_STDDEV_CORRECTION_THRESHOLD)[0]  # config.STUMPY_STDDEV_CORRECTION_THRESHOLD = 1e-5
if len(idx) > 0:
    Σ_T[idx] = np.std(core.rolling_window(T, m)[idx], axis=1)
This might not work for the multi-dimensional case but I hope you can get my point
> This might not work for the multi-dimensional case but I hope you can get my point

After I saw your suggestion, I tried it by calculating an upper bound for the std, std_ub, in welford, and then I checked whether std_ub is below the threshold 1e-5. In that case, I simply did np.nanvar(..). This fixes the issue, and the test function with scaled identical sequences now passes. A sketch of this kind of correction is shown below.
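For concreteness, here is one possible shape of such a correction as a standalone post-processing step (my own sketch: the function name, the Popoviciu-style upper bound std <= (max - min) / 2, and the 1e-5 threshold are illustrative; the actual std_ub inside welford may be computed differently):

```python
import numpy as np
from stumpy import core

def correct_small_stddev(T, m, Σ_T, threshold=1e-5):
    windows = core.rolling_window(T, m)
    # The std of a window can never exceed half its range, so a tiny range
    # flags windows where the rolling/Welford std may be unreliable.
    std_ub = (np.max(windows, axis=1) - np.min(windows, axis=1)) / 2.0
    idx = np.flatnonzero(std_ub < threshold)
    if idx.size:
        # Recompute only these (rare) windows exactly, which is slower but safe
        Σ_T[idx] = np.nanstd(windows[idx], axis=1)
    return Σ_T
```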
Now, there is one more thing we might be interested in taking care of. See the test function below:
def test_stump_volatile_case1():
    seed = 0
    np.random.seed(seed)
    T = np.random.rand(64)

    T_mask = np.full(len(T), 0, dtype=bool)
    k = np.random.randint(len(T) // 4, len(T))
    IDX = np.random.choice(np.arange(len(T)), k, replace=False)
    T_mask[IDX] = True
    slices = core._get_mask_slices(T_mask)
    scale = np.random.choice(np.array([0.001, 1000]), len(slices), replace=True)
    for i, (start, stop) in enumerate(slices):
        T[start:stop] = scale[i] * T[start:stop]

    m = 3
    zone = int(np.ceil(m / 4))
    ref_mp = naive.stump(T, m, exclusion_zone=zone)
    comp_mp = stump(T, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)

    ref_P = ref_mp[:, 0].astype(np.float64)
    comp_P = comp_mp[:, 0].astype(np.float64)
    npt.assert_almost_equal(ref_P, comp_P)
Here, I randomly scaled the values up or down.
Btw, I just realized a cleaner way would have been to do:

scale = np.random.choice(np.array([0.001, 1, 1000]), len(T), replace=True)
T = T * scale
Okay, let's get back to our main topic. This test function still fails, even though I used the exact mean and std, i.e., np.nanmean(rolling_window(...), axis=1) and np.sqrt(np.nanvar(rolling_window(...), axis=1)). I wanted to make sure I was not introducing any inaccuracy from these two parameters.
Here is the error:
Mismatched elements: 11 / 62 (17.7%)
E Max absolute difference: 0.03931141
E Max relative difference: 0.65784229
This shows that the pearson approach, i.e., using the rolling cov, is probably another source of imprecision. I haven't investigated this latter test case, so I don't know what is happening behind the scenes!
Side note:
It is worth noting that the matrix profile indices are the same in this case. (So, we can keep in mind that our last resort can be recalculating distances based on the nearest-neighbor indices provided in I, similar to what we did in stumpi to retrieve the left matrix profile values. A sketch of such a post-processing step is shown below.)
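As a reference for that last-resort idea, a post-processing step along those lines might look like the following (a hypothetical helper, not STUMPY API; it assumes I holds the nearest-neighbor index of each subsequence and recomputes each matrix profile value from scratch with core.z_norm):

```python
import numpy as np
from stumpy import core

def recompute_P_from_I(T, m, I):
    l = T.shape[0] - m + 1
    P = np.full(l, np.inf, dtype=np.float64)
    for i in range(l):
        j = I[i]
        if j < 0:  # no valid nearest neighbor recorded
            continue
        # Exact z-normalized Euclidean distance between the subsequence and its NN
        Q = core.z_norm(T[i : i + m])
        R = core.z_norm(T[j : j + m])
        P[i] = np.linalg.norm(Q - R)
    return P
```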
@seanlaw
Btw, I would like to mention that the current solution, i.e., checking the std and re-calculating its exact value, may not be efficient if ALL VALUES of the time series are small, e.g. T = [0.001, 0.003, -0.002, 0, 0.001, ...]. In this case, we would end up calculating the exact value of the std for all subsequences (so, not fast)! I know this is a very rare case, but I just wanted to inform you of it. I have some ideas, but they are not well developed :)
Yeah, I don't think we should move away from Welford because it is not only reasonably precise but it also doesn't use up all of our memory. With nanstd, it can use a ton of memory when we have a sufficiently large T. At some point, we just can't fix all precision issues. We can try our best and leave it at that. We also have to ask ourselves how likely we are to come across strange time series data like the contrived examples that we have.
Alternatively, can we somehow avoid the inverse and still use cov?
> We also have to ask ourselves how likely we are to come across strange time series data like the contrived examples that we have.

Correct! I think the current version is already good and, because of that, we should just try to improve it without drastically sacrificing speed.

> Can we somehow avoid the inverse

Yeah... I think that might help if it is possible. I thought about it and even tried to re-arrange the equations in the code, but that did not help. I will think about it again.
Update
> > This might not work for the multi-dimensional case but I hope you can get my point
>
> After I saw your suggestion, I tried it by calculating an upper bound for the std, std_ub, in welford, and then I checked whether std_ub is below the threshold 1e-5. In that case, I simply did np.nanvar(..). This fixes the issue, and the test function with scaled identical sequences now passes. Now, there is one more thing we might be interested in taking care of. See the test function below:
>
> def test_stump_volatile_case1():

The test function def test_stump_volatile_case1() fails even when the mean and std are calculated naively AND fastmath is turned off.
> The test function def test_stump_volatile_case1() fails even when the mean and std are calculated naively AND fastmath is turned off.

Can you help me understand what that means and if there is anything we could do?

> Can you help me understand what that means and if there is anything we could do?

So, I set fastmath=False to remove its impact on imprecision, if there is any. I also avoided using the rolling approach for calculating the "mean" and "std" of the subsequences. And... I still got the error.
I am still investigating this. I just wanted to update you and seek your opinion.