Fix Precision in Unit Test
This unit test causes a failure:
import numpy as np
import numpy.testing as npt
import pandas as pd

import naive
from stumpy import stump, config


def test_stump_identical_subsequence_self_join():
    seed = 27343
    np.random.seed(seed)
    identical = np.random.rand(8)
    T_A = np.random.rand(20)
    T_A[1 : 1 + identical.shape[0]] = identical
    T_A[11 : 11 + identical.shape[0]] = identical

    m = 3
    zone = int(np.ceil(m / 4))
    ref_mp = naive.stump(T_A, m, exclusion_zone=zone, row_wise=True)
    comp_mp = stump(T_A, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)
    npt.assert_almost_equal(
        ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
    )  # ignore indices

    comp_mp = stump(pd.Series(T_A), m, ignore_trivial=True)
    naive.replace_inf(comp_mp)
    npt.assert_almost_equal(
        ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
    )  # ignore indices
@seanlaw
I think I found the source of the error. According to my investigation, the problem is the precision of $\rho$. The discrepancy is at indices 2 and 12. In fact, S1 = T[2 : 2 + m] and S2 = T[12 : 12 + m] are exactly the same, so the $\rho$ between them must be 1.
However, according to the calculation in stumpy.stump, the highest value in $\rho[:, 2, 0]$ (if you print it at that line of code) is 0.9999999999483034.
idx = 2  # the start index of the subsequence of interest
arg = np.argmax(ρ[:, idx, 0])

>>> I[arg, 2, 0]
12  # this is the NN of the subsequence at index idx=2
>>> ρ[arg, idx, 0]
0.9999999999483034
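For context, converting that $\rho$ back to a z-normalized distance via d = sqrt(2 * m * (1 - ρ)) shows why a value that falls just short of 1 is enough to trip the test above (a small sketch of my own; it assumes config.STUMPY_TEST_PRECISION is 5):

```python
import numpy as np

m = 3
rho = 0.9999999999483034  # the ρ[arg, idx, 0] value reported above

# A perfect match (ρ = 1) would give d = 0, which is what naive.stump reports
d = np.sqrt(2 * m * (1.0 - rho))
print(d)  # ~1.76e-05, just above the ~1.5e-05 tolerance implied by decimal=5
```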
A potential solution:
~We may prevent this by creating config.STUMPY_RHO_PRECISION and using it in the line of code I mentioned above before comparing the $\rho$ values, calculated in each thread, with each other (and then setting the value to 1 if it is too close to 1 (?!))~
@seanlaw
I do not know whether this discrepancy (due to numerical error) can happen in other cases as well, e.g. when the true (exact) $\rho$ is not 1 (the subsequences are not identical) but we still get a difference between stumpy.stump and naive.py/stump due to numerical error.
How do we handle the identical case without using something like config.STUMPY_RHO_PRECISION? I mean, if we add STUMPY_RHO_PRECISION to config.py and use it to replace close-to-1 $\rho$ values with 1, it may wrongly impact other cases.
So, how about just handling identical cases? The following solution works for me, BUT it is not efficient for massive data, say 10M data points. I think larger data has a higher chance of hash collisions and would thus increase the computing time (?)
# in core.py
def preprocess_diagonal(T, m):
    ...
    # add a new part to find duplicates
    identical = {}
    for i in range(T.shape[0] - m + 1):
        identical[tuple(T[i : i + m])] = []
    for i in range(T.shape[0] - m + 1):
        identical[tuple(T[i : i + m])].append(i)

    # O(n) time complexity
    T_subseq_IdenticalCode = np.full(T.shape[0] - m + 1, -1, np.int64)
    for i, (key, item) in enumerate(identical.items()):
        for idx in item:
            T_subseq_IdenticalCode[idx] = i

    # return T_subseq_IdenticalCode in addition to the other outputs
Then, in stumpy._compute_diagonal, we check whether two subsequences are identical:

# in stumpy._compute_diagonal
if T_A_subseq_IdenticalCode[i + k] == T_A_subseq_IdenticalCode[i]:
    pearson = 1.0
# or we can do it at the end to correct the values of the matrix profile
An alternative is to let collisions happen, to make the process faster, and check them later:

# I think this is O(n) time complexity
l = T.shape[0] - m + 1
subsequence_hash = np.empty(l, dtype=np.int64)
for i in range(l):
    subsequence_hash[i] = hash(tuple(T[i : i + m]))
# there is a chance of hash collisions, but that is okay as we will check for them later
Then, we can correct the matrix profile (i.e., the matrix profile / left matrix profile / right matrix profile) at the end:

# I think this is O(n) time complexity, assuming that the z-normalized distance
# between two subsequences is O(1)
for row in range(l):
    idx = row
    for col in range(3):
        idx_nn = I[row, col]  # use the matrix profile index to find the NN of idx
        if subsequence_hash[idx] == subsequence_hash[idx_nn]:
            # the two subsequences might be identical since their hashes are the same
            if np.all(Tznorm[idx : idx + m] == Tznorm[idx_nn : idx_nn + m]):
                # the two subsequences are actually identical
                P[row, col] = 0
Is it worth it? We are increasing the time spent in the pre-processing and post-processing steps. (I believe both of these steps can be parallelized via prange.)
@NimaSarajpoor I don't think hashing is a good idea, as it is slow, and pushing subsequences into a dict may use a lot of memory as the length of the time series increases. Additionally, we'd need to account for z-normalization in these cases. Just because two subsequences aren't identical doesn't mean that their z-normalized subsequences aren't identical.
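To illustrate that last point with a tiny made-up example (not from the issue): two raw subsequences that differ can still be identical after z-normalization, so a purely raw-value duplicate check would miss them:

```python
import numpy as np

# S2 is just S1 scaled and shifted, so their raw values differ...
S1 = np.array([0.0, 1.0, 2.0, 1.0])
S2 = np.array([10.0, 30.0, 50.0, 30.0])

z1 = (S1 - S1.mean()) / S1.std()
z2 = (S2 - S2.mean()) / S2.std()

print(np.array_equal(S1, S2))  # False
print(np.allclose(z1, z2))     # True: identical after z-normalization
```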
I don't know if it's worth addressing or what the best option is. Setting a config.STUMPY_RHO_PRECISION (or some other name like config.STUMPY_PERFECT_CORRELATION = 0.9999999) might not be the worst idea. At least people can set it to 1.0 if they dislike the precision that we've chosen, and then we'd only need to check if $\rho$ > config.STUMPY_PERFECT_CORRELATION and shift it to 1.0, or something.
Alternatively, I wonder if we can check $\rho$ and, if it is above config.STUMPY_PERFECT_CORRELATION, compute it directly using the dot product instead of using the distance from the previous diagonal.
I was wondering whether what you have in mind is to check each single $\rho$ that is being calculated in stumpy.stump._compute_diagonal. If so, then the alternative approach

> Alternatively, I wonder if we can check $\rho$ and, if it is above config.STUMPY_PERFECT_CORRELATION, compute it directly using the dot product instead of using the distance from the previous diagonal.

might (hugely?) increase the computing time if the data mostly shows periodic behavior (because we would have many identical subsequences).
On the other hand, I think we also want to make sure the values in the final matrix profile are precise. I mean, if we say idx_nn = I[idx], then dist(T[idx : idx + m], T[idx_nn : idx_nn + m]) should be exact.
So, I think we can just re-calculate the distance at the end, after checking $\rho$ > config.STUMPY_PERFECT_CORRELATION.
(I know it would be better to re-calculate it throughout stumpy.stump._compute_diagonal, but I think that is too much for a rare case like this and, as I mentioned earlier, we may increase the computing time when there are many identical subsequences. So, just correcting the matrix profile at the end should be okay.)
I think the only thing we'd need to change is to add a line after L188 in _compute_diagonal:
https://github.com/TDAmeritrade/stumpy/blob/576de379994896063be71d264e48a08ae17476cf/stumpy/stump.py#L183-L188
if T_B_subseq_isfinite[i + k] and T_A_subseq_isfinite[i]:
    # Neither subsequence contains NaNs
    if T_B_subseq_isconstant[i + k] or T_A_subseq_isconstant[i]:
        pearson = 0.5
    else:
        pearson = cov * Σ_T_inverse[i + k] * σ_Q_inverse[i]
        if pearson > config.STUMPY_PERFECT_CORRELATION:
            pearson = 1.0
I think this is the simplest option and should incur a negligible cost, as there is one if comparison (which should be cheap??) and then a value is set (not computed).
This should be simple, easy to maintain, and easy to interpret, and if users don't like it, then config.STUMPY_PERFECT_CORRELATION is easy to understand.
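As a rough back-of-the-envelope check (my own numbers, not from the discussion): with the proposed config.STUMPY_PERFECT_CORRELATION = 0.9999999, the largest z-normalized distance that snapping could silently turn into 0 grows with the window size m as sqrt(2 * m * (1 - 0.9999999)):

```python
import numpy as np

STUMPY_PERFECT_CORRELATION = 0.9999999  # proposed value

for m in (3, 50, 500):
    # largest distance that the ρ-snapping could turn into an exact 0
    print(m, np.sqrt(2 * m * (1 - STUMPY_PERFECT_CORRELATION)))
# 3   -> ~0.00077
# 50  -> ~0.0032
# 500 -> ~0.01
```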
> I think we also want to make sure the values in the final matrix profile are precise

I don't know. Unfortunately, I don't think we'll get much more "precise". It's a trade-off between speed and a slight loss in precision (but not much, actually). In the worst case, if you have a long sine wave (time series) OR each subsequence has a perfect match, you'll end up computing all of the pairwise distances twice (once during _stump and then once during post-processing, if you do it your way!). I believe that my example above should be reasonably efficient, but one should verify.
As I think about it more, if we were to do this as a post-processing step, we would write a separate function from stumpy.stump. I would not add the post-processing step to the existing function. I think my proposal above is the way to go.
> I think that this is the simplest and should be a negligible cost as there is an if comparison (which should be cheap??) and then a value is set (not computed).

Correct. As you said, I think that approach is better and sufficient for such a rare case (you do not have to recalculate distances, and you modify the values throughout the process).
As a cross reference to #655, this related test is also failing:

def test_stumpi_identical_subsequence_self_join_egress():
    m = 3
    seed = np.random.randint(100000)
    seed = 84451
    np.random.seed(seed)
@seanlaw I think you may want to see this! Apparently, it is not just about identical subsequences. So, I created a test where I randomly scale up (× 1000) or scale down (× 0.001) chunks of the time series.
Here is the test function:
def test_stump_imprecision():
    seed = np.random.randint(100000)
    np.random.seed(seed)
    T = np.random.rand(64)

    # randomly choose some chunks of the time series and scale them up or down
    T_mask = np.full(len(T), 0, dtype=bool)
    k = np.random.randint(len(T) // 4, len(T))
    IDX = np.random.choice(np.arange(len(T)), k, replace=False)
    T_mask[IDX] = True
    slices = core._get_mask_slices(T_mask)
    scale = np.random.choice(np.array([0.001, 1000]), len(slices), replace=True)
    for i, (start, stop) in enumerate(slices):
        T[start:stop] = scale[i] * T[start:stop]

    m = 3
    zone = int(np.ceil(m / 4))
    ref_mp = naive.stump(T, m, exclusion_zone=zone)
    comp_mp = stump(T, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)

    ref_P = ref_mp[:, 0].astype(np.float64)
    comp_P = comp_mp[:, 0].astype(np.float64)
    npt.assert_almost_equal(ref_P, comp_P)
And, here is the error:
Mismatched elements: 37 / 62 (59.7%)
E Max absolute difference: 0.22458425
E Max relative difference: 0.78388078
E x: array([2.6611068e-02, 7.4557520e-05, 5.2291006e-02, 3.3167208e-02,
E 1.0766382e-04, 4.5937140e-02, 4.4588591e-02, 3.0992438e-04,
E 1.8665326e-01, 4.5937140e-02, 7.9446847e-02, 1.1810141e-01,...
E y: array([2.6611068e-02, 8.7222889e-05, 7.4846320e-02, 3.3167208e-02,
E 1.3051016e-04, 4.5937140e-02, 7.4861289e-02, 3.0992438e-04,
E 1.8665326e-01, 4.5937140e-02, 7.9446847e-02, 1.1810142e-01,...
Hmm, so what do you think is the issue? I'm assuming that naive.stump is "correct"?
> I'm assuming that naive.stump is "correct"?

Yeah, I think so. I took a look at naive.stump and it just z-normalizes each individual subsequence and then calculates the pairwise distances. So, I assume naive.stump should be correct.
According to my investigation, I think one problem is imprecision in the calculation of the sliding standard deviation Σ_T, particularly when we have large numbers. (Just for the record: even taking the std of each subsequence directly (instead of the rolling std) only reduces the error significantly; we still have error!)
One way to make it a little better is standardization of the whole time series T (so we scale the std of each subsequence by std(T)), and then rescale back, as in the sketch below.
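A minimal sketch of that first idea, assuming we simply divide T by its global std before taking the rolling std and multiply back afterwards (the helper name is mine; core.rolling_window is used only for illustration):

```python
import numpy as np
from stumpy import core

def rolling_std_via_global_scaling(T, m):
    # Scale the whole series by its global std so the per-window values are
    # closer to unit scale, take the rolling std there, then rescale back.
    global_std = np.nanstd(T)
    sigma_scaled = np.nanstd(core.rolling_window(T / global_std, m), axis=1)
    return sigma_scaled * global_std
```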
Another way that I think should work (I tested it to some extent) is to refine the Welford nanvar/nanstd. We can do this by calculating the rolling nanvar/nanstd of the subsequences, each scaled by its min-max range, and then scaling back at the end.
# in stumpy/core.py
@njit
def rolling_minmax_scale(a, m):
    """
    Docstring
    """
    l = len(a) - m + 1
    scale = np.empty(l, dtype=np.float64)
    for i in range(l):
        scale[i] = np.max(a[i : i + m]) - np.min(a[i : i + m])
        if scale[i] == 0:
            scale[i] = 1

    return scale
Then, we use this to modify Welford as follows:
def _welford_nanvar(a, w, a_subseq_isfinite):
    """
    Compute the rolling variance
    """
    all_variances = np.empty(a.shape[0] - w + 1, dtype=np.float64)
    prev_mean = 0.0
    prev_var = 0.0
    scale = rolling_minmax_scale(a, w)  # NEW

    for start_idx in range(a.shape[0] - w + 1):
        prev_start_idx = start_idx - 1
        stop_idx = start_idx + w  # Exclusive index value
        last_idx = start_idx + w - 1  # Last inclusive index value

        if (
            start_idx == 0
            or not a_subseq_isfinite[prev_start_idx]
            or not a_subseq_isfinite[start_idx]
        ):
            curr_mean = np.nanmean(a[start_idx:stop_idx])
            curr_var = np.nanvar(a[start_idx:stop_idx] / scale[start_idx])  # MODIFIED: minmax-scaled
        else:
            curr_mean = prev_mean + (a[last_idx] - a[prev_start_idx]) / w
            curr_var = (
                prev_var * np.square(scale[prev_start_idx] / scale[start_idx])
                + ((a[last_idx] - a[prev_start_idx]) / scale[start_idx])
                * (
                    (a[last_idx] - curr_mean + a[prev_start_idx] - prev_mean)
                    / scale[start_idx]
                )
                / w
            )  # MODIFIED: undo the previous scale and apply the new one, so curr_var is the var of the minmax-scaled subseq

        all_variances[start_idx] = curr_var
        prev_mean = curr_mean
        prev_var = curr_var

    return all_variances  # variance of each subsequence, where each subseq is scaled by its minmax range
And finally, reverse everything back to normal at the end:

def welford_nanstd(a, w=None):
    """
    This is a convenience wrapper around `welford_nanvar`.
    """
    if w is None:
        w = a.shape[0]
    a_subseq_isfinite = rolling_isfinite(a, w)
    scale = rolling_minmax_scale(a, w)  # NEW
    out = np.sqrt(np.clip(welford_nanvar(a, w), a_min=0, a_max=None))
    out[a_subseq_isfinite] = scale[a_subseq_isfinite] * out[a_subseq_isfinite]  # scale back to normal
    return out
For instance, comparing the proposed approach with the existing version:
- the imprecision test mentioned above, test_stump_imprecision(seed=0) --> existing version: max abs diff 0.36033988; proposed Welford approach: max abs diff 0.09287705
- the test function for the identical case (but without scaling random chunks by 1000 or 0.001) --> existing version: resolved by the new config variable; proposed Welford approach: no need to add the config variable. It is all good!
@seanlaw In another test that I designed, the proposed Welford performed worse, so you may ignore the idea (I just did not delete it, for our future reference).
Yeah, I'm pretty confident that Welford is about as precise as we're going to get and it has the best speed to precision tradeoff. 😢
@seanlaw As you mentioned, it seems that Welford is okay. I checked its test function and tried it on the time series from the test functions.
I think I found the issue. Note that in stumpy/stump.py, we do:
https://github.com/TDAmeritrade/stumpy/blob/d57169f3a8c3a1571c12d84f490e8c7e511b224e/stumpy/stump.py#L188
And it seems that high precision in the std does not guarantee high precision in its inverse when the std is small.
Example:
>>> T
array([2.42145823e-01, 2.88656701e-04, 9.34900364e-04, 9.34545813e-04,
       9.36884255e-04, 3.72888654e-04, 3.99196666e-04, 9.97234658e-04,
       2.04350128e-04, 8.63723288e-01, 1.86409747e-01, 2.88656701e+02,
       9.34900364e+02, 9.34545813e+02, 9.36884255e+02, 3.72888654e+02,
       3.99196666e+02, 9.97234658e+02, 2.04350128e+02, 6.26529775e-01])
Let's take a look at the std of this input T, with m = 3, at index 2 (i.e., for the subsequence T[2 : 5]):

# ref refers to the exact std
# comp refers to the Welford std
np.abs(ref_std[2] - comp_std[2]) = 1.2878556459058026e-12
np.abs(ref_std_inv[2] - comp_std_inv[2]) = 1.2162515548989177
And, if we compare the matrix profiles ref_P (from naive.stump) and comp_P (from stumpy.stump), we can see that npt.assert_almost_equal(...) fails at indices 2 and 12.
(Indices 2 and 12 are in fact the start indices of two identical subsequences, where one is scaled down by 0.001 (the subsequence at index 2) and the other is scaled up by 1000 (the subsequence at index 12).)
Ohh, wow! That's REALLY bad but also fun at the same time 😭
Yeah... I also tried many different ways to resolve this. I have not been successful so far.
FYI:
> (Indices 2 and 12 are in fact the start indices of two identical subsequences, where one is scaled down by 0.001 (the subsequence at index 2) and the other is scaled up by 1000 (the subsequence at index 12).)

If we scale up (down) by 10000 (0.0001), then np.abs(ref_std_inv[2] - comp_std_inv[2]) will be around 1000! (Btw, we should also consider the impact of cov in calculating pearson, as shown in the line of code provided in my previous comment.)
The exact distance between subseq_2 and subseq_12 is 1e-16, while stump (which is based on pearson) gives about 0.02.
> np.abs(ref_std[2] - comp_std[2]) = 1.2878556459058026e-12
> np.abs(ref_std_inv[2] - comp_std_inv[2]) = 1.2162515548989177

What happens when you print(1.0 / ref_std[2]) and print(1.0 / comp_std[2])?
Can you also print ref_std[2] and comp_std[2]? Or give me the seed to generate T precisely. I tried to copy/paste your T above but am not seeing the issue.
> What happens when you print(1.0 / ref_std[2]) and print(1.0 / comp_std[2])?

>>> 1.0 / ref_std[2]
971803.3300204428
>>> 1.0 / comp_std[2]
971802.1137688879
In case it matters:
Since we scale the whole identical sequence by 0.001, we are scaling down its standard deviation as well, so std(scaled_seq) is equal to 0.001 * std(seq). So, we are dealing with a smaller std, and hence its inverse becomes large. Let's say the actual value is $a$ and the value with imprecision is $a + \epsilon$; then the imprecision of the inverse is:
$\frac{1}{a} - \frac{1}{a + \epsilon} = \frac{\epsilon}{a(a + \epsilon)}$
So, if $a$ is itself small, the imprecision of the inverse becomes large.
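Plugging the numbers from above into that formula (an illustrative check, taking $a$ as the exact std, i.e. 1 / 971803.33, and $\epsilon$ as the reported std error):

```python
a = 1.0 / 971803.3300204428    # the exact (small) std from the example above
eps = 1.2878556459058026e-12   # absolute error of the Welford std

# error propagated to 1 / std
print(eps / (a * (a + eps)))   # ~1.216, matching np.abs(ref_std_inv[2] - comp_std_inv[2])
```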
Can you tell me what seed and code you are using to generate T?
I think this should give you the T you are looking for.

def identical_cases_with_scale():
    m = 3
    zone = int(np.ceil(m / 4))

    seed = 27343
    np.random.seed(seed)
    identical = np.random.rand(8)
    T_A = np.random.rand(20)
    T_A[1 : 1 + identical.shape[0]] = identical * 0.001
    T_A[11 : 11 + identical.shape[0]] = identical * 1000

    ref_mp = stump(T_A, m, exclusion_zone=zone, row_wise=True)
    comp_mp = stumpy.stump(T_A, m, ignore_trivial=True)
    replace_inf(ref_mp)
    replace_inf(comp_mp)

    return T_A, ref_mp, comp_mp
Hmm, I wonder, in compute_mean_std, whether AFTER we compute Σ_T (stddev) it would be worth it to:
- Check which stddev values are below some threshold
- Correct those stddevs

The reason why we use Welford is its speed and reasonable precision. As you identified, for the most part, what Welford produces is fine. However, for really small numbers, it's insufficient. And so maybe we add a correction where we use np.std for those specific windows?
Any thoughts?
So, after line #869 in core.compute_mean_std:
https://github.com/TDAmeritrade/stumpy/blob/67b822c9a6ec4a5128ad10c5bff287f0524109b5/stumpy/core.py#L860-L876
we can add something like this:

idx = np.where(Σ_T < config.STUMPY_STDDEV_CORRECTION_THRESHOLD)[0]  # config.STUMPY_STDDEV_CORRECTION_THRESHOLD = 1e-5
if len(idx) > 0:
    Σ_T[idx] = np.std(core.rolling_window(T, m)[idx], axis=1)
This might not work for the multi-dimensional case but I hope you can get my point
> This might not work for the multi-dimensional case but I hope you can get my point

After I saw your suggestion, I tried it by calculating an upper bound for the std, std_ub, in welford, and then I checked whether std_ub is below the threshold 1e-5. In that case, I simply did np.nanvar(..). This fixes the issue, and the test function with scaled identical sequences now passes. A sketch of this kind of correction is shown below.
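For concreteness, here is one possible shape of such a correction as a standalone post-processing step (my own sketch: the function name, the Popoviciu-style upper bound std <= (max - min) / 2, and the 1e-5 threshold are illustrative; the actual std_ub inside welford may be computed differently):

```python
import numpy as np
from stumpy import core

def correct_small_stddev(T, m, Σ_T, threshold=1e-5):
    windows = core.rolling_window(T, m)
    # The std of a window can never exceed half its range, so a tiny range
    # flags windows where the rolling/Welford std may be unreliable.
    std_ub = (np.max(windows, axis=1) - np.min(windows, axis=1)) / 2.0
    idx = np.flatnonzero(std_ub < threshold)
    if idx.size:
        # Recompute only these (rare) windows exactly, which is slower but safe
        Σ_T[idx] = np.nanstd(windows[idx], axis=1)
    return Σ_T
```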
Now, there is one more thing we might be interested in taking care of. See the test function below:
def test_stump_volatile_case1():
    seed = 0
    np.random.seed(seed)
    T = np.random.rand(64)

    T_mask = np.full(len(T), 0, dtype=bool)
    k = np.random.randint(len(T) // 4, len(T))
    IDX = np.random.choice(np.arange(len(T)), k, replace=False)
    T_mask[IDX] = True
    slices = core._get_mask_slices(T_mask)
    scale = np.random.choice(np.array([0.001, 1000]), len(slices), replace=True)
    for i, (start, stop) in enumerate(slices):
        T[start:stop] = scale[i] * T[start:stop]

    m = 3
    zone = int(np.ceil(m / 4))
    ref_mp = naive.stump(T, m, exclusion_zone=zone)
    comp_mp = stump(T, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)

    ref_P = ref_mp[:, 0].astype(np.float64)
    comp_P = comp_mp[:, 0].astype(np.float64)
    npt.assert_almost_equal(ref_P, comp_P)
Here, I randomly scaled the values up or down.
Btw, I just realized a cleaner way would have been to do:

scale = np.random.choice(np.array([0.001, 1, 1000]), len(T), replace=True)
T = T * scale
Okay, let's get back to our main topic. This test function still fails, even though I used the exact mean and std, i.e., np.nanmean(rolling_window(...), axis=1) and np.sqrt(np.nanvar(rolling_window(...), axis=1)). I wanted to make sure I was not introducing any inaccuracy from these two parameters.
Here is the error:
Mismatched elements: 11 / 62 (17.7%)
E Max absolute difference: 0.03931141
E Max relative difference: 0.65784229
This shows that the pearson approach, i.e., using the rolling cov, is probably another source of imprecision. I haven't investigated this latter test case, so I don't know what is happening behind the scenes!
Side note:
It is worth noting that the matrix profile indices are the same in this case. (So, we can keep in mind that our last resort can be recalculating distances based on the nearest-neighbor indices provided in I, similar to what we did in stumpi to retrieve the left matrix profile values. A sketch of such a post-processing step is shown below.)
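As a reference for that last-resort idea, a post-processing step along those lines might look like the following (a hypothetical helper, not STUMPY API; it assumes I holds the nearest-neighbor index of each subsequence and recomputes each matrix profile value from scratch with core.z_norm):

```python
import numpy as np
from stumpy import core

def recompute_P_from_I(T, m, I):
    l = T.shape[0] - m + 1
    P = np.full(l, np.inf, dtype=np.float64)
    for i in range(l):
        j = I[i]
        if j < 0:  # no valid nearest neighbor recorded
            continue
        # Exact z-normalized Euclidean distance between the subsequence and its NN
        Q = core.z_norm(T[i : i + m])
        R = core.z_norm(T[j : j + m])
        P[i] = np.linalg.norm(Q - R)
    return P
```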
@seanlaw
Btw, I would like to mention that the current solution, i.e., checking the std and re-calculating its exact value, may not be efficient if ALL VALUES of the time series are small, e.g. T = [0.001, 0.003, -0.002, 0, 0.001, ...]. In this case, we would end up calculating the exact value of the std for all subsequences (so, not fast)! I know this is a very rare case, but I just wanted to inform you of it. I have some ideas, but they are not well developed :)
Yeah, I don't think we should move away from Welford because it is not only reasonably precise but it also doesn't use up all of our memory. With nanstd, it can use a ton of memory when we have a sufficiently large T. At some point, we just can't fix all precision issues. We can try our best and leave it at that. We also have to ask ourselves how likely we are to come across strange time series data like the contrived examples that we have.
Alternatively, can we somehow avoid the inverse and still use cov?
> We also have to ask ourselves how likely we are to come across strange time series data like the contrived examples that we have.

Correct! I think the current version is already good and, because of that, we should just try to improve it without drastically sacrificing speed.

> Can we somehow avoid the inverse

Yeah... I think that might help if it is possible. I thought about it and even tried to re-arrange the equations in the code, but that did not help. I will think about it again.
Update
> > This might not work for the multi-dimensional case but I hope you can get my point
>
> After I saw your suggestion, I tried it by calculating an upper bound for the std, std_ub, in welford, and then I checked whether std_ub is below the threshold 1e-5. In that case, I simply did np.nanvar(..). This fixes the issue, and the test function with scaled identical sequences now passes. Now, there is one more thing we might be interested in taking care of. See the test function below:
>
> def test_stump_volatile_case1():

The test function def test_stump_volatile_case1() fails even when the mean and std are calculated naively AND fastmath is turned off.
> The test function def test_stump_volatile_case1() fails even when the mean and std are calculated naively AND fastmath is turned off.

Can you help me understand what that means and if there is anything we could do?

> Can you help me understand what that means and if there is anything we could do?

So, I set fastmath=False to remove its impact on imprecision, if there is any. I also avoided using the rolling approach for calculating the "mean" and "std" of the subsequences. And... I still got the error.
I am still investigating this. I just wanted to update you and seek your opinion.