aeon icon indicating copy to clipboard operation
aeon copied to clipboard

[BUG][Hidalgo] AssertionError: assert rmax > 0

Open kir0ul opened this issue 3 months ago • 6 comments

Describe the bug

I'm trying to use Hidalgo to segment our dataset, but I get an assert rmax > 0 error.

Steps/Code to reproduce the bug

import json
import pandas as pd
from aeon.segmentation import HidalgoSegmenter
from sklearn.preprocessing import MinMaxScaler

# Data subset
data_dict = json.loads('{"x":{"0":0.4257669507,"1":0.4253725033,"2":0.4252818198,"3":0.4252818198,"4":0.425349677,"5":0.4253740215,"6":0.4254784689,"7":0.4254784689,"8":0.4255383361,"9":0.4254318762,"10":0.4256996914,"11":0.4256130534,"12":0.4256130534,"13":0.4253865468,"14":0.4253865468,"15":0.4252407974,"16":0.4252407974,"17":0.4257405957,"18":0.4257405957,"19":0.4255358599,"20":0.4251156221,"21":0.4251156221,"22":0.4253575551,"23":0.4255880915,"24":0.4255663679,"25":0.4256628462,"26":0.4256628462,"27":0.4259410406,"28":0.4267114539,"29":0.4277039868,"30":0.4284604203,"31":0.4284604203,"32":0.4302800694,"33":0.4322460778,"34":0.4346705172,"35":0.4346705172,"36":0.4371808068,"37":0.4392193531,"38":0.4419787566,"39":0.44394467,"40":0.4455386315,"41":0.4455386315,"42":0.4484341572,"43":0.4507840657,"44":0.4507840657,"45":0.4532244633,"46":0.4549927733,"47":0.4576848292,"48":0.4610286743,"49":0.4663468167},"y":{"0":0.2433640871,"1":0.2433817587,"2":0.243515473,"3":0.243515473,"4":0.243330775,"5":0.2433764023,"6":0.2433758396,"7":0.2433758396,"8":0.2436249834,"9":0.2434208001,"10":0.2431829586,"11":0.243519011,"12":0.243519011,"13":0.2435338798,"14":0.2435338798,"15":0.2427958052,"16":0.2427958052,"17":0.2434832522,"18":0.2434832522,"19":0.243382383,"20":0.2433153482,"21":0.2433153482,"22":0.2434281222,"23":0.2436008387,"24":0.2429868484,"25":0.2437683391,"26":0.2437683391,"27":0.2439072948,"28":0.242919792,"29":0.2433073901,"30":0.2428429508,"31":0.2428429508,"32":0.2436557076,"33":0.2433159762,"34":0.2428919038,"35":0.2428919038,"36":0.2435177615,"37":0.2428567016,"38":0.2434209507,"39":0.2434055484,"40":0.2434657139,"41":0.2434657139,"42":0.2440847695,"43":0.2442286093,"44":0.2442286093,"45":0.2448578754,"46":0.2445213732,"47":0.2448515841,"48":0.2466578652,"49":0.2475578949},"z":{"0":0.9763806256,"1":0.976541671,"2":0.9766390753,"3":0.9766390753,"4":0.9761048068,"5":0.9765859558,"6":0.9765303357,"7":0.9765303357,"8":0.9765715424,"9":0.9762428674,"10":0.9767105624,"11":0.9763704237,"12":0.9763704237,"13":0.9764624393,"14":0.9764624393,"15":0.9762103707,"16":0.9762103707,"17":0.9764087797,"18":0.9764087797,"19":0.976436962,"20":0.9764482037,"21":0.9764482037,"22":0.9764192093,"23":0.9767144461,"24":0.9764186015,"25":0.9765317889,"26":0.9765317889,"27":0.9763026984,"28":0.9767263621,"29":0.9761456476,"30":0.9767367686,"31":0.9767367686,"32":0.976554542,"33":0.9760865449,"34":0.9762832482,"35":0.9762832482,"36":0.9763611054,"37":0.9761762628,"38":0.9762820997,"39":0.9757640754,"40":0.9758159129,"41":0.9758159129,"42":0.9757864022,"43":0.975320887,"44":0.975320887,"45":0.9757195712,"46":0.9755819628,"47":0.9758874947,"48":0.9760461699,"49":0.9754331639}}')

# Preprocess
X_df = pd.DataFrame(data_dict)
X_df_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_df)

# Segmentation
hidalgo = HidalgoSegmenter(K=len(labels), q=3, n_iter=2000, burn_in=0.8)
cps = hidalgo.fit_predict(X_df_scaled, axis=0)
print("Found change points:", cps)

Expected results

No error is thrown

Actual results

[/home/user/Projects/segmentation-baselines/venv/lib64/python3.11/site-packages/aeon/segmentation/_hidalgo.py:176](http://localhost:8889/lab/tree/venv/lib64/python3.11/site-packages/aeon/segmentation/_hidalgo.py#line=175): RuntimeWarning: divide by zero encountered in divide
  mu = np.divide(distances[:, 2], distances[:, 1])

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[28], line 1
----> 1 cps = hidalgo.fit_predict(X_reduced_scaled, axis=0)
      2 print("Found change points:", cps)

File [~/Projects/segmentation-baselines/venv/lib64/python3.11/site-packages/aeon/segmentation/base.py:160](http://localhost:8889/lab/tree/venv/lib64/python3.11/site-packages/aeon/segmentation/base.py#line=159), in BaseSegmenter.fit_predict(self, X, y, axis)
    157 """Fit segmentation to data and return it."""
    158 # Non-optimized default implementation; override when a better
    159 # method is possible for a given algorithm.
--> 160 self.fit(X, y, axis=axis)
    161 return self.predict(X, axis=axis)

File [~/Projects/segmentation-baselines/venv/lib64/python3.11/site-packages/aeon/segmentation/base.py:122](http://localhost:8889/lab/tree/venv/lib64/python3.11/site-packages/aeon/segmentation/base.py#line=121), in BaseSegmenter.fit(self, X, y, axis)
    120 if y is not None:
    121     y = self._check_y(y)
--> 122 self._fit(X=X, y=y)
    123 self.is_fitted = True
    124 return self

File [~/Projects/segmentation-baselines/venv/lib64/python3.11/site-packages/aeon/segmentation/_hidalgo.py:607](http://localhost:8889/lab/tree/venv/lib64/python3.11/site-packages/aeon/segmentation/_hidalgo.py#line=606), in HidalgoSegmenter._fit(self, X, y)
    604 maxlik = -1e10
    606 for _ in range(n_replicas):
--> 607     sampling = self._gibbs_sampling(
    608         N,
    609         mu,
    610         Iin,
    611         Iout,
    612         Iout_count,
    613         Iout_track,
    614         V,
    615         NN,
    616         a1,
    617         b1,
    618         c1,
    619         Z,
    620         f1,
    621         N_in,
    622         _rng,
    623     )
    624     sampling = np.reshape(sampling, (n_iter, Npar))
    626     idx = [
    627         it
    628         for it in range(n_iter)
    629         if it % sampling_rate == 0 and it >= n_iter * burn_in
    630     ]

File [~/Projects/segmentation-baselines/venv/lib64/python3.11/site-packages/aeon/segmentation/_hidalgo.py:514](http://localhost:8889/lab/tree/venv/lib64/python3.11/site-packages/aeon/segmentation/_hidalgo.py#line=513), in HidalgoSegmenter._gibbs_sampling(self, N, mu, Iin, Iout, Iout_count, Iout_track, V, NN, a1, b1, c1, Z, f1, N_in, _rng)
    511     return lik0, lik1
    513 for it in range(n_iter):
--> 514     d = sample_d(K, a1, b1, _rng)
    515     sampling = np.append(sampling, d)
    517     (p, pp) = sample_p(K, p, pp, c1, _rng)

File [~/Projects/segmentation-baselines/venv/lib64/python3.11/site-packages/aeon/segmentation/_hidalgo.py:343](http://localhost:8889/lab/tree/venv/lib64/python3.11/site-packages/aeon/segmentation/_hidalgo.py#line=342), in HidalgoSegmenter._gibbs_sampling.<locals>.sample_d(K, a1, b1, _rng)
    340 rmax = (a1[k] - 1) [/](http://localhost:8889/) b1[k]
    342 if a1[k] - 1 > 0:
--> 343     assert rmax > 0
    344     frac = np.exp(
    345         -b1[k] * (r1 - rmax)
    346         - (a1[k] - 1) * (np.log(rmax) - np.log(r1))
    347     )
    348 else:

AssertionError:

Versions


System:
    python: 3.11.13 (main, Jun 09 2025, 17:26:24) [GCC]
executable: /home/user/Projects/segmentation-baselines/venv/bin/python3.11
   machine: Linux-6.15.8-1-default-x86_64-with-glibc2.41
Python dependencies:
         aeon: 1.2.0
          pip: 25.2
   setuptools: 65.5.0
 scikit-learn: 1.6.1
        numpy: 2.2.6
        numba: 0.61.2
        scipy: 1.15.3
       pandas: 2.3.1

kir0ul avatar Sep 05 '25 16:09 kir0ul

Thanks for the usage example, I will have a look soon hopefully and get back to you.

MatthewMiddlehurst avatar Sep 05 '25 21:09 MatthewMiddlehurst

Managed to reproduce this. Usage example does not provide labels but replacing k with 3 gives the same error. Data shape is 3 channels and 50 time points.

Unfortunately, I believe the contributor who implemented this back before we forked (@KatieBuc) is no longer active, and I do not have enough knowledge of the algorithm to diagnose this without putting in more time than I have available currently. The segmentation module as a whole is a bit ownerless currently.

Don't mind helping if someone wants to take a crack at it. Could be either a code bug or incorrect parameterisation which needs a better error.

MatthewMiddlehurst avatar Sep 11 '25 10:09 MatthewMiddlehurst

hi, I'm travelling this week so will try and take a look at this, thanks for the spot

TonyBagnall avatar Sep 15 '25 08:09 TonyBagnall

yes I can also reproduce this with that data, but not in all cases

X=np.random.rand(50, 3)
hidalgo = HidalgoSegmenter(K=3, q=3, n_iter=2000, burn_in=0.8)
cps = hidalgo.fit_predict(X, axis=0)
print("Found change points:", cps)

works. So may be a data driven edge case not handled properly. Will continue to dig. I dont like asserts tests like this, it should fail with some explanation as to why if its invalid, or not do it if its a bug :)

so caused by an infinity here

Image

TonyBagnall avatar Sep 15 '25 17:09 TonyBagnall

so its caused in the initialisation by this operation returning an infinite

        V = [sum(np.log(mu[[Z[i] == k for i in range(N)]])) for k in range(K)]

will look further, this code could do with a refactor ....

its in the mu calculation returning infinites

        N, mu, Iin, Iout, Iout_count, Iout_track = self._get_neighbourhood_params(X)
    mu : np.ndarray
        1D np.ndarray of length m. parameter in Pereto distribution estimated by
        ``r2/r1``

so r1 is zero at some point, so distances are zero in this line

        nbrs = NearestNeighbors(
            n_neighbors=q + 1, algorithm="ball_tree", metric=metric
        ).fit(X)
        distances, Iin = nbrs.kneighbors(X)
        mu = np.divide(distances[:, 2], distances[:, 1])

TonyBagnall avatar Sep 15 '25 17:09 TonyBagnall

so this is causing it

        distances, Iin = nbrs.kneighbors(X)
        mu = np.divide(distances[:, 2], distances[:, 1])

because distances[:1] contains zeros

Image

so I suspect its caused by duplicates in the training data leading to zero distances. Seems this is completely valid, so should not really do this division. We could just add a tolerance

den = distances[:, 1]
num = distances[:, 2]

# Safe divide: put NaN where denominator is zero
mu = np.divide(num, den, out=np.full_like(num, np.nan), where=den != 0)

# Or add a tiny epsilon if that’s acceptable for your model
eps = 1e-12
mu_eps = num / (den + eps)

This will at least stop the crash, or we can do something else, not sure what the hell is going on with this algorithm, who is Hidalgo anyway :) https://en.wikipedia.org/wiki/Hidalgo_(film)

TonyBagnall avatar Sep 15 '25 17:09 TonyBagnall