aeon icon indicating copy to clipboard operation
aeon copied to clipboard

[ENH] Added R-Clustering clusterer to aeon

Open Ramana-Raja opened this issue 1 year ago • 40 comments

Reference Issues/PRs

#2132

What does this implement/fix? Explain your changes.

added R clustering model for aeon

Does your contribution introduce a new dependency? If yes, which one?

no

Any other comments?

PR checklist

For all contributions
  • [ ] I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you.
  • [x] The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.
For new estimators and functions
  • [ ] I've added the estimator to the online API documentation.
  • [ ] (OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access
  • [ ] (OPTIONAL) I've updated aeon's CODEOWNERS to receive notifications about future changes to these files.

Ramana-Raja avatar Nov 22 '24 22:11 Ramana-Raja

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ $\color{#FEF1BE}{\textsf{enhancement}}$ ]. I have added the following labels to this PR based on the changes made: [ $\color{#4011F3}{\textsf{clustering}}$ ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • [ ] Run pre-commit checks for all files
  • [ ] Run mypy typecheck tests
  • [ ] Run all pytest tests and configurations
  • [ ] Run all notebook example tests
  • [ ] Run numba-disabled codecov tests
  • [ ] Stop automatic pre-commit fixes (always disabled for drafts)
  • [ ] Disable numba cache loading
  • [ ] Push an empty commit to re-run CI checks

aeon-actions-bot[bot] avatar Nov 22 '24 22:11 aeon-actions-bot[bot]

hi, thanks for this but if we include this clusterer we want it to use our version of Rocket transformers which are optimised for numba

TonyBagnall avatar Nov 27 '24 08:11 TonyBagnall

hi, thanks for this but if we include this clusterer we want it to use our version of Rocket transformers which are optimised for numba

sure, I will try to reimplement it and use aeon Rocket transformers

Ramana-Raja avatar Nov 27 '24 08:11 Ramana-Raja

Hi @Ramana-Raja thanks for this PR! Just looked at your code and it looks like it follow the original. However, the original code is a reimplementation of MINIROCKET. We don't really want to have another copy of MINIROCKET in the repository to maintain. See: https://www.aeon-toolkit.org/en/stable/examples/transformations/minirocket.html

When we were thinking about implementing R-Clustering we wanted to try find a way to use MINIROCKET (or at least some of its functions). I haven't looked into seeing how possible this is. Would it be possible to use MINIROCKET or some of its functions in this implementation?

chrisholder avatar Nov 27 '24 10:11 chrisholder

Hi @Ramana-Raja thanks for this PR! Just looked at your code and it looks like it follow the original. However, the original code is a reimplementation of MINIROCKET. We don't really want to have another copy of MINIROCKET in the repository to maintain. See: https://www.aeon-toolkit.org/en/stable/examples/transformations/minirocket.html

When we were thinking about implementing R-Clustering we wanted to try find a way to use MINIROCKET (or at least some of its functions). I haven't looked into seeing how possible this is. Would it be possible to use MINIROCKET or some of its functions in this implementation?

Just to confirm, you’re asking if MINIROCKET or its functions are being used here? I’ve already updated the code in my PR to use the existing MINIROCKET functions. Let me know if there’s anything else you’d like me to tweak

Ramana-Raja avatar Nov 27 '24 11:11 Ramana-Raja

Just to confirm, you’re asking if MINIROCKET or its functions are being used here? I’ve already updated the code in my PR to use the existing MINIROCKET functions. Let me know if there’s anything else you’d like me to tweak

R-Clustering uses MiniRocket with 3 or so small modifications (They comment in the original #modification for their changes). I want to try reuse the current implementation and functions as much as possible. This may NOT be possible but I would like to try reuse as much code as possible.

chrisholder avatar Nov 27 '24 11:11 chrisholder

Just to confirm, you’re asking if MINIROCKET or its functions are being used here? I’ve already updated the code in my PR to use the existing MINIROCKET functions. Let me know if there’s anything else you’d like me to tweak

R-Clustering uses MiniRocket with 3 or so small modifications (They comment in the original #modification for their changes). I want to try reuse the current implementation and functions as much as possible. This may NOT be possible but I would like to try reuse as much code as possible.

Got it! I’ll do my best to reuse as much of the existing code as possible.

Ramana-Raja avatar Nov 27 '24 11:11 Ramana-Raja

Just to confirm, you’re asking if MINIROCKET or its functions are being used here? I’ve already updated the code in my PR to use the existing MINIROCKET functions. Let me know if there’s anything else you’d like me to tweak

R-Clustering uses MiniRocket with 3 or so small modifications (They comment in the original #modification for their changes). I want to try reuse the current implementation and functions as much as possible. This may NOT be possible but I would like to try reuse as much code as possible.

Hi, I've been trying to reuse as much code as possible, but I’ve noticed that some of the tests are failing. I've attempted to fix the issues, but haven't had any success so far. Do you have any ideas on what might be causing the failures? Any recommendations on what I should check or approaches I could try?

Ramana-Raja avatar Nov 30 '24 13:11 Ramana-Raja

@chrisholder @TonyBagnall Tests seem to be passing here and changes made, good to take a look when you have the time.

MatthewMiddlehurst avatar Dec 04 '24 10:12 MatthewMiddlehurst

Hi, @TonyBagnall @chrisholder I've implemented all the required changes and ensured that the tests are passing successfully. Could you please review the updates and let me know if there are any additional adjustments or improvements needed?

Ramana-Raja avatar Dec 11 '24 20:12 Ramana-Raja

Hi @Ramana-Raja thanks for working on this! Sorry it's taking me a while to look at this. I will look at this tomorrow (friday) and get back to you! Hopefully we can try merge it over this weekend/early next week!

chrisholder avatar Dec 12 '24 17:12 chrisholder

Hi @Ramana-Raja thanks for working on this! Sorry it's taking me a while to look at this. I will look at this tomorrow (friday) and get back to you! Hopefully we can try merge it over this weekend/early next week!

No problem at all, and thank you for the update! Please take your time to review.

Ramana-Raja avatar Dec 12 '24 20:12 Ramana-Raja

@MatthewMiddlehurst can you have a look at this error I'm unsure why it is happening. How did you get this to work with the other feature based clusterers?

chrisholder avatar Dec 14 '24 11:12 chrisholder

Hi @chrisholder, I discovered that the issue was caused by how self._random_state = check_random_state(self.random_state) was being handled. Removing it resolved the problem, though I'm not entirely sure why. Perhaps @MatthewMiddlehurst could provide some insights. Additionally, I removed the estimator as an input for the class, as the research paper only utilized KMeans

Ramana-Raja avatar Dec 17 '24 11:12 Ramana-Raja

Looks good - Final thing before approving have you checked equivalence to the original. It's doesnt have to be completely 100% the same (due to random state). However, we should check for 10 or so datasets (which you can load from aeon.datasets) that the scores are similar.

This doesn't need to be an actual test in aeon but if you could run it locally and post the adjusted rand index using sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) the we can check this version matches the original!

chrisholder avatar Dec 17 '24 17:12 chrisholder

Looks good - Final thing before approving have you checked equivalence to the original. It's doesnt have to be completely 100% the same (due to random state). However, we should check for 10 or so datasets (which you can load from aeon.datasets) that the scores are similar.

This doesn't need to be an actual test in aeon but if you could run it locally and post the adjusted rand index using sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) the we can check this version matches the original!

I tested the ARI scores between the this RCluster and the original RCluster on the experimental dataset provided in the original RCluster code. Here are the results: ari_scores

Ramana-Raja avatar Dec 18 '24 17:12 Ramana-Raja

@chrisholder While reviewing the code, I identified an architectural issue. The R cluster implementation cannot have separate fit and predict methods because it relies on PCA for dimensionality reduction. For example, if PCA determines the optimal dimension during training to be 13, and we attempt to predict on test data with fewer than 13 dimensions, it results in an error. Even if we create a new PCA and apply fit_transform on the test data, we would need to retrain KMeans, which is what I did. However, I don't believe this is an optimal solution. Could you suggest a better approach to handle this scenario?

Ramana-Raja avatar Dec 31 '24 17:12 Ramana-Raja

@chrisholder While reviewing the code, I identified an architectural issue. The R cluster implementation cannot have separate fit and predict methods because it relies on PCA for dimensionality reduction. For example, if PCA determines the optimal dimension during training to be 13, and we attempt to predict on test data with fewer than 13 dimensions, it results in an error. Even if we create a new PCA and apply fit_transform on the test data, we would need to retrain KMeans, which is what I did. However, I don't believe this is an optimal solution. Could you suggest a better approach to handle this scenario?

PCA has a separate fit and transform step, so you can PCA fit_transform in fit, save the transform then just PCA transform in transform?

TonyBagnall avatar Dec 31 '24 18:12 TonyBagnall

@chrisholder While reviewing the code, I identified an architectural issue. The R cluster implementation cannot have separate fit and predict methods because it relies on PCA for dimensionality reduction. For example, if PCA determines the optimal dimension during training to be 13, and we attempt to predict on test data with fewer than 13 dimensions, it results in an error. Even if we create a new PCA and apply fit_transform on the test data, we would need to retrain KMeans, which is what I did. However, I don't believe this is an optimal solution. Could you suggest a better approach to handle this scenario?

PCA has a separate fit and transform step, so you can PCA fit_transform in fit, save the transform then just PCA transform in transform?

If we transform test data(while predicting that is predicting this test data without fitting it using _predict method) with a number of features less than the n_components of PCA, it will cause an error, So we cannot predict test data having number of features less than that of n_components of PCA.

Ramana-Raja avatar Dec 31 '24 19:12 Ramana-Raja

Please respond to the reviews in text instead of just resolving them unless it is a simple change. Feel free to ask for clarification. There are still issues IMO.

If we transform test data(while predicting that is predicting this test data without fitting it using _predict method) with a number of features less than the n_components of PCA, it will cause an error, So we cannot predict test data having number of features less than that of n_components of PCA.

Why would there be less features in predict?

MatthewMiddlehurst avatar Jan 09 '25 14:01 MatthewMiddlehurst

Please respond to the reviews in text instead of just resolving them unless it is a simple change. Feel free to ask for clarification. There are still issues IMO.

If we transform test data(while predicting that is predicting this test data without fitting it using _predict method) with a number of features less than the n_components of PCA, it will cause an error, So we cannot predict test data having number of features less than that of n_components of PCA.

Why would there be less features in predict?

Considering the case where we train the clusterer on a large dataset but only need to make predictions on a smaller dataset with fewer features

Ramana-Raja avatar Jan 09 '25 18:01 Ramana-Raja

Bit confused on what you mean by fewer features. The algorithm explicitly does not allow unequal length series in the tags.

Also, this appears to be done after the ROCKET transform, so I don't think it would have a different number of features either way?

MatthewMiddlehurst avatar Jan 11 '25 00:01 MatthewMiddlehurst

Bit confused on what you mean by fewer features. The algorithm explicitly does not allow unequal length series in the tags.

Also, this appears to be done after the ROCKET transform, so I don't think it would have a different number of features either way?

Yes, I was referring to the data after it has been transformed by ROCKET. If the number of features ends up being lower than PCA's n_component, it can cause some issues. I fixed it by training a new KMeans model, but do you have any suggestions for a better method?(It was one of error caused in testing,when I did not add this exception handling)

Ramana-Raja avatar Jan 11 '25 21:01 Ramana-Raja

ROCKET should not be producing a different amount of features. No fitting should be going on in predict here I feel.

MatthewMiddlehurst avatar Jan 11 '25 22:01 MatthewMiddlehurst

ROCKET should not be producing a different amount of features. No fitting should be going on in predict here I feel.

Removing the fitting step in the predict function,results in the following error: ValueError: n_components=9 must be between 0 and min(n_samples, n_features)=5 with svd_solver='full'(in testing as you can see below). I thought this could be resolved by fitting a new KMeans model. However, if you have a better approach, could you please share it

Ramana-Raja avatar Jan 12 '25 09:01 Ramana-Raja

I dont really have the capacity to debug this currently. It still looks like there are transforms being fit in predict, which again does not look correct to me. If ROCKET is producing different size feature vectors that is incorrect.

MatthewMiddlehurst avatar Feb 12 '25 23:02 MatthewMiddlehurst

@MatthewMiddlehurst I've resolved the PCA issue, and all test cases are now passing. I also added random_state in test case default param, similar to other clustering models in Aeon, to fix the test case error where estimator.labels_ was not matching estimator.fit(data). If you have some time, could you review the code and let me know if any improvements are needed?

Ramana-Raja avatar Mar 02 '25 15:03 Ramana-Raja

@MatthewMiddlehurst I've resolved the PCA issue, and all test cases are now passing. I also added random_state in test case default param, similar to other clustering models in Aeon, to fix the test case error where estimator.labels_ was not matching estimator.fit(data). If you have some time, could you review the code and let me know if any improvements are needed?

Hi @MatthewMiddlehurst , just checking in, kindly following up on this PR when you have a moment.

Ramana-Raja avatar Apr 04 '25 16:04 Ramana-Raja

Hi, this is a complex PR and the project is currently very busy so it is unlikely this will be in soon. I have left a few comments but I don't imagine they will be the last.

I see you linked some results above at some point, but This seems to be for just one of the estimators? Not sure if that is this or the original code. One of the things I am going to ask for before merging is a comparison for both this and the original so that is also something you can do.

The results were comparing how accurate or similar the original implementation is to this one. And for some reason if I dont set random state the test case fail at "assert np.array_equal(estimator.labels_, estimator.predict(data))", I am not sure why. btw, thanks for taking the time to review this.

Ramana-Raja avatar Apr 04 '25 19:04 Ramana-Raja

I see the image you posted, but it only has one column of ARI scores. I'm not sure what those scores are for, the clusters produced by your or produced by the original. We would want them for both so we can compare.

Test failure seems legit. _labels and predict output should be the same.

MatthewMiddlehurst avatar Apr 04 '25 22:04 MatthewMiddlehurst