[ENH] Added R-Clustering clusterer to aeon
Reference Issues/PRs
#2132
What does this implement/fix? Explain your changes.
Added the R-Clustering model to aeon.
Does your contribution introduce a new dependency? If yes, which one?
no
Any other comments?
PR checklist
For all contributions
- [ ] I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you.
- [x] The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.
For new estimators and functions
- [ ] I've added the estimator to the online API documentation.
- [ ] (OPTIONAL) I've added myself as a `__maintainer__` at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access
- [ ] (OPTIONAL) I've updated aeon's CODEOWNERS to receive notifications about future changes to these files.
Thank you for contributing to aeon
I have added the following labels to this PR based on the title: [ $\color{#FEF1BE}{\textsf{enhancement}}$ ]. I have added the following labels to this PR based on the changes made: [ $\color{#4011F3}{\textsf{clustering}}$ ]. Feel free to change these if they do not properly represent the PR.
The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.
If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.
Don't hesitate to ask questions on the aeon Slack channel if you have any.
PR CI actions
These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.
- [ ] Run `pre-commit` checks for all files
- [ ] Run `mypy` typecheck tests
- [ ] Run all `pytest` tests and configurations
- [ ] Run all notebook example tests
- [ ] Run numba-disabled `codecov` tests
- [ ] Stop automatic `pre-commit` fixes (always disabled for drafts)
- [ ] Disable numba cache loading
- [ ] Push an empty commit to re-run CI checks
Hi, thanks for this, but if we include this clusterer we want it to use our version of the Rocket transformers, which are optimised for numba.
Sure, I will try to reimplement it using the aeon Rocket transformers.
Hi @Ramana-Raja thanks for this PR! Just looked at your code and it looks like it follows the original. However, the original code is a reimplementation of MINIROCKET. We don't really want to have another copy of MINIROCKET in the repository to maintain. See: https://www.aeon-toolkit.org/en/stable/examples/transformations/minirocket.html
When we were thinking about implementing R-Clustering we wanted to try to find a way to use MINIROCKET (or at least some of its functions). I haven't looked into how possible this is. Would it be possible to use MINIROCKET or some of its functions in this implementation?
Just to confirm, you’re asking if MINIROCKET or its functions are being used here? I’ve already updated the code in my PR to use the existing MINIROCKET functions. Let me know if there’s anything else you’d like me to tweak.
R-Clustering uses MiniRocket with three or so small modifications (they mark their changes in the original with # modification comments). I want to try to reuse the current implementation and functions as much as possible. This may NOT be possible, but I would like to try to reuse as much code as possible.
Got it! I’ll do my best to reuse as much of the existing code as possible.
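For reference, a minimal sketch of what reusing the existing transformer could look like, assuming aeon's current module layout; the import path and dataset name are illustrative, not this PR's code:

```python
# A rough sketch: reusing aeon's existing MiniRocket transformer instead
# of carrying a second implementation of it in the repository.
from aeon.datasets import load_classification
from aeon.transformations.collection.convolution_based import MiniRocket

X, _ = load_classification("GunPoint")  # any equal-length dataset works
transformer = MiniRocket(random_state=0)
X_features = transformer.fit_transform(X)  # (n_cases, n_features) array
print(X_features.shape)
```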
Hi, I've been trying to reuse as much code as possible, but I’ve noticed that some of the tests are failing. I've attempted to fix the issues, but haven't had any success so far. Do you have any ideas on what might be causing the failures? Any recommendations on what I should check or approaches I could try?
@chrisholder @TonyBagnall Tests seem to be passing here and changes made, good to take a look when you have the time.
Hi, @TonyBagnall @chrisholder I've implemented all the required changes and ensured that the tests are passing successfully. Could you please review the updates and let me know if there are any additional adjustments or improvements needed?
Hi @Ramana-Raja thanks for working on this! Sorry it's taking me a while to look at this. I will look at this tomorrow (Friday) and get back to you! Hopefully we can try to merge it over this weekend/early next week!
No problem at all, and thank you for the update! Please take your time to review.
@MatthewMiddlehurst can you have a look at this error? I'm unsure why it is happening. How did you get this to work with the other feature-based clusterers?
Hi @chrisholder, I discovered that the issue was caused by how `self._random_state = check_random_state(self.random_state)` was being handled. Removing it resolved the problem, though I'm not entirely sure why. Perhaps @MatthewMiddlehurst could provide some insights. Additionally, I removed the estimator as an input for the class, as the research paper only utilized KMeans.
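For background, a small sketch of the general sklearn behaviour that can bite here (not a claim about where this PR's bug actually was):

```python
# check_random_state turns an int seed into a single, stateful RandomState.
# If fit and predict share that mutable object, each call advances the
# generator, so repeated calls can draw different random numbers — a
# common source of labels_ != predict(X) mismatches.
from sklearn.utils import check_random_state

rs = check_random_state(42)
print(rs.randint(10))  # consumes generator state
print(rs.randint(10))  # a different draw: the generator has advanced
```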
Looks good. One final thing before approving: have you checked equivalence to the original? It doesn't have to be completely 100% the same (due to random state), but we should check for 10 or so datasets (which you can load from aeon.datasets) that the scores are similar.
This doesn't need to be an actual test in aeon, but if you could run it locally and post the adjusted rand index using sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) then we can check this version matches the original!
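A sketch of the kind of local check being asked for; the `RCluster` import path is assumed from this PR's branch and the dataset names are just examples:

```python
# A quick local equivalence check, not a repo test.
from aeon.clustering import RCluster  # hypothetical/PR-specific import
from aeon.datasets import load_classification
from sklearn.metrics import adjusted_rand_score

for name in ["GunPoint", "ItalyPowerDemand", "Coffee"]:
    X, y = load_classification(name)
    pred = RCluster(random_state=0).fit_predict(X)
    print(name, adjusted_rand_score(y, pred))
```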
I tested the ARI scores between this RCluster and the original RCluster on the experimental datasets provided in the original RCluster code. Here are the results:
@chrisholder While reviewing the code, I identified an architectural issue. The R cluster implementation cannot have separate fit and predict methods because it relies on PCA for dimensionality reduction. For example, if PCA determines the optimal dimension during training to be 13, and we attempt to predict on test data with fewer than 13 dimensions, it results in an error. Even if we create a new PCA and apply fit_transform on the test data, we would need to retrain KMeans, which is what I did. However, I don't believe this is an optimal solution. Could you suggest a better approach to handle this scenario?
PCA has separate fit and transform steps, so you can PCA fit_transform in fit, save the fitted PCA, then just PCA transform in predict?
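A minimal sketch of that suggestion (illustrative names, not the PR's code):

```python
# Fit PCA once in fit, keep the fitted object, and only call transform at
# predict time, so the component count is fixed by the training data.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

class SketchRCluster:
    def fit(self, X_features):  # ROCKET-transformed training features
        self._pca = PCA(n_components=0.9).fit(X_features)
        self._kmeans = KMeans(n_clusters=2, random_state=0)
        self._kmeans.fit(self._pca.transform(X_features))
        return self

    def predict(self, X_features):  # same feature space as fit
        # No refitting here: reuse the PCA and KMeans fitted above.
        return self._kmeans.predict(self._pca.transform(X_features))
```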
If we transform test data (i.e., predict on test data via the _predict method without fitting on it) that has fewer features than PCA's n_components, it will cause an error. So we cannot predict on test data whose number of features is less than PCA's n_components.
Please respond to the reviews in text instead of just resolving them unless it is a simple change. Feel free to ask for clarification. There are still issues IMO.
Why would there be fewer features in predict?
I was considering the case where we train the clusterer on a large dataset but only need to make predictions on a smaller dataset with fewer features.
Bit confused on what you mean by fewer features. The algorithm explicitly does not allow unequal length series in the tags.
Also, this appears to be done after the ROCKET transform, so I don't think it would have a different number of features either way?
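For reference, a sketch of the aeon tag convention being referred to (the class name is illustrative, not this PR's exact tag set):

```python
# aeon estimators declare input capabilities via a class-level _tags dict;
# an estimator that only handles equal-length collections keeps this False.
class SketchRCluster:
    _tags = {
        "capability:unequal_length": False,  # equal-length series only
    }
```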
Yes, I was referring to the data after it has been transformed by ROCKET. If the number of features ends up being lower than PCA's n_components, it can cause some issues. I fixed it by training a new KMeans model, but do you have any suggestions for a better method? (It was one of the errors raised in testing, when I did not add this exception handling.)
ROCKET should not be producing a different number of features. No fitting should be going on in predict here, I feel.
Removing the fitting step in the predict function results in the following error: `ValueError: n_components=9 must be between 0 and min(n_samples, n_features)=5 with svd_solver='full'` (in testing, as you can see below). I thought this could be resolved by fitting a new KMeans model. However, if you have a better approach, could you please share it?
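A minimal reproduction of that error class, which points at PCA being refit on the (small) test set rather than reused:

```python
# Refitting PCA on 5 test samples with n_components=9 reproduces the
# reported error; a PCA fitted once on the training data avoids it.
import numpy as np
from sklearn.decomposition import PCA

X_test_features = np.random.rand(5, 100)  # 5 samples after the transform
PCA(n_components=9, svd_solver="full").fit(X_test_features)
# ValueError: n_components=9 must be between 0 and
# min(n_samples, n_features)=5 with svd_solver='full'
```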
I don't really have the capacity to debug this currently. It still looks like there are transforms being fit in predict, which again does not look correct to me. If ROCKET is producing different-sized feature vectors, that is incorrect.
@MatthewMiddlehurst I've resolved the PCA issue, and all test cases are now passing. I also added random_state to the test case default params, similar to other clustering models in aeon, to fix the test case error where estimator.labels_ was not matching estimator.predict(data). If you have some time, could you review the code and let me know if any improvements are needed?
Hi @MatthewMiddlehurst , just checking in, kindly following up on this PR when you have a moment.
Hi, this is a complex PR and the project is currently very busy, so it is unlikely this will be merged soon. I have left a few comments, but I don't imagine they will be the last.
I see you linked some results above at some point, but this seems to be for just one of the estimators? I'm not sure whether those are from this implementation or the original code. One of the things I am going to ask for before merging is a comparison of both this and the original, so that is also something you can do.
The results were comparing how accurate or similar the original implementation is to this one. And for some reason, if I don't set the random state, the test case fails at `assert np.array_equal(estimator.labels_, estimator.predict(data))`; I am not sure why. By the way, thanks for taking the time to review this.
I see the image you posted, but it only has one column of ARI scores. I'm not sure what those scores are for: the clusters produced by your implementation or by the original. We would want them for both so we can compare.
Test failure seems legit. `labels_` and the predict output should be the same.
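For reference, a sketch of the invariant behind the failing test; the estimator import and dataset are illustrative:

```python
# After fit, the stored training labels must match predict on the same
# data for a fixed random_state; the aeon test suite enforces this.
import numpy as np
from aeon.clustering import RCluster  # hypothetical/PR-specific import
from aeon.datasets import load_classification

X, _ = load_classification("GunPoint")
clusterer = RCluster(random_state=0)
clusterer.fit(X)
assert np.array_equal(clusterer.labels_, clusterer.predict(X))
```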