aeon
[MNT] Update similarity search with new base classes : Query Search
Reference Issues/PRs
Part of #1243
What does this implement/fix? Explain your changes.
As described in #1243, this is a first PR that implements base classes for the query search task and moves the existing code under this new submodule.
Remaining TODOs:
- [ ] Adapt distance profile function for unequal length
- [ ] Add tests for unequal length for both dummy and top-k
- [ ] Fix missing docstrings
- [ ] Add typing to function/class parameters
PR checklist
For new estimators and functions
- [X] I've added the estimator to the online API documentation.
- [X] (OPTIONAL) I've added myself as a `__maintainer__` at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
Thank you for contributing to aeon
I have added the following labels to this PR based on the title: [ maintenance ]. I have added the following labels to this PR based on the changes made: [ examples, similarity search ]. Feel free to change these if they do not properly represent the PR.
The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.
If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.
Don't hesitate to ask questions on the aeon Slack channel if you have any.
Any reason for the new base class direction? Do the other base classes not have much shared functionality?
> Any reason for the new base class direction? Do the other base classes not have much shared functionality?
Yeah, after doing some pen and paper work for the other two classes, the base class for similarity search would be pretty much empty. For example, index search will actually do things differently during fit, as it needs to build a similarity model, while series search in its most naive form (without computational optimisations like MP) is simply looping over a query search for all possible candidates. The computational optimisations also require some rethinking of fit compared to query search. So if we consider the three submodules, at least for now, I don't see a use for a BaseSimilaritySearch class. It's still possible to refactor afterwards if we find a good reason to.
On second thought, we could call the preprocessing function of the CollectionEstimator on the 3D data given to the fit method in a BaseSimilaritySearch, but that's about it... Would that be better for structure's sake?
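To make the "series search as a loop over query search" remark concrete, here is a minimal sketch. It assumes univariate series and plain Euclidean distance; `naive_query_search` and `naive_series_search` are hypothetical names for illustration, not functions from aeon.

```python
import numpy as np


def naive_query_search(series, query):
    """Return (index, distance) of the subsequence of `series` closest to `query`."""
    series = np.asarray(series, dtype=float)
    query = np.asarray(query, dtype=float)
    # All subsequences of `series` with the same length as the query
    windows = np.lib.stride_tricks.sliding_window_view(series, len(query))
    # Distance profile: Euclidean distance from the query to every subsequence
    profile = np.linalg.norm(windows - query, axis=1)
    best = int(np.argmin(profile))
    return best, float(profile[best])


def naive_series_search(series_a, series_b, length):
    """For each subsequence of `series_a`, run a query search in `series_b`."""
    series_a = np.asarray(series_a, dtype=float)
    return [
        naive_query_search(series_b, series_a[i : i + length])
        for i in range(len(series_a) - length + 1)
    ]


matches = naive_series_search([1, 2, 3, 4], [0, 1, 2, 3, 9], length=2)
```

Each entry of `matches` is the best match in the second series for one subsequence of the first. An optimised version would avoid recomputing each distance profile from scratch, which is why it would need a different fit.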
Couple of things to consider, but of course this can change later, as you say.
If there are a significant number of shared parameters/attributes/functions, then it may be a good idea to keep it, even if it's mostly abstract methods.
Also, there may be some situations where you want to use isinstance to cover all of them? Maybe not, I have not thought about it that much 🙂.
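As a rough illustration of both points (shared parameters living in a mostly abstract base class, and a single isinstance check covering every search type), here is a sketch. The class bodies, the `distance` and `normalize` parameters, and the `SeriesSearch` name are assumptions for the example, not the actual aeon code.

```python
from abc import ABC, abstractmethod


class BaseSimilaritySearch(ABC):
    """Hypothetical shared base: holds common parameters, leaves the logic abstract."""

    def __init__(self, distance="euclidean", normalize=False):
        self.distance = distance
        self.normalize = normalize

    @abstractmethod
    def fit(self, X):
        """Store or index the data to search in."""

    @abstractmethod
    def predict(self, X):
        """Return the matches for X."""


class QuerySearch(BaseSimilaritySearch):
    def fit(self, X):
        self.X_ = X
        return self

    def predict(self, X):
        return []  # placeholder: the real search logic would go here


class SeriesSearch(BaseSimilaritySearch):
    def fit(self, X):
        self.X_ = X
        return self

    def predict(self, X):
        return []  # placeholder


estimators = [QuerySearch(), SeriesSearch(normalize=True)]
# One check covers every similarity search estimator
all_similarity_search = all(isinstance(e, BaseSimilaritySearch) for e in estimators)
```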
> If there are a significant number of shared parameters/attributes/functions used then it may be a good idea to keep even if its mostly abstract methods.
For some reason, I forgot to consider this... I'm a bit out of touch today! Swapping the structure to add it back with the adjustment.
I noticed an issue with the base class structure in the case of #1311, where the optimisation relies on lower bounding and does not return a fully computed distance profile, so I'll revamp it to be usable for all types of optimisations.
Sorry for the mess in this PR... To summarize:
The previous BaseQuerySearch class was made to allow any matching condition on the output of the similarity search (e.g. top-k matches, all matches below a distance threshold, ...). In practice, there are only a few plausible conditions: top-k and/or threshold, and worst-k and/or threshold. So I just made a QuerySearch estimator with k, threshold and inverse_distance parameters that cover all these cases.
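A minimal sketch of how these three parameters could cover the listed conditions, applied to an already computed distance profile. `select_matches` is a hypothetical helper, and treating the threshold as an upper bound for top-k and a lower bound for worst-k is an assumption of this example, not necessarily what the estimator does.

```python
import numpy as np


def select_matches(distance_profile, k=1, threshold=None, inverse_distance=False):
    """Hypothetical helper: pick candidate indexes from a 1D distance profile."""
    profile = np.asarray(distance_profile, dtype=float)
    # Sort best-first: smallest distances for top-k, largest for worst-k
    order = np.argsort(-profile if inverse_distance else profile, kind="stable")
    if threshold is not None:
        # Assumption: upper bound on distance for top-k, lower bound for worst-k
        if inverse_distance:
            order = order[profile[order] >= threshold]
        else:
            order = order[profile[order] <= threshold]
    return order[:k]


profile = [0.9, 0.1, 0.5, 0.3]
top_2 = select_matches(profile, k=2)                            # indexes 1 and 3
top_3_below = select_matches(profile, k=3, threshold=0.4)       # only 1 and 3 qualify
worst_2 = select_matches(profile, k=2, inverse_distance=True)   # indexes 0 and 2
```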
Additionally, there was the problem of computational optimisations that only compute part of the distance profile (e.g. DTW lower bounding). These optimisations need to know the matching condition (top-k, ...) to work. The previous structure would not have allowed that, as the distance profile computations happened in the BaseQuerySearch class, which didn't have access to the matching condition.
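To illustrate why such an optimisation needs the matching condition, here is a sketch of a top-k search with lower-bound pruning. It does not use DTW: it uses Euclidean distance with the reverse triangle inequality, |‖a‖ − ‖b‖| ≤ ‖a − b‖, as a stand-in lower bound. The function name and structure are hypothetical, not the #1311 implementation.

```python
import heapq

import numpy as np


def top_k_with_lower_bound(series, query, k=1):
    """Hypothetical sketch: top-k search that skips candidates via a lower bound.

    Returns the matching indexes (best first) and the number of full
    distance computations that were actually performed.
    """
    series = np.asarray(series, dtype=float)
    query = np.asarray(query, dtype=float)
    m = len(query)
    q_norm = np.linalg.norm(query)
    heap = []  # max-heap on distance, stored as (-distance, index)
    n_full = 0
    for i in range(len(series) - m + 1):
        candidate = series[i : i + m]
        # Lower bound on the Euclidean distance (reverse triangle inequality)
        lower_bound = abs(np.linalg.norm(candidate) - q_norm)
        # Pruning depends on the current k-th best distance, hence on k
        if len(heap) == k and lower_bound >= -heap[0][0]:
            continue
        dist = np.linalg.norm(candidate - query)
        n_full += 1
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, i))
    best_first = sorted((-neg_dist, i) for neg_dist, i in heap)
    return [i for _, i in best_first], n_full


series = [0, 0, 0, 5, 5, 5, 0, 0, 1, 2, 3]
indexes, n_full = top_k_with_lower_bound(series, [1, 2, 3], k=1)
```

The pruning test compares the bound against the k-th best distance found so far, so the code computing distances has to know `k`. Pruned candidates never get a distance, so the distance profile is only partially filled. Here the bound costs about as much as the distance itself, so the sketch shows the control flow only; the gain comes when the full distance is much more expensive than the bound, as with DTW.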