aeon
aeon copied to clipboard
[ENH] BroadcasterTransformer to broadcast a series transformer over a collection
Currently, BaseTransformers can handle both single series and collections by internally "Vectorize" (i.e. broadcast) across channels and instances depending on the input data. This is legacy code which to me very confusing, involves huge amounts of redundant checks and is not maintainable by the current team. We should imo move away from this model.
This is a quick mock up of my favoured solution, for comment. I've not tested it properly yet, putting it up before I go too far down the line. Basically have a BroadcastTransformer that we pass a BaseSeriesTransformer which is applied over all instances in a collection.
Main points
- It takes all its tags from the series transformer. These should then be checked in the base class
- It does not broadcast over channels. If the SeriesTransformer cannot handle multivariate, nor can the Broadcaster.
- If fit is empty is true in the SeriesTransformer, then Broadcaster only uses a single instance of the SeriesTransformer. If fit is empty is false, it clones it n_cases times and stores the transformers
- it calls the private fit/transform of the SeriesTransformer. The assumption is that if it passes checks in the BaseCollectionTransformer, and the tags are taken from SeriesTransformer, then this should be fine. I think this should be true. If its not, we should make it so.
Thank you for contributing to aeon
I have added the following labels to this PR based on the title: [ $\color{#FEF1BE}{\textsf{enhancement}}$ ]. I would have added the following labels to this PR based on the changes made: [ $\color{#41A8F6}{\textsf{transformations}}$ ], however some package labels are already present.
The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.
If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.
Don't hesitate to ask questions on the aeon Slack channel if you have any.
A drawback of our current approach is that a transformer that needs to be fitted could only be broadcasted to a set of samples from the same size. For example if we had X1.shape = (5, 1, 100) used for fit and then transform, a set X2.shape = (10, 1, 100) could not be transformed, as we stored the 5 transformer fitted on X1.
But I don't know if this would ever happen considering we are broadcasting a single series transformer, which by essence (if fit is not empty) fits and transform one series.
Here is a draft for using joblib, need to add some tests and sort out tags for input format, as for now the example using MatrixProfileSeriesTransformer fails because it didn't convert the series to univariate as expected
To-dos left for this PR before it is ready for review:
- [x] Fix broadcaster test, seem to be failing some check in cases where
fit_empty=Truein the broadcasted series transformer - [x] Add new broadcaster transformer to the docs
- [x] Add an example notebook on how to use the broadcaster for different use cases
- [x] Add type hints when not a union of lot of possible input types
Check out this pull request on ![]()
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
included the coverage tag, just to make sure most of it is happening correctly
Thanks for reviewing ! Concerning your questions:
So is the essentially intended to be the parallel of https://github.com/aeon-toolkit/aeon/blob/main/aeon/transformations/collection/_collection_wrapper.py?
Isn't _collection_wrapper.py a Collection to Series while the introduction of this PR is a Series to Collection ? (In the Collection to Series cases, there is no need for parallel I think ?)
I think we should attempt to unify design a bit for these, the collection version currently does not use the series transformer base class. Don't have to change the collection version in this PR of course. I prefer
CollectionToSeriesWrapperor similar to the current name.
I agree, both Collection to Series and Series to Collection transformers should be close to each other.
So It would be SeriesToCollectionWrapper and CollectionToSeriesWrapper (for _collection_wrapper.py, already implemented)
Some notes on the updates :
- Removed axis from Matrix Profile as it is expecting univariate data
- Added data generation for 1D/2D series in testing utils
- Added tests for BaseSeriesTransformer
- Removed axis as input parameter for Dummies, this parameter should stay fixed as it represent the expected shape for the implemented methods
- Modified dummy to affect random value per channel to test axis conversion
- Renamed Broadcaster to SeriesToCollectionWrapper
- Return the transformed data to the shape of the specified axis parameter in transform
A drawback of our current approach is that a transformer that needs to be fitted could only be broadcasted to a set of samples from the same size. For example if we had
X1.shape = (5, 1, 100)used for fit and then transform, a setX2.shape = (10, 1, 100)could not be transformed, as we stored the 5 transformer fitted onX1. But I don't know if this would ever happen considering we are broadcasting a single series transformer, which by essence (if fit is not empty) fits and transform one series.
I think if the transformer has a _fit function, then it needs to have been called on any case passed. I think calling transform on its own is not allowed, since is_fitted will be false
taking another look here, I'll take a look at tags. Firstly though, I'm not sure about the name series_wrapper. To me at least, a wrapper contains a single object, whereas this broadcasts. Same with class name SeriesToCollectionWrapper, it seems to be doing to much in the name. Wrappers to me are simply containers for another object. This comment has " "Wrap a single series transformer over a collection." Which doesnt read properly to me. I prefer """Broadcast a single series transformer over a collection."
looking at the tags, I think the only ones we dont want to take from the series transformer are
_tags = { "input_data_type": "Collection", "output_data_type": "Collection", "X_inner_type": "numpy3D", # or whatever the subclass sets it to "capability:multithreading": False,# or whatever the subclass sets it to
the other tags set in BaseCollectionTransformer are
# tag values specific to CollectionTransformers
_tags = {
"input_data_type": "Collection",
"output_data_type": "Collection",
"fit_is_empty": False,
"requires_y": False,
"capability:inverse_transform": False,
}
and in BaseCollectionEstimator are _tags = { "capability:multivariate": False, "capability:unequal_length": False, "capability:missing_values": False, "X_inner_type": "numpy3D", "capability:multithreading": False, "python_version": None, # PEP 440 python version specifier to limit versions } I think all of these can be taken from the SeriesTransformer, it if has them. I cant think of a case when we would want them different in the collection. Except multithreading? Not looked at that yet. I would prefer to exclude these than maintain a list of _tags_to_inherit, which seems fragile to me. If a series transformer has a tag specific to them, there is no need for the Broadcaster to know about it. It does need to know the tags it insists on though
also, we enforce calling fit by setting is fitted to true in the base class fit not the constructor
so I think we have some issues with threading about cloning/not cloning and the old problem of where to set n_jobs. I propose that we get it in without threading first, then raise an issue and describe the problems @baraline @MatthewMiddlehurst
Agreed, then we can benchmark it to select the best solution, I'll try to make the modification tomorrow if you didn't do it before me !
I'll push some changes, but they are open for discussion
there is in fact a problem calling _fit, since the X_inner_type of collections and series are fundamentally different, will unfortunately have to call base class fit
replaced with #1632