scikit-tree
Implementation of general MORF trees (Gabor, Fourier, Gaussian, and generic kernels)
Basic MORF implementation
Need to implement:
- Cython splitter: Patch splitter (discontiguous and contiguous)
- Cython tree: I think we don't even need a new Cython tree. We can just use the ObliqueTree class
- Python interface: Implement a new Python API for morf
The main difference in the MORF splitter is that we need to take into account the shape of the input dataset. The hardest part to implement is the splitter itself: we just need to replicate the splitter in the MORF paper (patch weights of 1). Once we have that, the next goal will be a more exotic splitter, where the weights and the shape of the patch come from some kernel function (e.g. Gabor).
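As a concreteness check, here is a minimal NumPy sketch of what a weights-of-1 patch projection looks like on a single flattened 2D sample. This only illustrates the math, not the eventual Cython splitter; the function and argument names are made up.

```python
import numpy as np

def patch_feature(x_flat, image_shape, top_left, patch_shape):
    """Project one flattened sample onto a contiguous patch with weights of 1.

    x_flat      : (n_features,) flattened 2D sample
    image_shape : (height, width) of the original image
    top_left    : (row, col) seed point of the patch
    patch_shape : (patch_height, patch_width)
    """
    image = x_flat.reshape(image_shape)
    r, c = top_left
    ph, pw = patch_shape
    # Weights of 1: the candidate split feature is just the patch sum.
    return image[r:r + ph, c:c + pw].sum()

# Example: one 8x8 "image" flattened to 64 features.
rng = np.random.default_rng(0)
x = rng.normal(size=64)
print(patch_feature(x, image_shape=(8, 8), top_left=(2, 3), patch_shape=(3, 3)))
```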
Advanced MORF implementation
I think the first thing we should try is the Gabor tree. To my knowledge there is no robust C++/C/Cython implementation of Gabor filter generation. Therefore, what we can do is generate a "filter bank" in Python using scikit-image and then pass this filter bank from Python into Cython via read-only memoryviews.
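A rough sketch of what generating such a filter bank could look like on the Python side with `skimage.filters.gabor_kernel`; the particular parameter grid is just an illustration:

```python
import numpy as np
from skimage.filters import gabor_kernel

# Build a small Gabor filter bank by sweeping a grid of parameters.
# The specific frequencies/orientations here are only an example.
filter_bank = []
for frequency in (0.1, 0.25, 0.4):
    for theta in np.linspace(0, np.pi, 4, endpoint=False):
        kernel = np.real(gabor_kernel(frequency, theta=theta))
        filter_bank.append(kernel.astype(np.float64))

# Kernels differ in shape, so the bank is a Python list of 2D arrays.
print([k.shape for k in filter_bank])
```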
The open technical coding question is how to do this efficiently. You cannot pass a list of ndarrays, as this is not supported in Cython. Several possible solutions are floated here: https://stackoverflow.com/questions/50439908/array-of-memoryviews-in-cython (one of them is sketched after the links below).
Ref: https://scikit-image.org/docs/stable/auto_examples/features_detection/plot_gabor.html
Previous discussion: https://github.com/NeuroDataDesign/manifold_random_forests/issues/3
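One option floated in those threads is to avoid passing a list altogether: zero-pad all kernels into a single contiguous 3D array (plus an array of per-kernel shapes), so that a single read-only typed memoryview can cross into Cython. A sketch of the Python-side packing, with illustrative names:

```python
import numpy as np

def pack_filter_bank(filter_bank):
    """Pack a list of variable-size 2D kernels into one contiguous 3D array.

    Returns
    -------
    packed : (n_kernels, max_h, max_w) float64 array, zero-padded
    shapes : (n_kernels, 2) intp array with each kernel's true (h, w)
    """
    shapes = np.array([k.shape for k in filter_bank], dtype=np.intp)
    max_h, max_w = shapes.max(axis=0)
    packed = np.zeros((len(filter_bank), max_h, max_w), dtype=np.float64)
    for i, kernel in enumerate(filter_bank):
        h, w = kernel.shape
        packed[i, :h, :w] = kernel
    # Mark read-only so Cython can take `const double[:, :, ::1]` views.
    packed.setflags(write=False)
    shapes.setflags(write=False)
    return packed, shapes
```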
Misc thoughts
Rather than applying random projections using a kernel, we could also take each randomly chosen kernel and apply an entire convolution to the data (https://github.com/dagss/cython-tutorial/blob/master/examples/convolve1/convolve.py). This would result in a new representation of the data for each kernel and could be used to determine the "best kernels": for example, take the resulting convolution outputs and sort them based on which one produces the best criterion metric (i.e. differentiation among the classes). This is more expensive because the criterion needs to loop over a bigger vector, but it may be worth it for the sake of choosing the "best kernels".
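A rough sketch of that idea, using `scipy.ndimage.convolve` rather than the linked Cython convolution for brevity, with a toy between-class separation score standing in for the real impurity criterion (all names here are illustrative):

```python
import numpy as np
from scipy.ndimage import convolve

def rank_kernels(images, labels, filter_bank):
    """Score each kernel by how well its convolution output separates classes.

    images      : (n_samples, height, width) array
    labels      : (n_samples,) binary class labels
    filter_bank : list of 2D kernels
    Returns kernel indices sorted from best to worst.
    """
    scores = []
    for kernel in filter_bank:
        # Convolve every image with this kernel and summarize each output.
        feats = np.array([convolve(img, kernel, mode="nearest").mean()
                          for img in images])
        # Toy criterion: standardized gap between class means
        # (a real implementation would reuse the tree's impurity criterion).
        gap = abs(feats[labels == 0].mean() - feats[labels == 1].mean())
        scores.append(gap / (feats.std() + 1e-12))
    return np.argsort(scores)[::-1]
```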
Another approach for making the implementation fast/robust enough is to somehow create a dependency on openCV and leverage their C++ code.
@jovo the pseudocode I am thinking of to implement a Gabor tree is as follows (a rough Python sketch of the per-node loop follows the list). Feedback, and suggestions for who might be able to assist with the technical details, would be helpful.
- define a grid/range of parameters for a certain filter (i.e. Gabor)
- sample these filters to construct our filter bank
- at each node, compute split candidates by:
  - computing a random (x, y) location on the data (the "data patch")
  - applying a random filter from the filter bank onto the data patch
  - storing the feature value
  - repeating for `mtry` candidates
- pick the best split based on some criterion, storing the (x, y) location and the filter from the filter bank
- repeat and build the tree
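Here is the rough Python sketch of the per-node candidate loop mentioned above; the bounds handling, criterion hand-off, and names are placeholders, not the eventual Cython splitter:

```python
import numpy as np

def sample_split_candidates(X_node, image_shape, filter_bank, mtry, rng):
    """Draw `mtry` (location, kernel) candidates and their feature values.

    X_node : (n_node_samples, n_features) flattened samples at this node
    Returns a list of (top_left, kernel_index, feature_values) tuples;
    the best one would then be chosen by the split criterion.
    """
    height, width = image_shape
    images = X_node.reshape(-1, height, width)
    candidates = []
    for _ in range(mtry):
        k = rng.integers(len(filter_bank))      # random filter from the bank
        kh, kw = filter_bank[k].shape
        r = rng.integers(height - kh + 1)       # random (x, y) seed, kept
        c = rng.integers(width - kw + 1)        #   inside the image bounds
        patch = images[:, r:r + kh, c:c + kw]   # data patch for every sample
        feats = (patch * filter_bank[k]).sum(axis=(1, 2))
        candidates.append(((r, c), k, feats))
    return candidates
```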
The limitations of the approach:
- picking a random (x,y) location on the data: do we stay away from the edges of the image?
- I know of no good way to generate the filters within Cython on the fly, so these need to somehow be passed in by reference from Python. This is going to be difficult to implement.
An API we can take is:
- generate a random (large) kernel library for each tree within `_build_tree` at the Python level
- pass this list of numpy arrays to the Cython splitter. See [1]_.
- The Cython splitter holds onto this list of numpy arrays using a C++ vector and maintains pointers to the actual data.
- The Cython splitter applies these at random at each split node, `mtry` times, and proceeds as normal. The rest of the tree building is essentially the same.
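To make the Python-side flow concrete, a sketch of the per-tree kernel library generation that a `_build_tree`-style function could do before handing off to the splitter; `GaborPatchSplitter` and the parameter ranges are hypothetical, not an existing scikit-tree API:

```python
import numpy as np
from skimage.filters import gabor_kernel

def make_kernel_library(n_kernels, rng):
    """Sample a per-tree library of Gabor kernels with random parameters."""
    library = []
    for _ in range(n_kernels):
        frequency = rng.uniform(0.05, 0.45)
        theta = rng.uniform(0, np.pi)
        library.append(np.real(gabor_kernel(frequency, theta=theta)))
    return library

# Inside a _build_tree-style function, each tree gets its own library, which
# is then handed to the (hypothetical) Cython splitter that keeps pointers to
# the underlying buffers in a C++ vector.
rng = np.random.default_rng(0)
kernel_library = make_kernel_library(n_kernels=64, rng=rng)
# splitter = GaborPatchSplitter(kernel_library, ...)  # hypothetical API
```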
This basically means we have to do a lot of kernel library generation before Cython is called. Assuming those generation functions are fast, that shouldn't be too big of an issue. It will probably incur a large RAM cost, though that is unavoidable for now.
.. [1] https://stackoverflow.com/questions/46240207/passing-list-of-numpy-arrays-to-c-using-cython
To implement this, we want a Cython splitter that takes in:
- a list of numpy arrays of shape (n_patch, n_patch), which are the "kernels" that weight the image patches they are applied to
- discontiguous dimensions
Constraints:
- it only works on 2D inputs for now, but can follow the n-dimensional API that PatchObliqueSplitter has
- the splitter simply samples a top-left seed point and applies each kernel to the resulting patch; the top-left seed can be sampled in the same way the existing PatchObliqueSplitter does it: either wrapped or bounded (sketched after this list)
- the splitter saves the index into the list of kernels that it uses
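For the seed-point sampling (and the edge question above), a sketch of what "bounded" vs. "wrapped" could mean; this assumes these are the same two modes PatchObliqueSplitter exposes, and the exact semantics there may differ:

```python
import numpy as np

def sample_patch_indices(image_shape, patch_shape, rng, boundary="bounded"):
    """Sample a top-left seed and return the flat indices the patch covers.

    "bounded": the seed is drawn so the patch lies fully inside the image.
    "wrapped": the seed can be anywhere and indices wrap around the edges.
    """
    height, width = image_shape
    ph, pw = patch_shape
    if boundary == "bounded":
        r = rng.integers(height - ph + 1)
        c = rng.integers(width - pw + 1)
        rows = np.arange(r, r + ph)
        cols = np.arange(c, c + pw)
    else:  # "wrapped"
        r = rng.integers(height)
        c = rng.integers(width)
        rows = (r + np.arange(ph)) % height
        cols = (c + np.arange(pw)) % width
    # Flattened feature indices of the (possibly discontiguous) patch.
    return (rows[:, None] * width + cols[None, :]).ravel()
```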
ROCKET (https://github.com/angus924/rocket), MiniRocket (https://github.com/angus924/minirocket/tree/main/code), and the related papers would be interesting to compare.
Still interested in this. As a next step, we start with a prior distribution over atoms in the dictionary; as we add trees (e.g., a batch of 100), we update the prior with feature importance scores and then build another batch. Once we have enough trees, we start discarding the original ones whenever we build a new batch.
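A very rough sketch of that batched scheme; the `build_batch` callback, batch size, tree cap, and the way the prior is renormalized are all placeholders, purely to make the loop concrete:

```python
import numpy as np

def grow_with_prior(build_batch, n_batches, n_atoms,
                    batch_size=100, max_trees=500):
    """Grow a forest in batches while adapting a prior over dictionary atoms.

    `build_batch(prior, batch_size)` is assumed to return (trees, importances),
    where `importances` is an (n_atoms,) score of how useful each atom was.
    """
    prior = np.full(n_atoms, 1.0 / n_atoms)   # start from a uniform prior
    forest = []
    for _ in range(n_batches):
        trees, importances = build_batch(prior, batch_size)
        forest.extend(trees)
        # Re-weight the prior by this batch's feature-importance scores.
        prior = prior * (importances + 1e-12)
        prior /= prior.sum()
        # Once the forest is large enough, discard the oldest trees
        # whenever a new batch is added.
        if len(forest) > max_trees:
            forest = forest[batch_size:]
    return forest, prior
```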