linfa
Feat: implement Random Projections
This PR implements random projection techniques for dimensionality reduction, as found in the sklearn.random_projection module of scikit-learn.
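For context, a Gaussian random projection maps n-dimensional points to k dimensions by multiplying with a random matrix of N(0, 1/k) entries, which approximately preserves pairwise distances (the Johnson-Lindenstrauss lemma). The following is a minimal, self-contained sketch of that idea, not the linfa API; the tiny LCG random generator and the `gaussian_project` helper are illustrative stand-ins.

```rust
// Tiny deterministic LCG + Box-Muller, a stand-in for a real RNG
// (linfa uses a proper generator such as Xoshiro256Plus).
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }

    // Box-Muller transform: two uniforms -> one standard normal draw.
    fn next_normal(&mut self) -> f64 {
        let u1 = self.next_f64().max(1e-12);
        let u2 = self.next_f64();
        (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos()
    }
}

/// Project each point (row) from its original dimension down to `k` dims
/// using a k x n matrix of N(0, 1/k) entries.
fn gaussian_project(points: &[Vec<f64>], k: usize, rng: &mut Lcg) -> Vec<Vec<f64>> {
    let n = points[0].len();
    let scale = 1.0 / (k as f64).sqrt();
    let proj: Vec<Vec<f64>> = (0..k)
        .map(|_| (0..n).map(|_| scale * rng.next_normal()).collect())
        .collect();
    points
        .iter()
        .map(|p| {
            proj.iter()
                .map(|row| row.iter().zip(p).map(|(a, b)| a * b).sum())
                .collect()
        })
        .collect()
}

fn main() {
    // Four 100-dimensional points, reduced to 10 dimensions.
    let points: Vec<Vec<f64>> = (0..4)
        .map(|i| (0..100).map(|j| ((i * 100 + j) % 7) as f64).collect())
        .collect();
    let mut rng = Lcg(42);
    let reduced = gaussian_project(&points, 10, &mut rng);
    assert_eq!(reduced.len(), 4);
    assert!(reduced.iter().all(|p| p.len() == 10));
    println!("projected {} points to {} dims", reduced.len(), reduced[0].len());
}
```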
Codecov Report
Attention: Patch coverage is 8.79121% with 83 lines in your changes missing coverage. Please review.
Project coverage is 35.87%. Comparing base (4e40ce6) to head (6b9c2a4).
Additional details and impacted files
@@ Coverage Diff @@
## master #332 +/- ##
==========================================
- Coverage 36.18% 35.87% -0.32%
==========================================
Files 92 96 +4
Lines 6218 6303 +85
==========================================
+ Hits 2250 2261 +11
- Misses 3968 4042 +74
I've done a quick review; this looks good to me, but have requested @bytesnake also give it a look as he is probably more familiar with the algorithm side of things.
Thank you for reviewing, @relf @quietlychris.
Thank you for the reviews, and @relf for the suggestions, I have implemented them.
Changes:
- The `rng` field for both random projection structs is no longer optional; it defaults to `Xoshiro256Plus` with a fixed seed if not provided by the user.
- Renamed the `precision` parameter to `eps`, as increasing this parameter results in a lower-dimension embedding and often yields lower classification performance.
- Added a check that the projection reduces the dimension of the data, returning an error otherwise, with tests for this behavior.
- Added a test for the function that computes the embedding dimension from a given epsilon, against values from scikit-learn.
- Fixed a reference issue in the docs and other minor issues.
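The epsilon-to-dimension computation mentioned above comes from the Johnson-Lindenstrauss lemma. As a sketch (the function name here is illustrative, not linfa's), scikit-learn's `johnson_lindenstrauss_min_dim` uses k >= 4 ln(n) / (eps²/2 − eps³/3), which shows why a larger `eps` (more tolerated distortion) gives a smaller embedding dimension:

```rust
/// Minimum embedding dimension guaranteeing pairwise-distance distortion
/// within a factor of (1 ± eps), per the Johnson-Lindenstrauss lemma.
/// Mirrors scikit-learn's `johnson_lindenstrauss_min_dim` (truncating
/// the result to an integer).
fn jl_min_dim(n_samples: usize, eps: f64) -> usize {
    assert!(eps > 0.0 && eps < 1.0, "eps must be in (0, 1)");
    let denom = eps.powi(2) / 2.0 - eps.powi(3) / 3.0;
    (4.0 * (n_samples as f64).ln() / denom) as usize
}

fn main() {
    // These match scikit-learn's documented example values.
    assert_eq!(jl_min_dim(1_000_000, 0.1), 11841);
    assert_eq!(jl_min_dim(1_000_000, 0.5), 663);
    println!(
        "n=1e6: eps=0.1 -> {}, eps=0.5 -> {}",
        jl_min_dim(1_000_000, 0.1),
        jl_min_dim(1_000_000, 0.5)
    );
}
```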
Thanks for your contribution and the changes. Now, the Gaussian and sparse random projection codes look very alike; I am wondering if you could not refactor even further by using zero-sized types and a unique `RandomProjection` generic type, something like:

```rust
struct Gaussian;
struct Sparse;

pub struct RandomProjectionValidParams<RandomMethod, R: Rng + Clone> {
    pub params: RandomProjectionParamsInner,
    pub rng: Option<R>,
    pub method: std::marker::PhantomData<RandomMethod>,
}

pub struct RandomProjectionParams<RandomMethod, R: Rng + Clone>(
    pub(crate) RandomProjectionValidParams<RandomMethod, R>,
);

pub struct RandomProjection<RandomMethod, F: Float> {
    projection: Array2<F>,
    method: std::marker::PhantomData<RandomMethod>,
}

pub type GaussianRandomProjection<F> = RandomProjection<Gaussian, F>;
pub type SparseRandomProjection<F> = RandomProjection<Sparse, F>;

impl<F, Rec, T, R> Fit<Rec, T, ReductionError> for RandomProjectionValidParams<Gaussian, R>
where
    F: Float,
    Rec: Records<Elem = F>,
    StandardNormal: Distribution<F>,
    R: Rng + Clone,
{
    type Object = RandomProjection<Gaussian, F>;

    fn fit(&self, dataset: &linfa::DatasetBase<Rec, T>) -> Result<Self::Object, ReductionError> {...}
}

impl<F, Rec, T, R> Fit<Rec, T, ReductionError> for RandomProjectionValidParams<Sparse, R>
where
    F: Float,
    Rec: Records<Elem = F>,
    StandardNormal: Distribution<F>,
    R: Rng + Clone,
{
    type Object = RandomProjection<Sparse, F>;

    fn fit(&self, dataset: &linfa::DatasetBase<Rec, T>) -> Result<Self::Object, ReductionError> {...}
}
...
```

What do you think?
I think that's a very good suggestion; it will be easier to maintain than the previous approach, which used a macro to avoid code duplication. 6b9c2a4 implements a variation of this idea: all the logic has been refactored, and behavior depending on the projection method has been encapsulated in the `ProjectionMethod` trait. This also makes implementing other projection methods significantly easier.
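For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the zero-sized-type + trait encapsulation described in this thread. Only the `ProjectionMethod` name comes from the comment above; the method signatures, the `Vec`-based matrices, and the placeholder bodies are illustrative assumptions, not linfa's actual implementation:

```rust
use std::marker::PhantomData;

// Each projection method is a zero-sized type that only supplies the
// logic for building its projection matrix; the shared fit/transform
// machinery lives once on the generic `RandomProjection`.
trait ProjectionMethod {
    // Build a k x n projection matrix (here a plain Vec of rows).
    fn generate(k: usize, n: usize) -> Vec<Vec<f64>>;
}

struct Gaussian;
struct Sparse;

impl ProjectionMethod for Gaussian {
    fn generate(k: usize, n: usize) -> Vec<Vec<f64>> {
        // Placeholder: a real implementation draws N(0, 1/k) entries.
        vec![vec![1.0 / (k as f64).sqrt(); n]; k]
    }
}

impl ProjectionMethod for Sparse {
    fn generate(k: usize, n: usize) -> Vec<Vec<f64>> {
        // Placeholder: a real implementation draws sparse +/- entries.
        vec![vec![0.0; n]; k]
    }
}

struct RandomProjection<M: ProjectionMethod> {
    projection: Vec<Vec<f64>>,
    method: PhantomData<M>,
}

impl<M: ProjectionMethod> RandomProjection<M> {
    fn fit(k: usize, n: usize) -> Self {
        Self {
            projection: M::generate(k, n),
            method: PhantomData,
        }
    }

    fn target_dim(&self) -> usize {
        self.projection.len()
    }
}

fn main() {
    let g = RandomProjection::<Gaussian>::fit(10, 100);
    let s = RandomProjection::<Sparse>::fit(5, 100);
    assert_eq!(g.target_dim(), 10);
    assert_eq!(s.target_dim(), 5);
    println!("gaussian k={}, sparse k={}", g.target_dim(), s.target_dim());
}
```

Because `PhantomData<M>` is zero-sized, the two struct instantiations have identical memory layout; the method choice exists only at the type level, so adding a new projection method means writing one small `impl ProjectionMethod` block.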