
Feat: implement Random Projections

GBathie opened this pull request 1 year ago • 5 comments

This PR implements random projection techniques for dimensionality reduction, as seen in the sklearn.random_projection module of scikit-learn.
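For context, the core idea is to multiply the data by a random matrix, which reduces dimensionality while approximately preserving pairwise distances (Johnson-Lindenstrauss lemma). A minimal conceptual sketch using ndarray and ndarray-rand, illustrating the technique only and not the API added in this PR:

use ndarray::Array2;
use ndarray_rand::rand_distr::StandardNormal;
use ndarray_rand::RandomExt;

fn main() {
    let (n_samples, n_features, n_components) = (100, 1_000, 50);
    // Data matrix X of shape (n_samples, n_features); random here just for illustration.
    let x: Array2<f64> = Array2::random((n_samples, n_features), StandardNormal);
    // Gaussian projection matrix with entries drawn from N(0, 1 / n_components),
    // so that squared distances are preserved in expectation.
    let proj: Array2<f64> =
        Array2::random((n_features, n_components), StandardNormal) / (n_components as f64).sqrt();
    // Embedded data of shape (n_samples, n_components).
    let embedded = x.dot(&proj);
    println!("{:?}", embedded.dim());
}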

GBathie avatar Feb 17 '24 09:02 GBathie

Codecov Report

Attention: Patch coverage is 8.79121%, with 83 lines in your changes missing coverage. Please review.

Project coverage is 35.87%. Comparing base (4e40ce6) to head (6b9c2a4).

Files Patch % Lines
...nfa-reduction/src/random_projection/hyperparams.rs 3.44% 28 Missing :warning:
...s/linfa-reduction/src/random_projection/methods.rs 0.00% 26 Missing :warning:
...infa-reduction/src/random_projection/algorithms.rs 21.87% 25 Missing :warning:
...ms/linfa-reduction/src/random_projection/common.rs 0.00% 4 Missing :warning:


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #332      +/-   ##
==========================================
- Coverage   36.18%   35.87%   -0.32%     
==========================================
  Files          92       96       +4     
  Lines        6218     6303      +85     
==========================================
+ Hits         2250     2261      +11     
- Misses       3968     4042      +74     


codecov-commenter avatar Feb 17 '24 09:02 codecov-commenter

I've done a quick review; this looks good to me, but I have requested that @bytesnake also give it a look, as he is probably more familiar with the algorithm side of things.

quietlychris avatar Feb 17 '24 15:02 quietlychris


Thank you for reviewing, @relf @quietlychris.

bytesnake avatar Feb 22 '24 07:02 bytesnake

Thank you for the reviews, and @relf for the suggestions, I have implemented them.

Changes:

  • The rng field for both random projection structs is no longer optional; it defaults to Xoshiro256Plus with a fixed seed if not provided by the user.
  • Renamed the precision parameter to eps: increasing this parameter results in a lower-dimensional embedding and often lower classification performance, so "precision" was a misleading name.
  • Added a check that the projection actually reduces the dimension of the data, returning an error otherwise, along with tests for this behavior.
  • Added a test for the function that computes the embedding dimension from a given epsilon, checked against values from scikit-learn (a sketch of this computation follows this list).
  • Fixed a reference issue in the docs and other minor issues.
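
For readers less familiar with that function: the embedding dimension comes from the Johnson-Lindenstrauss lemma, which bounds the number of components needed to preserve pairwise distances up to a factor of 1 ± eps. A minimal sketch of the bound used by scikit-learn's johnson_lindenstrauss_min_dim (the rounding behavior in the linfa implementation may differ):

/// Minimum embedding dimension guaranteed by the Johnson-Lindenstrauss lemma
/// for `n_samples` points and distortion `eps`: 4 ln(n) / (eps^2/2 - eps^3/3).
fn jl_min_dim(n_samples: usize, eps: f64) -> usize {
    let denom = eps.powi(2) / 2.0 - eps.powi(3) / 3.0;
    (4.0 * (n_samples as f64).ln() / denom) as usize
}

fn main() {
    // 1000 samples with eps = 0.1 already require an embedding dimension in the
    // thousands; the bound is only practical for moderate values of eps.
    println!("{}", jl_min_dim(1_000, 0.1));
}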

GBathie avatar Mar 01 '24 17:03 GBathie

Thanks for your contribution and the changes. Now that the Gaussian and sparse random projection code look very much alike, I am wondering whether you could refactor even further by using zero-sized types and a single generic RandomProjection type, something like:

struct Gaussian;
struct Sparse;

pub struct RandomProjectionValidParams<RandomMethod, R: Rng + Clone> {
    pub params: RandomProjectionParamsInner,
    pub rng: Option<R>,
    pub method: std::marker::PhantomData<RandomMethod>,
}

pub struct RandomProjectionParams<RandomMethod, R: Rng + Clone>(
    pub(crate) RandomProjectionValidParams<RandomMethod, R>,
);

pub struct RandomProjection<RandomMethod, F: Float> {
    projection: Array2<F>,
    method: std::marker::PhantomData<RandomMethod>,
}

pub type GaussianRandomProjection<F> = RandomProjection<Gaussian, F>;
pub type SparseRandomProjection<F> = RandomProjection<Sparse, F>;

impl<F, Rec, T, R> Fit<Rec, T, ReductionError> for RandomProjectionValidParams<Gaussian, R>
where
    F: Float,
    Rec: Records<Elem = F>,
    StandardNormal: Distribution<F>,
    R: Rng + Clone,
{
    type Object = RandomProjection<Gaussian, F>;

    fn fit(&self, dataset: &linfa::DatasetBase<Rec, T>) -> Result<Self::Object, ReductionError> {...}
}

impl<F, Rec, T, R> Fit<Rec, T, ReductionError> for RandomProjectionValidParams<Sparse, R>
where
    F: Float,
    Rec: Records<Elem = F>,
    StandardNormal: Distribution<F>,
    R: Rng + Clone,
{
    type Object = RandomProjection<Sparse, F>;

    fn fit(&self, dataset: &linfa::DatasetBase<Rec, T>) -> Result<Self::Object, ReductionError> {...}
}

...

What do you think?

I think that's a very good suggestion; it will be easier to maintain than the previous approach, which used a macro to avoid code duplication. 6b9c2a4 implements a variation of this idea: all the logic has been refactored, and the behavior that depends on the projection method is encapsulated in a ProjectionMethod trait. This also makes implementing other projection methods significantly easier.
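
For illustration, here is a minimal sketch of what such a trait-based design could look like. The trait name ProjectionMethod comes from the comment above, but the generate_matrix method, its signature, and the constants below are assumptions for the sketch, not the exact code in 6b9c2a4:

use ndarray::Array2;
use rand::Rng;
use rand_distr::{Distribution, StandardNormal};

struct Gaussian;
struct Sparse;

// Hypothetical trait: each zero-sized marker type knows how to build its own
// projection matrix, so the rest of RandomProjection can stay generic.
trait ProjectionMethod {
    fn generate_matrix<R: Rng>(n_features: usize, n_components: usize, rng: &mut R)
        -> Array2<f64>;
}

impl ProjectionMethod for Gaussian {
    fn generate_matrix<R: Rng>(n_features: usize, n_components: usize, rng: &mut R)
        -> Array2<f64> {
        // Entries drawn i.i.d. from N(0, 1 / n_components).
        let std_dev = (1.0 / n_components as f64).sqrt();
        Array2::from_shape_fn((n_features, n_components), |_| {
            let z: f64 = StandardNormal.sample(rng);
            z * std_dev
        })
    }
}

impl ProjectionMethod for Sparse {
    fn generate_matrix<R: Rng>(n_features: usize, n_components: usize, rng: &mut R)
        -> Array2<f64> {
        // Achlioptas-style sparse projection: +/- sqrt(3 / n_components) with
        // probability 1/6 each, and 0 with probability 2/3.
        let v = (3.0 / n_components as f64).sqrt();
        Array2::from_shape_fn((n_features, n_components), |_| match rng.gen_range(0..6u8) {
            0 => v,
            1 => -v,
            _ => 0.0,
        })
    }
}

The appeal of the zero-sized marker types is that RandomProjection<Method, F> remains a single generic struct, while each method only supplies its own matrix generation.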

GBathie avatar Mar 02 '24 16:03 GBathie