smartcore icon indicating copy to clipboard operation
smartcore copied to clipboard

First draft of the new n-dimensional arrays + NB use case

Open VolodymyrOrlov opened this issue 3 years ago • 20 comments

This PR will eventually solve #85. The goal is to have an abstract layer that can be implemented by concrete library that enables all traditional functionality provided by n-dimensional arrays: array manipulation, transformation, basic linear algebra routines, views e.t.c.

The abstract layer I am proposing here enables us to:

  • have arrays that can hold multiple types, including strings
  • implement more efficient ML algorithms that minimize memory copy operations
  • have ML algorithms that declare more granular requirements to input and output types, e.g. we can declare an input type for an algorithm that does need a simple non-mutable array w/o typical linear algebra routines.

As a bonus, this PR introduces a row-major dense matrix that might get handy in some cases.

VolodymyrOrlov avatar May 31 '21 19:05 VolodymyrOrlov

@morenol there are multiple problems with this PR, like printlns, lack of documentation, e.t.c. and I plan to fix these within next 3-4 of weeks. But I thought I'd share this PR with you so that you can take a look on proposed changes earlier. Besides, I've decided to try new abstract layer on NB first.

VolodymyrOrlov avatar May 31 '21 19:05 VolodymyrOrlov

Codecov Report

Merging #108 (e99bd70) into v0.5-wip (e99bd70) will not change coverage. The diff coverage is n/a.

:exclamation: Current head e99bd70 differs from pull request most recent head 55d2c16. Consider uploading reports for the commit 55d2c16 to get more accurate results

@@            Coverage Diff            @@
##           v0.5-wip     #108   +/-   ##
=========================================
  Coverage     38.72%   38.72%           
=========================================
  Files            86       86           
  Lines          7031     7031           
=========================================
  Hits           2723     2723           
  Misses         4308     4308           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov-commenter avatar May 31 '21 19:05 codecov-commenter

@VolodymyrOrlov this looks nice.

Since we are rethinking the way that the arrays are represented I think that we should consider the usage of const generics.

Any concrete ideas how we can use const generics to improve the proposed abstract layer? For example, const generics can be used to define number of dimensions in S

VolodymyrOrlov avatar Jun 01 '21 21:06 VolodymyrOrlov

I think that I edited your comment instead of a response: here it is

Any concrete ideas how we can use const generics to improve the proposed abstract layer? For example, const generics can be used to define number of dimensions in S

I am not sure how we can use them in the abstract layer. I was thinking in the structs defined in linalg::dense::. We can create an special type for static sized arrays. But maybe it is better to keep it simple and use nalgebra and ndarray when we want that kind of optimization

morenol avatar Jun 01 '21 23:06 morenol

Great work Volodymyr! This looks really promising. Just few notes to check if I understood the new abstraction and to help devs in digging:

  • we are going to have new traits groups to define types for array/vectors elements: Num/Number and FloatNumber (I like it!) directly using num_traits;
  • A lot of things going on in linalg/base.rs:
    • new smartcore Array-related type declarations to provide a unified interface, including the ones from optimised strides read/write.
    • this interface is the toolbox to be leveraged by higher-level types, in particular the new implementations in linalg/dense/;
  • we have now "views" types that, I assume, are to provide segmentation for the unerlying structs so to allow reusing of the same instantiated struct and avoid copying;
  • supercool the "Decomposable"s for structs decomposition.

Will help with documentation as soon as it is relatively stable.

@steboss check it out!

Mec-iS avatar Jul 03 '21 12:07 Mec-iS

Great work Volodymyr! This looks really promising. Just few notes to check if I understood the new abstraction and to help devs in digging:

  • we are going to have new traits groups to define types for array/vectors elements: Num/Number and FloatNumber (I like it!) directly using num_traits;

  • A lot of things going on in linalg/base.rs:

    • new smartcore Array-related type declarations to provide a unified interface, including the ones from optimised strides read/write.
    • this interface is the toolbox to be leveraged by higher-level types, in particular the new implementations in linalg/dense/;
  • we have now "views" types that, I assume, are to provide segmentation for the unerlying structs so to allow reusing of the same instantiated struct and avoid copying;

  • supercool the "Decomposable"s for structs decomposition.

Will help with documentation as soon as it is relatively stable.

@Steboss check it out!

That's right! The goal is to build a rich interface for 1- and 2-dimensional arrays that can be extended by anyone later. On one hand we can make ML algorithms more efficient by relying on a concrete library, like nalgebra or ndarray. Investing time into an abstract layer that facilitates integration with 3rd partly libraries, on the other hand, gives us a lot of flexibility and saves time later, if we find it necessary to bring in some other library in the future.

VolodymyrOrlov avatar Jul 03 '21 22:07 VolodymyrOrlov

@morenol please take a look, there are still missing implementations but the code is at least cleaned up.

Mec-iS avatar Sep 20 '22 16:09 Mec-iS

  • conflicts with development should be now solved
  • run the linter, try to compile
  • there are some types that are not correct, in particular in src/linear/ridge_regression.rs and src/linear/logistic_regression.rs
  • tests need to be double-checked against the new typed structures and implementations; possibly write new ones
  • there is a problem with distance metrics
  • ... (probably hundreds of things more) 💯

Mec-iS avatar Sep 20 '22 17:09 Mec-iS

please fetch this branch and run cargo build --lib, maybe there are some errors for which the solution is evident to you people.

Mec-iS avatar Sep 21 '22 13:09 Mec-iS

please fetch this branch and run cargo build --lib, maybe there are some errors for which the solution is evident to you people.

I can take a look later today

morenol avatar Sep 21 '22 14:09 morenol

sure, when you have time. If we can fix the static analyzer (in src/ensemble) and compiler errors maybe we can move into running tests 👍🏼 Also I don't understand how the implementation for RandomForestClassifier.samples should work as in this branch the property has been removed.

Mec-iS avatar Sep 21 '22 15:09 Mec-iS

@montanalow please take a look to ensemble and linear modules as you are more familiar with those procedures. The objective is to make things to compile and then move into making the tests to pass. Also I saw some conflicts when pulling the very latest changes.

Mec-iS avatar Sep 22 '22 10:09 Mec-iS

great job @morenol Can you tell me why we are missing definitions for ElasticNetSearchParameters, LassoSearchParameters, LinearRegressionSearchParameters? that instead are present in development? I will try a merge, another hundred conflicts to solve

Mec-iS avatar Sep 22 '22 14:09 Mec-iS

ElasticNetSearchParameters

not completely sure, we probably need to split in several PRs once we have all the conflicts solved in order to make it easy to review

morenol avatar Sep 22 '22 14:09 morenol

I've got most of the merge conflicts sorted in my personal branch, but we need to apply many of the updates in this branch to the new search params, and apply the rng changes to this branch. I think I understand it enough to get it done in an hour or so.

montanalow avatar Sep 22 '22 17:09 montanalow

@montanalow great! I was splitting the changes into multiple branches but if you can do in one pass that would be awesome.

Mec-iS avatar Sep 22 '22 17:09 Mec-iS

@montanalow great! I was splitting the changes into multiple branches but if you can do in one pass that would be awesome.

#171 is up that should resolve all merge conflicts in this branch, although it's a fairly large change set...

montanalow avatar Sep 22 '22 18:09 montanalow

The library now compiles but there are still a lot of runtime errors and tests to fix.

I have not yet a clear understanding about what it is going to be the substitute for DenseMatrix, I thought it should have been Array2 but how to instantiate it? For example this test. Also predict_oob is still broken as I don't know how to implement it without using the samples property that has been removed (probably is supposed to use arrays views?).

I am not still sure we can merge this all at once, I will probably start opening a PR to merge num.rs so we can do it gradually.

Mec-iS avatar Sep 23 '22 15:09 Mec-iS

I am seeing discrepancies between the metrics::recall module in development and the one in this branch. Things are getting too entangled, please move and follow up to #173 for a more stepwise implementation.

Mec-iS avatar Sep 26 '22 16:09 Mec-iS

please help. it is quite easy, just follow the implementations in the other ported modules. cc: @morenol @montanalow @titoeb

Missing modules,

  • preprocessing
  • readers
  • svm

Mec-iS avatar Oct 11 '22 15:10 Mec-iS

All work moved to https://github.com/smartcorelib/smartcore/tree/v0.5-wip

Mec-iS avatar Oct 13 '22 18:10 Mec-iS