Dataset improvements
Most of the time the dataset is implemented with `ArrayBase<D, Ix2>` for records and `ArrayBase<D, Ix1>` for targets. We could simplify a lot of impls by:
- [x] renaming `Dataset` to `DatasetBase`
- [x] implement `Dataset` and `DatasetView`
- [ ] implement `Dataframe`
- [x] support multi-targets and label probabilities
Implement for the new types:
- [x] k-folding
- [x] non-copying k-folding `iter_fold`
- [x] bootstrapping
- [x] shuffling
Improve ergonomics:
- [x] feature naming
- [x] target naming
Hi, I would love to help with this, but I can't say that I understand completely what you have in mind. Could you explain it a little more? Is the following what you had in mind?
- Keep what's already implemented for `Dataset` but rename the struct to `DatasetBase` (kind of like ndarray with `Array` and `ArrayBase`?)
- Add new types `Dataset` and `DatasetView` that use `ArrayBase` and `ArrayView` for the records and targets like you described. Would one do this by `pub type Dataset = DatasetBase<ArrayBase<D, Ix2>, ArrayBase<D, Ix1>>`? (see the sketch below)
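For concreteness, a minimal sketch of the aliasing scheme I have in mind, mirroring ndarray's `Array`/`ArrayBase` split (field names and the exact aliases are my guess, not the actual linfa definition):

```rust
use ndarray::{Array1, Array2, ArrayView1, ArrayView2};

/// The generic base struct, parametrized over the storage of records/targets.
pub struct DatasetBase<R, T> {
    pub records: R,
    pub targets: T,
}

/// Owning and borrowing aliases, analogous to ndarray's Array/ArrayView.
pub type Dataset<F> = DatasetBase<Array2<F>, Array1<F>>;
pub type DatasetView<'a, F> = DatasetBase<ArrayView2<'a, F>, ArrayView1<'a, F>>;
```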
both points are right, we would just follow the ndarray model more closely. Because of the coming holidays, my plan was to schedule that for the beginning of next year. You can give it a try, though it's rather diligence work as all algorithms have to be updated :D
Well I'll try then and I'll let you know how it goes 👍🏻
Hi, is there some specific reason why some datasets in the datasets crate are returned using a `Vec` for the targets instead of an `Array1`?
no, this should default to `Array1`
I have one doubt about `iter_fold`. Shouldn't iterator items possibly outlive the object from which the iterator is created? I mean, one would have to:
- take a mutable reference to the original dataset
- split that reference into mutable slices
- for each item in the iterator swap the content of some slices and then recombine the slices in two arrays
But at that point, the content of the various items would overlap, and calling `collect` on that iterator would yield a mess, right? I mean, if one wanted to use two items from the iterator at the same time, then there would be a conflict on the contents of the slices composing the two items, right?
Wouldn't such an implementation need something like this proposal?
I'm asking because as a rust noob I just can't wrap my head around it
We can't ensure that nobody is holding a reference while we modify the content. One solution would involve `RefCell`, which basically moves the borrow check to runtime and panics when the content is currently borrowed for mutation.
Another solution would drop the iterator completely and just take a closure with `Dataset -> Model`. The caller would then ensure that the reference to the dataset is dropped after the model is trained.
And what would a function like the second one you mention return? The k validation scores or the k pairs (model, validation set)? I'm trying to make some sort of connection with scikit-learn's `cross_val_score`.
It should return `impl Iterator<Item = (model, validation set)>`, but I don't know whether the model type can be chosen based on the return type of a closure
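Not linfa's API, just a minimal self-contained sketch of the closure-based idea over plain slices; collecting the folds eagerly into a `Vec` also sidesteps the overlapping-items problem discussed above:

```rust
// The caller's `fit` closure turns a training split into a model; the model
// type is simply the closure's return type.
fn fold_eval<M>(
    data: &[f64],
    k: usize,
    mut fit: impl FnMut(&[f64]) -> M,
) -> Vec<(M, Vec<f64>)> {
    let fold_len = data.len() / k;
    (0..k)
        .map(|i| {
            let (lo, hi) = (i * fold_len, (i + 1) * fold_len);
            // training split: everything outside the i-th validation fold
            let train: Vec<f64> = data[..lo].iter().chain(&data[hi..]).copied().collect();
            (fit(&train), data[lo..hi].to_vec())
        })
        .collect()
}

fn main() {
    let data: Vec<f64> = (0..10).map(f64::from).collect();
    // the "model" here is just the mean of the training split
    let folds = fold_eval(&data, 5, |train| {
        train.iter().sum::<f64>() / train.len() as f64
    });
    for (model, validation) in folds {
        println!("model = {model:.2}, validation = {validation:?}");
    }
}
```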
Ok so I thought I had it because now it compiles, but there's one tiny problem. I tried to implement `iter_fold` for fittable algorithms, but the problem here is that `fit` requires that the dataset used for fitting has the same lifetime as the model. That means that the view that is given as input to the closure can't be used to train the model, because it's too short-lived.
mh, this was introduced to associate the correct lifetime with the kernel matrix here. We could copy the kernel matrix and avoid the lifetime there.
I've been a little busy lately and I'm proceeding slowly so I thought I'd give a little "status-report" here, also to get a second opinion on what I've been doing.
First part:
- removed lifetime requirement on the training set for fittable models;
- completed the implementation of `iter_fold` for fittable models;
- tested the implementation on a mock struct implementing `Fit`;
Second part:
Here things get a little less straightforward. I didn't want to force people using SVMs to clone their kernel every time, so I am trying to change SVMs to accept either an owned kernel or a "view" of one. To that end I changed `linfa-kernel` in this way:
- Added two new enum variants to `KernelInner` to be able to take a view of sparse or dense kernel matrices;
- Renamed `Kernel` to `KernelBase` in `linfa-kernel`;
- Added types:
  - `KernelArray`: uses `ArrayBase<D, Ix2>` as records
  - `Kernel`: uses `Array2` as records
  - `KernelView`: uses `ArrayView2` as records
- Moved implementations relying only on the inner matrix to `KernelBase` and the ones relying also on the dataset to `KernelArray`;
- Implemented `new` methods for `Kernel` and `KernelView` where the first one takes ownership of the input data and both use owned inner matrices;
- Implemented `view` methods for `Kernel` and `KernelView` where the first one takes a view of the dataset and both take views of their inner matrices;
This way instead of taking a reference to the kernel one can call `kernel.view()` and not worry about data being cloned.
At this point the problem is that I would need to be able to differentiate between an SVM that has a view of data that lives as long as itself and one that owns the data itself. I don't know if that can be done with just one type; I'm trying not to complicate the code too much.
Now I still need to modify `linfa-svm` to adjust it to these changes, but I plan to do it sometime in the next weeks. One thing that bothers me is: since the `KernelInner` can contain either a view or owned data, I'm forced to specify a lifetime for the type even if I am not planning to use the view type. For example, the type `Kernel` still needs a lifetime even if it can't have a view as an inner matrix, but I don't know if that's avoidable somehow.
> I've been a little busy lately and I'm proceeding slowly so I thought I'd give a little "status-report" here, also to get a second opinion on what I've been doing.
I'm currently relocating to a different city and this completely slipped my radar, sorry for the late response.
> First part:
>
> - removed lifetime requirement on the training set for fittable models;
> - completed the implementation of `iter_fold` for fittable models;
> - tested the implementation on a mock struct implementing `Fit`;
:+1:
> Second part:
>
> Here things get a little less straightforward. I didn't want to force people using SVMs to clone their kernel every time, so I am trying to change SVMs to accept either an owned kernel or a "view" of one. To that end I changed `linfa-kernel` in this way:
>
> - Added two new enum variants to `KernelInner` to be able to take a view of sparse or dense kernel matrices;
> - Renamed `Kernel` to `KernelBase` in `linfa-kernel`;
> - Added types:
>   - `KernelArray`: uses `ArrayBase<D, Ix2>` as records
>   - `Kernel`: uses `Array2` as records
>   - `KernelView`: uses `ArrayView2` as records
> - Moved implementations relying only on the inner matrix to `KernelBase` and the ones relying also on the dataset to `KernelArray`;
> - Implemented `new` methods for `Kernel` and `KernelView` where the first one takes ownership of the input data and both use owned inner matrices;
> - Implemented `view` methods for `Kernel` and `KernelView` where the first one takes a view of the dataset and both take views of their inner matrices;
>
> This way instead of taking a reference to the kernel one can call `kernel.view()` and not worry about data being cloned.
this seems to be the straightforward way to have a view on a kernel matrix.
> At this point the problem is that I would need to be able to differentiate between an SVM that has a view of data that lives as long as itself and one that owns the data itself. I don't know if that can be done with just one type; I'm trying not to complicate the code too much.
you are here in a similar spot to the `std::borrow::Cow` implementation, because you cannot suppress the lifetime in the type signature
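To make the `Cow` analogy concrete (illustrative types only): an enum with a borrowing variant drags that lifetime into every use, even for values that are actually owned:

```rust
// An enum that may hold a view must carry the view's lifetime in its type,
// exactly like Cow; 'static is the only way to "opt out" for owned values.
enum MaybeOwned<'a, T> {
    Owned(Vec<T>),
    Borrowed(&'a [T]),
}

fn main() {
    let owned: MaybeOwned<'static, f64> = MaybeOwned::Owned(vec![1.0, 2.0]);
    let data = vec![3.0, 4.0];
    let view: MaybeOwned<'_, f64> = MaybeOwned::Borrowed(&data);
    let _ = (owned, view);
}
```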
> Now I still need to modify `linfa-svm` to adjust it to these changes, but I plan to do it sometime in the next weeks. One thing that bothers me is: since the `KernelInner` can contain either a view or owned data, I'm forced to specify a lifetime for the type even if I am not planning to use the view type. For example, the type `Kernel` still needs a lifetime even if it can't have a view as an inner matrix, but I don't know if that's avoidable somehow.
no, you cannot avoid this without introducing `SvmView` and `Svm`, but then you have to decide what to return in `SvmParams::fit`.
I think that it would be reasonable to copy only those sample points which are support vectors. The whole point of SVM is to find a few SVs which can describe the separating hyperplane, and we can easily copy those. (btw. this only applies in the non-linear case; otherwise we do not need to cache the samples and have memory complexity O(n).) The following changes would be reasonable:
- introduce a `KernelInner::Dynamic`: a placeholder where the distance function is calculated on-the-fly without caching
- add a `Kernel::select` function: this takes the indices of the sub-samples and returns a copy of the kernel with owned records and an empty (i.e. `KernelInner::Dynamic`) kernel matrix (a sketch follows below)
- store only the smaller kernel of the support vectors in `Svm` and don't borrow the whole kernel beyond the fitting process
- in `Svm::predict` the `weighted_sum` is only passed the actual scaling factors of the SV solution (we could also think about not storing the kernel at all for the linear case, but we would probably rename `Kernel::select` then)
hope this is plausible, I think we should not make the API more complex than it already is
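A rough sketch of the record-copying part a `Kernel::select` could do, assuming ndarray types (an illustration, not linfa code):

```rust
use ndarray::{Array2, Axis};

/// Copy only the rows belonging to the support vectors into a new owned array;
/// ndarray's `select` allocates a fresh Array2 containing just those rows.
fn select_rows(records: &Array2<f64>, support_idx: &[usize]) -> Array2<f64> {
    records.select(Axis(0), support_idx)
}
```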
> you are here in a similar spot to the `std::borrow::Cow` implementation, because you cannot suppress the lifetime in the type signature
One way I've found is to define a trait `Inner` with all methods concerning only inner matrices and implement it for `Array2`, `ArrayView2`, `CsMat`, and `CsMatView`. Then I define an enum:
```rust
pub enum KernelInnerBase<K1: Inner, K2: Inner> {
    Dense(K1),
    Sparse(K2),
}
```
Now the definition of kernel becomes:
```rust
pub struct Kernel<R: Records, K1: Inner, K2: Inner> {
    // ...
    inner: KernelInnerBase<K1, K2>,
    // ...
}
```
So I can define `Kernel<F>` as `Kernel<Array2<F>, Array2<F>, CsMat<F>>`. Then the naming gets a little confusing, because I think both a `KernelArrayView` (`Kernel<ArrayView2<'a, F>, Array2<F>, CsMat<F>>`) and a `KernelView` (`Kernel<ArrayView2<'a, F>, ArrayView2<'a, F>, CsMatView<'a, F>>`) would be useful.
With this kind of implementation, all methods like `diag` and such become just calls to the corresponding methods of either `K1` or `K2`, depending on whether the inner matrix is sparse or not.
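A minimal sketch of that forwarding, with an illustrative `Inner` trait reduced to a single method (the real trait would cover `diag` and friends for `Array2`, `ArrayView2`, `CsMat` and `CsMatView`):

```rust
/// Illustrative trait over inner kernel matrices (assumed, not linfa's definition).
pub trait Inner {
    fn diag(&self) -> Vec<f64>;
}

pub enum KernelInnerBase<K1: Inner, K2: Inner> {
    Dense(K1),
    Sparse(K2),
}

impl<K1: Inner, K2: Inner> KernelInnerBase<K1, K2> {
    /// Every matrix method just forwards to whichever variant is present.
    pub fn diag(&self) -> Vec<f64> {
        match self {
            KernelInnerBase::Dense(m) => m.diag(),
            KernelInnerBase::Sparse(m) => m.diag(),
        }
    }
}
```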
I agree that the API should stay as simple as possible. Right now I'm deep into studying/taking my finals so I'll slowly get to the points you made about SVM in the next weeks. 👍🏻
What if in SVMs the kernel was specified as a parameter? I mean something like this:
`Svm::params().with_kernel(Kernel::params()).fit(&dataset)`

That way `fit` would take the original dataset instead of the transformed kernel and inside `fit` one could do something like this:

```rust
let kernel_matrix = self.kernel.transform(dataset);
// ...
svm.kernel = kernel_matrix;
svm.dataset = dataset;
// ...
```

without the need to store `dataset` inside the kernel. Then `Svm` could be improved to pick the support vectors from `dataset` and clone only them instead of the whole set.
> What if in SVMs the kernel was specified as a parameter? I mean something like this:
>
> `Svm::params().with_kernel(Kernel::params()).fit(&dataset)`
this should be called `with_kernel_params` or take the actual kernel matrix; alternatively we could add separate functions for the different kernel functions, like `.with_gaussian(eps)` etc.
> That way `fit` would take the original dataset instead of the transformed kernel and inside `fit` one could do something like this:
>
> ```rust
> let kernel_matrix = self.kernel.transform(dataset);
> // ...
> svm.kernel = kernel_matrix;
> svm.dataset = dataset;
> // ...
> ```
>
> without the need to store `dataset` inside the kernel. Then `Svm` could be improved to pick the support vectors from `dataset` and clone only them instead of the whole set.
sounds like a good solution :+1:
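A toy sketch of the parameter flow being agreed on here (every name is illustrative; linfa's actual params/fit machinery differs):

```rust
// Kernel parameters travel inside the SVM parameters; `fit` receives the raw
// dataset and computes the kernel matrix internally.
struct KernelParams {
    gaussian_eps: f64,
}

struct SvmParams {
    kernel: Option<KernelParams>,
}

struct Svm {
    // stand-in for the reduced kernel of the support vectors
    kernel_matrix: Vec<f64>,
}

impl SvmParams {
    fn with_kernel(mut self, params: KernelParams) -> Self {
        self.kernel = Some(params);
        self
    }

    fn fit(self, dataset: &[f64]) -> Svm {
        let params = self.kernel.expect("kernel parameters required");
        // a real implementation would evaluate the kernel function here and
        // keep only the support-vector rows; we just copy the data as a stub
        let _eps = params.gaussian_eps;
        Svm { kernel_matrix: dataset.to_vec() }
    }
}

fn main() {
    let dataset = vec![1.0, 2.0, 3.0];
    let model = SvmParams { kernel: None }
        .with_kernel(KernelParams { gaussian_eps: 0.1 })
        .fit(&dataset);
    println!("{:?}", model.kernel_matrix);
}
```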
in Elements of Statistical Learning the training vectors are sometimes called predictors; perhaps that's a better name than records
To me, calling `dataset.records()` sounds more straightforward than `dataset.predictors()`, just from what I would generally expect from a dataset-like structure, but since this structure is focused purely on ML maybe it would make sense? I know that Understanding Machine Learning uses predictor to refer to the learned hypothesis and instances to refer to domain points, so I think that predictors could be misleading to people who learned ML through that book. Other names that I heard a lot in my uni courses are samples or examples, but I don't know how often any of those terms are actually used in the literature.
Good point! ESL uses predictors as a synonym for training data in the Least Angle Regression section, and it should only be used for those features which are used in the prediction. If you have selected a subset to decrease the variance of your model, then only those are the predictors, as opposed to records. I don't like samples or examples because they can refer to the records as well as the targets. I think we should stick to records.
I agree, even if it may not be the best possible name I think that it is general enough to at least avoid conflicts 👍🏻
I've run into a problem when trying to implement the proposed changes to the regression metrics traits.
Since both `SingleTargetRegression` and `MultiTargetRegression` provide methods with the same name, as soon as I implement both traits for a certain type I get errors, because there are multiple methods available with the same name, even if they apply to different compared targets.
I don't know if there is a nice way to solve this; one solution is to keep defaulting to multi-target as it happens now, another is to make the API a little uglier and rename the methods to `single_target_metric` and `multi_target_metric`.
If I use associated types it gets even worse, because I can only provide one implementation (and so only one option for the associated type) for each `AsTargets` type, since I can't take the associated type as a template parameter in the trait definition.
I think that the problem can't be overcome in Rust, since there is no concept of function overloading.
This is what I've tried so far:
Case 1: no associated types
```rust
pub trait SingleTargetRegression<F, T>: AsTargets<F> {
    fn max_error(&self, compare_to: &T) -> ... { ... }
}
pub trait MultiTargetRegression<F, T>: AsTargets<F> {
    fn max_error(&self, compare_to: &T) -> ... { ... }
}

impl<F, T1> SingleTargetRegression<F, T1> for SomeType {}
impl<F, T2> MultiTargetRegression<F, T2> for SomeType {}

// error: multiple methods with the same name
some_value.max_error(...)
```
This means that I can't have `Array2` use single-target metrics when compared to an `Array1` and multi-target metrics when compared to an `Array2`.
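A self-contained demonstration of the collision with toy traits (not the real linfa ones); fully-qualified syntax can disambiguate, but every call site then needs the verbose form:

```rust
trait SingleTargetRegression {
    fn max_error(&self) -> f64 { 1.0 }
}
trait MultiTargetRegression {
    fn max_error(&self) -> f64 { 2.0 }
}

struct Targets;
impl SingleTargetRegression for Targets {}
impl MultiTargetRegression for Targets {}

fn main() {
    let t = Targets;
    // t.max_error(); // error[E0034]: multiple applicable items in scope
    println!("{}", SingleTargetRegression::max_error(&t));
    println!("{}", <Targets as MultiTargetRegression>::max_error(&t));
}
```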
Case 2: Associated types
Defining the type in the trait definition:

```rust
pub trait SingleTargetRegression<F, T>: AsTargets<F> {
    // error: associated type defaults are unstable
    type Tar = T;
    fn max_error(&self, compare_to: &T) -> ... { ... }
}
```
Defining the type in the impl:

```rust
pub trait SingleTargetRegression<F>: AsTargets<F> {
    type Tar: AsTargets<F>;
    fn max_error(&self, compare_to: &Self::Tar) -> ... { ... }
}

// error: unconstrained type parameter
impl<F, T2> SingleTargetRegression<F> for SomeType {
    type Tar = T2;
}
```
Implementing twice:

```rust
pub trait SingleTargetRegression<F>: AsTargets<F> {
    type Tar: AsTargets<F>;
    fn max_error(&self, compare_to: &Self::Tar) -> ... { ... }
}

impl<F> SingleTargetRegression<F> for SomeType {
    type Tar = BaseType;
}

// error: conflicting implementation
impl<F> SingleTargetRegression<F> for SomeType {
    type Tar = BaseType2;
}
```
mh, I wasn't aware of the `Array2`/`Array1` implementation, but yeah, I think it is not possible with the current trait system (though I'm absolutely not an expert there). The only workaround I see is to implement `MultiTargetRegression` like this:
```rust
impl<F: Float, D: Data<Elem = F>> MultiTargetRegression<F, ArrayBase<D, Ix1>> for ArrayBase<D, Ix2> {}
```

and hope that most people are using `predict` in combination with this trait.
Yes, it is implemented like that, just allowing the two arrays to have different data; the only issue is that `Array2` can only implement multi-target metrics.
(In this comment and in the ones before I am using `Array1`/`Array2` loosely, to indicate both owned arrays and views.)
a comment under the Reddit post mentioned polars, and I'm now thinking about adding a dataframe type:
```rust
pub type Dataframe<T> = DatasetBase<polars::DataFrame, polars::ChunkedArray<T>>;
```
observe the missing type parameter on `polars::DataFrame`; this would allow us to represent generic records with categorical/continuous features mixed. For example, naive Bayes could implement `Fit` for a `Dataframe` as well and handle the categorical features with multinomial distributions
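To illustrate the kind of mixed records this enables (a toy sketch, not polars' actual representation): each column carries its own feature type, and an algorithm like naive Bayes could branch on the variant per feature.

```rust
// A dataframe-like structure is not parametrized by a single element type,
// so columns can be continuous or categorical side by side.
enum Column {
    Continuous(Vec<f64>),
    Categorical(Vec<u32>), // category ids
}

struct MixedRecords {
    columns: Vec<Column>,
}

fn main() {
    let records = MixedRecords {
        columns: vec![
            Column::Continuous(vec![0.3, 1.7, 2.2]),
            Column::Categorical(vec![0, 1, 1]),
        ],
    };
    // e.g. naive Bayes: gaussian likelihood for Continuous,
    // multinomial likelihood for Categorical
    let _ = records.columns.len();
}
```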
Wow, I didn't know linfa was being posted on Reddit 🚀 I had never thought of categorical features; we would still need to convert categorical features into continuous ones before being able to use them in most algorithms, right? Or would this be something to be used only in some specific algorithms, like the naive Bayes you mentioned?
the latter, for example it makes no sense for ICA to support categorical features, but we could extend the scope of some algorithms with categorical types
I guess that dataframes would also be useful for count vectorizers right?
you can write a transformer `f(Array<String>) -> Array2` without dataframes, but they would make handling strings easier
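As a bare-bones illustration of such a transformer with plain ndarray types (names and whitespace tokenization are simplistic assumptions):

```rust
use ndarray::{Array1, Array2};
use std::collections::HashMap;

/// Count occurrences of each vocabulary token per document:
/// rows are documents, columns are vocabulary entries.
fn count_vectorize(docs: &Array1<String>, vocab: &[&str]) -> Array2<usize> {
    let index: HashMap<&str, usize> =
        vocab.iter().enumerate().map(|(i, w)| (*w, i)).collect();
    let mut counts = Array2::zeros((docs.len(), vocab.len()));
    for (row, doc) in docs.iter().enumerate() {
        for token in doc.split_whitespace() {
            if let Some(&col) = index.get(token) {
                counts[(row, col)] += 1;
            }
        }
    }
    counts
}
```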
I think that `.with_feature_names` can be given a vector with more names than there are features.