Dataset improvements
Most of the time the dataset is implemented with `ArrayBase<D, Ix2>` for records and `ArrayBase<D, Ix1>` for targets. We could simplify a lot of impls by:
- [x] renaming `Dataset` to `DatasetBase`
- [x] implement `Dataset` and `DatasetView`
- [ ] implement `Dataframe`
- [x] support multi-targets and label probabilities
Implement for the new types:
- [x] k-folding
- [x] non-copying k-folding `iter_fold`
- [x] bootstrapping
- [x] shuffling
Improve ergonomics:
- [x] feature naming
- [x] target naming
Hi, I would love to help with this, but I can't say that I understand completely what you have in mind. Could you explain it a little more? Is the following what you had in mind?
- Keep what's already implemented for `Dataset` but rename the struct to `DatasetBase` (kind of like ndarray with `Array` and `ArrayBase`?)
- Add new types `Dataset` and `DatasetView` that use `ArrayBase` and `ArrayView` for the records and targets like you described. Would one do this by `pub type Dataset = DatasetBase<ArrayBase<D, Ix2>, ArrayBase<D, Ix1>>`? (see the sketch below)
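For concreteness, a minimal sketch of the aliasing scheme I have in mind, mirroring ndarray's `Array`/`ArrayBase` split (field names and the exact aliases are my guess, not the actual linfa definition):

```rust
use ndarray::{Array1, Array2, ArrayView1, ArrayView2};

/// The generic base struct, parametrized over the storage of records/targets.
pub struct DatasetBase<R, T> {
    pub records: R,
    pub targets: T,
}

/// Owning and borrowing aliases, analogous to ndarray's Array/ArrayView.
pub type Dataset<F> = DatasetBase<Array2<F>, Array1<F>>;
pub type DatasetView<'a, F> = DatasetBase<ArrayView2<'a, F>, ArrayView1<'a, F>>;
```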
both points are right, we would just follow the ndarray model more closely. Because of the coming holidays, my plan was to schedule that for the beginning of next year. You can give it a try, though it's rather diligence work as all algorithms have to be updated :D
Well I'll try then and I'll let you know how it goes 👍🏻
Hi, is there some specific reason why some datasets in the datasets crate are returned using a `Vec` for the targets instead of an `Array1`?
no, this should default to `Array1`
I have one doubt about `iter_fold`. Shouldn't iterator items possibly outlive the object from which the iterator is created? I mean, one would have to:
- take a mutable reference to the original dataset
- split that reference into mutable slices
- for each item in the iterator swap the content of some slices and then recombine the slices in two arrays
But at that point, the content of the various items would overlap, and calling `collect` on that iterator would yield a mess, right? I mean, if one wanted to use two items from the iterator at the same time, then there would be a conflict on the contents of the slices composing the two items, right?
Wouldn't such an implementation need something like this proposal?
I'm asking because as a rust noob I just can't wrap my head around it
We can't ensure that nobody is holding a reference while we modify the content. One solution would involve `RefCell`, which basically moves the borrow check to runtime and panics when the content is currently borrowed for mutation.
Another solution would drop the iterator completely and just take a closure with `Dataset -> Model`. The caller would then ensure that the reference to the dataset is dropped after the model is trained.
And what would a function like the second one you mention return? The k validation scores or the k pairs (model, validation set)? I'm trying to make some sort of connection with scikit-learn's `cross_val_score`.
It should return `impl Iterator<Item = (model, validation set)>`, but I don't know whether the model type can be chosen based on the return type of a closure
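Not linfa's API, just a minimal self-contained sketch of the closure-based idea over plain slices; collecting the folds eagerly into a `Vec` also sidesteps the overlapping-items problem discussed above:

```rust
// The caller's `fit` closure turns a training split into a model; the model
// type is simply the closure's return type.
fn fold_eval<M>(
    data: &[f64],
    k: usize,
    mut fit: impl FnMut(&[f64]) -> M,
) -> Vec<(M, Vec<f64>)> {
    let fold_len = data.len() / k;
    (0..k)
        .map(|i| {
            let (lo, hi) = (i * fold_len, (i + 1) * fold_len);
            // training split: everything outside the i-th validation fold
            let train: Vec<f64> = data[..lo].iter().chain(&data[hi..]).copied().collect();
            (fit(&train), data[lo..hi].to_vec())
        })
        .collect()
}

fn main() {
    let data: Vec<f64> = (0..10).map(f64::from).collect();
    // the "model" here is just the mean of the training split
    let folds = fold_eval(&data, 5, |train| {
        train.iter().sum::<f64>() / train.len() as f64
    });
    for (model, validation) in folds {
        println!("model = {model:.2}, validation = {validation:?}");
    }
}
```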
Ok so I thought I had it because now it compiles, but there's one tiny problem. I tried to implement `iter_fold` for fittable algorithms, but the problem here is that `fit` requires that the dataset used for fitting has the same lifetime as the model. That means that the view that is given as input to the closure can't be used to train the model, because it's too short-lived.
mh, this was introduced to associate the correct lifetime with the kernel matrix here. We could copy the kernel matrix and avoid the lifetime there.
I've been a little busy lately and I'm proceeding slowly so I thought I'd give a little "status-report" here, also to get a second opinion on what I've been doing.
First part:
- removed lifetime requirement on the training set for fittable models;
- completed the implementation of `iter_fold` for fittable models;
- tested the implementation on a mock struct implementing `Fit`;
Second part:
Here things get a little less straightforward. I didn't want to force people using SVMs to clone their kernel every time, so I am trying to change SVMs to accept either an owned kernel or a "view" of one. To that end I changed `linfa-kernel` in this way:
- Added two new enum variants to `KernelInner` to be able to take a view of sparse or dense kernel matrices;
- Renamed `Kernel` to `KernelBase` in `linfa-kernel`;
- Added types:
  - `KernelArray`: uses `ArrayBase<D, Ix2>` as records
  - `Kernel`: uses `Array2` as records
  - `KernelView`: uses `ArrayView2` as records
- Moved implementations relying only on the inner matrix to `KernelBase` and the ones relying also on the dataset to `KernelArray`;
- Implemented `new` methods for `Kernel` and `KernelView` where the first one takes ownership of the input data and both use owned inner matrices;
- Implemented `view` methods for `Kernel` and `KernelView` where the first one takes a view of the dataset and both take views of their inner matrices;
This way instead of taking a reference to the kernel one can call `kernel.view()` and not worry about data being cloned.
At this point the problem is that I would need to be able to differentiate between an SVM that has a view of data that lives as long as itself and one that owns the data itself. I don't know if that can be done with just one type; I'm trying not to complicate the code too much.
Now I still need to modify `linfa-svm` to adjust it to these changes, but I plan to do it sometime in the next weeks. One thing that bothers me is: since the `KernelInner` can contain either a view or owned data, I'm forced to specify a lifetime for the type even if I am not planning to use the view type. For example, the type `Kernel` still needs a lifetime even if it can't have a view as an inner matrix, but I don't know if that's avoidable somehow.
> I've been a little busy lately and I'm proceeding slowly so I thought I'd give a little "status-report" here, also to get a second opinion on what I've been doing.
I'm currently relocating to a different city and this completely slipped my radar, sorry for the late response.
> First part:
>
> - removed lifetime requirement on the training set for fittable models;
> - completed the implementation of `iter_fold` for fittable models;
> - tested the implementation on a mock struct implementing `Fit`;
:+1:
> Second part:
>
> Here things get a little less straightforward. I didn't want to force people using SVMs to clone their kernel every time, so I am trying to change SVMs to accept either an owned kernel or a "view" of one. To that end I changed `linfa-kernel` in this way:
>
> - Added two new enum variants to `KernelInner` to be able to take a view of sparse or dense kernel matrices;
> - Renamed `Kernel` to `KernelBase` in `linfa-kernel`;
> - Added types:
>   - `KernelArray`: uses `ArrayBase<D, Ix2>` as records
>   - `Kernel`: uses `Array2` as records
>   - `KernelView`: uses `ArrayView2` as records
> - Moved implementations relying only on the inner matrix to `KernelBase` and the ones relying also on the dataset to `KernelArray`;
> - Implemented `new` methods for `Kernel` and `KernelView` where the first one takes ownership of the input data and both use owned inner matrices;
> - Implemented `view` methods for `Kernel` and `KernelView` where the first one takes a view of the dataset and both take views of their inner matrices;
>
> This way instead of taking a reference to the kernel one can call `kernel.view()` and not worry about data being cloned.
this seems to be the straightforward way to have a view on a kernel matrix.
> At this point the problem is that I would need to be able to differentiate between an SVM that has a view of data that lives as long as itself and one that owns the data itself. I don't know if that can be done with just one type; I'm trying not to complicate the code too much.
you are here in a similar spot to the `std::borrow::Cow` implementation, because you cannot suppress the lifetime in the type signature
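To make the `Cow` analogy concrete (illustrative types only): an enum with a borrowing variant drags that lifetime into every use, even for values that are actually owned:

```rust
// An enum that may hold a view must carry the view's lifetime in its type,
// exactly like Cow; 'static is the only way to "opt out" for owned values.
enum MaybeOwned<'a, T> {
    Owned(Vec<T>),
    Borrowed(&'a [T]),
}

fn main() {
    let owned: MaybeOwned<'static, f64> = MaybeOwned::Owned(vec![1.0, 2.0]);
    let data = vec![3.0, 4.0];
    let view: MaybeOwned<'_, f64> = MaybeOwned::Borrowed(&data);
    let _ = (owned, view);
}
```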
> Now I still need to modify `linfa-svm` to adjust it to these changes, but I plan to do it sometime in the next weeks. One thing that bothers me is: since the `KernelInner` can contain either a view or owned data, I'm forced to specify a lifetime for the type even if I am not planning to use the view type. For example, the type `Kernel` still needs a lifetime even if it can't have a view as an inner matrix, but I don't know if that's avoidable somehow.
no, you cannot avoid this without introducing `SvmView` and `Svm`, but then you have to decide what to return in `SvmParams::fit`.
I think that it would be reasonable to copy only those sample points which are support vectors. The whole point of SVM is to find a few SVs which can describe the separating hyperplane, and we can easily copy those. (btw. this only applies in the non-linear case; otherwise we do not need to cache the samples and have memory complexity O(n).) The following changes would be reasonable:
- introduce a `KernelInner::Dynamic`: a placeholder where the distance function is calculated on-the-fly without caching
- add a `Kernel::select` function: this takes the indices of the sub-samples and returns a copy of the kernel with owned records and an empty (i.e. `KernelInner::Dynamic`) kernel matrix (a sketch follows below)
- store only the smaller kernel of the support vectors in `Svm` and don't borrow the whole kernel beyond the fitting process
- in `Svm::predict` the `weighted_sum` is only passed the actual scaling factors of the SV solution (we could also think about not storing the kernel at all for the linear case, but we would probably rename `Kernel::select` then)
hope this is plausible, I think we should not make the API more complex than it already is
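A rough sketch of the record-copying part a `Kernel::select` could do, assuming ndarray types (an illustration, not linfa code):

```rust
use ndarray::{Array2, Axis};

/// Copy only the rows belonging to the support vectors into a new owned array;
/// ndarray's `select` allocates a fresh Array2 containing just those rows.
fn select_rows(records: &Array2<f64>, support_idx: &[usize]) -> Array2<f64> {
    records.select(Axis(0), support_idx)
}
```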
> you are here in a similar spot to the `std::borrow::Cow` implementation, because you cannot suppress the lifetime in the type signature
One way I've found is to define a trait `Inner` with all methods concerning only inner matrices and implement it for `Array2`, `ArrayView2`, `CsMat`, and `CsMatView`. Then I define an enum:
```rust
pub enum KernelInnerBase<K1: Inner, K2: Inner> {
    Dense(K1),
    Sparse(K2),
}
```
Now the definition of kernel becomes:
```rust
pub struct Kernel<R: Records, K1: Inner, K2: Inner> {
    // ...
    inner: KernelInnerBase<K1, K2>,
    // ...
}
```
So I can define `Kernel<F>` as `Kernel<Array2<F>, Array2<F>, CsMat<F>>`. Then the naming gets a little confusing, because I think both a `KernelArrayView` (`Kernel<ArrayView2<'a, F>, Array2<F>, CsMat<F>>`) and a `KernelView` (`Kernel<ArrayView2<'a, F>, ArrayView2<'a, F>, CsMatView<'a, F>>`) would be useful.
With this kind of implementation, all methods like `diag` and such become just calls to the corresponding methods of either `K1` or `K2`, depending on whether the inner matrix is sparse or not.
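A minimal sketch of that forwarding, with an illustrative `Inner` trait reduced to a single method (the real trait would cover `diag` and friends for `Array2`, `ArrayView2`, `CsMat` and `CsMatView`):

```rust
/// Illustrative trait over inner kernel matrices (assumed, not linfa's definition).
pub trait Inner {
    fn diag(&self) -> Vec<f64>;
}

pub enum KernelInnerBase<K1: Inner, K2: Inner> {
    Dense(K1),
    Sparse(K2),
}

impl<K1: Inner, K2: Inner> KernelInnerBase<K1, K2> {
    /// Every matrix method just forwards to whichever variant is present.
    pub fn diag(&self) -> Vec<f64> {
        match self {
            KernelInnerBase::Dense(m) => m.diag(),
            KernelInnerBase::Sparse(m) => m.diag(),
        }
    }
}
```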
I agree that the API should stay as simple as possible. Right now I'm deep into studying/taking my finals so I'll slowly get to the points you made about SVM in the next weeks. 👍🏻
What if in SVMs the kernel was specified as a parameter? I mean something like this:
`Svm::params().with_kernel(Kernel::params()).fit(&dataset)`

That way `fit` would take the original dataset instead of the transformed kernel and inside `fit` one could do something like this:

```rust
let kernel_matrix = self.kernel.transform(dataset);
// ...
svm.kernel = kernel_matrix;
svm.dataset = dataset;
// ...
```

without the need to store `dataset` inside the kernel. Then `Svm` could be improved to pick the support vectors from `dataset` and clone only them instead of the whole set.
> What if in SVMs the kernel was specified as a parameter? I mean something like this:
>
> `Svm::params().with_kernel(Kernel::params()).fit(&dataset)`
this should be called `with_kernel_params` or take the actual kernel matrix; alternatively we could add separate functions for the different kernel functions, like `.with_gaussian(eps)` etc.
> That way `fit` would take the original dataset instead of the transformed kernel and inside `fit` one could do something like this:
>
> ```rust
> let kernel_matrix = self.kernel.transform(dataset);
> // ...
> svm.kernel = kernel_matrix;
> svm.dataset = dataset;
> // ...
> ```
>
> without the need to store `dataset` inside the kernel. Then `Svm` could be improved to pick the support vectors from `dataset` and clone only them instead of the whole set.
sounds like a good solution :+1:
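A toy sketch of the parameter flow being agreed on here (every name is illustrative; linfa's actual params/fit machinery differs):

```rust
// Kernel parameters travel inside the SVM parameters; `fit` receives the raw
// dataset and computes the kernel matrix internally.
struct KernelParams {
    gaussian_eps: f64,
}

struct SvmParams {
    kernel: Option<KernelParams>,
}

struct Svm {
    // stand-in for the reduced kernel of the support vectors
    kernel_matrix: Vec<f64>,
}

impl SvmParams {
    fn with_kernel(mut self, params: KernelParams) -> Self {
        self.kernel = Some(params);
        self
    }

    fn fit(self, dataset: &[f64]) -> Svm {
        let params = self.kernel.expect("kernel parameters required");
        // a real implementation would evaluate the kernel function here and
        // keep only the support-vector rows; we just copy the data as a stub
        let _eps = params.gaussian_eps;
        Svm { kernel_matrix: dataset.to_vec() }
    }
}

fn main() {
    let dataset = vec![1.0, 2.0, 3.0];
    let model = SvmParams { kernel: None }
        .with_kernel(KernelParams { gaussian_eps: 0.1 })
        .fit(&dataset);
    println!("{:?}", model.kernel_matrix);
}
```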
in Elements of Statistical Learning the training vectors are sometimes called predictors; perhaps that's a better name than records
To me, calling `dataset.records()` sounds more straightforward than `dataset.predictors()`, just from what I would generally expect from a dataset-like structure, but since this structure is focused purely on ML maybe it would make sense? I know that Understanding Machine Learning uses predictor to refer to the learned hypothesis and instances to refer to domain points, so I think that predictors could be misleading to people who learned ML through that book. Other names that I heard a lot in my uni courses are samples or examples, but I don't know how often any of those terms are actually used in the literature.
Good point! ESL uses predictors as a synonym for training data in the Least Angle Regression section, and it should only be used for those features which are used in the prediction. If you have selected a subset to decrease the variance of your model, then only those are the predictors, as opposed to records. I don't like samples or examples because they can refer to the records as well as the targets. I think we should stick to records.
I agree, even if it may not be the best possible name I think that it is general enough to at least avoid conflicts 👍🏻
I've run into a problem when trying to implement the proposed changes to the regression metrics traits.
Since both `SingleTargetRegression` and `MultiTargetRegression` provide methods with the same name, as soon as I implement both traits for a certain type I get errors, because there are multiple methods available with the same name, even if they apply to different compared targets.
I don't know if there is a nice way to solve this; one solution is to keep defaulting to multi-target as it happens now, another is to make the API a little uglier and rename the methods to `single_target_metric` and `multi_target_metric`.
If I use associated types it gets even worse, because I can only provide one implementation (and so only one option for the associated type) for each `AsTargets` type, since I can't take the associated type as a template parameter in the trait definition.
I think that the problem can't be overcome in Rust, since there is no concept of function overloading.
This is what I've tried so far:
Case 1: no associated types
```rust
pub trait SingleTargetRegression<F, T>: AsTargets<F> {
    fn max_error(&self, compare_to: &T) -> ... { ... }
}
pub trait MultiTargetRegression<F, T>: AsTargets<F> {
    fn max_error(&self, compare_to: &T) -> ... { ... }
}

impl<F, T1> SingleTargetRegression<F, T1> for SomeType {}
impl<F, T2> MultiTargetRegression<F, T2> for SomeType {}

// error: multiple methods with the same name
some_value.max_error(...)
```
This means that I can't have `Array2` use single-target metrics when compared to an `Array1` and multi-target metrics when compared to an `Array2`.
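A self-contained demonstration of the collision with toy traits (not the real linfa ones); fully-qualified syntax can disambiguate, but every call site then needs the verbose form:

```rust
trait SingleTargetRegression {
    fn max_error(&self) -> f64 { 1.0 }
}
trait MultiTargetRegression {
    fn max_error(&self) -> f64 { 2.0 }
}

struct Targets;
impl SingleTargetRegression for Targets {}
impl MultiTargetRegression for Targets {}

fn main() {
    let t = Targets;
    // t.max_error(); // error[E0034]: multiple applicable items in scope
    println!("{}", SingleTargetRegression::max_error(&t));
    println!("{}", <Targets as MultiTargetRegression>::max_error(&t));
}
```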
Case 2: Associated types
Defining the type in the trait definition:

```rust
pub trait SingleTargetRegression<F, T>: AsTargets<F> {
    // error: associated type defaults are unstable
    type Tar = T;
    fn max_error(&self, compare_to: &T) -> ... { ... }
}
```
Defining the type in the impl:

```rust
pub trait SingleTargetRegression<F>: AsTargets<F> {
    type Tar: AsTargets<F>;
    fn max_error(&self, compare_to: &Self::Tar) -> ... { ... }
}

// error: unconstrained type parameter
impl<F, T2> SingleTargetRegression<F> for SomeType {
    type Tar = T2;
}
```
Implementing twice:

```rust
pub trait SingleTargetRegression<F>: AsTargets<F> {
    type Tar: AsTargets<F>;
    fn max_error(&self, compare_to: &Self::Tar) -> ... { ... }
}

impl<F> SingleTargetRegression<F> for SomeType {
    type Tar = BaseType;
}

// error: conflicting implementation
impl<F> SingleTargetRegression<F> for SomeType {
    type Tar = BaseType2;
}
```
mh, I wasn't aware of the `Array2`/`Array1` implementation, but yeah, I think it is not possible with the current trait system (though I'm absolutely not an expert there). The only workaround I see is to implement `MultiTargetRegression` like this:
```rust
impl<F: Float, D: Data<Elem = F>> MultiTargetRegression<F, ArrayBase<D, Ix1>> for ArrayBase<D, Ix2> {}
```

and hope that most people are using `predict` in combination with this trait.
Yes, it is implemented like that, just allowing the two arrays to have different data; the only issue is that `Array2` can only implement multi-target metrics.
(In this comment and in the ones before I am using `Array1`/`Array2` loosely, to indicate both owned arrays and views.)
a comment under the Reddit post mentioned polars, and I'm now thinking about adding a dataframe type:
```rust
pub type Dataframe<T> = DatasetBase<polars::DataFrame, polars::ChunkedArray<T>>;
```
observe the missing type parameter on `polars::DataFrame`; this would allow us to represent generic records with categorical/continuous features mixed. For example, naive Bayes could implement `Fit` for a `Dataframe` as well and handle the categorical features with multinomial distributions
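To illustrate the kind of mixed records this enables (a toy sketch, not polars' actual representation): each column carries its own feature type, and an algorithm like naive Bayes could branch on the variant per feature.

```rust
// A dataframe-like structure is not parametrized by a single element type,
// so columns can be continuous or categorical side by side.
enum Column {
    Continuous(Vec<f64>),
    Categorical(Vec<u32>), // category ids
}

struct MixedRecords {
    columns: Vec<Column>,
}

fn main() {
    let records = MixedRecords {
        columns: vec![
            Column::Continuous(vec![0.3, 1.7, 2.2]),
            Column::Categorical(vec![0, 1, 1]),
        ],
    };
    // e.g. naive Bayes: gaussian likelihood for Continuous,
    // multinomial likelihood for Categorical
    let _ = records.columns.len();
}
```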
Wow, I didn't know linfa was being posted on Reddit 🚀 I had never thought of categorical features; we would still need to convert categorical features into continuous ones before being able to use them in most algorithms, right? Or would this be something to be used only in some specific algorithms, like the naive Bayes you mentioned?
the latter, for example it makes no sense for ICA to support categorical features, but we could extend the scope of some algorithms with categorical types
I guess that dataframes would also be useful for count vectorizers right?
you can write a transformer `f(Array<String>) -> Array2` without dataframes, but they would make handling strings easier
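As a bare-bones illustration of such a transformer with plain ndarray types (names and whitespace tokenization are simplistic assumptions):

```rust
use ndarray::{Array1, Array2};
use std::collections::HashMap;

/// Count occurrences of each vocabulary token per document:
/// rows are documents, columns are vocabulary entries.
fn count_vectorize(docs: &Array1<String>, vocab: &[&str]) -> Array2<usize> {
    let index: HashMap<&str, usize> =
        vocab.iter().enumerate().map(|(i, w)| (*w, i)).collect();
    let mut counts = Array2::zeros((docs.len(), vocab.len()));
    for (row, doc) in docs.iter().enumerate() {
        for token in doc.split_whitespace() {
            if let Some(&col) = index.get(token) {
                counts[(row, col)] += 1;
            }
        }
    }
    counts
}
```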
I think that `.with_feature_names` can be given a vector with more names than there are features.