Add sample naming

Open gkobeaga opened this issue 3 years ago • 3 comments

It might be useful to have named samples for debugging and exploring the data, e.g. identifying outliers.

Comments:

  • I have my doubts about the methods sample_names (and feature_names) returning a non-empty vector even when no names have been set for the samples. As it is, when the sample_names field is empty, sample_names returns vec!["sample-1", "sample-2",..,"sample-{nsamples}"]. The same happens for feature_names. Moreover, in some methods that create a new dataset by transforming the input dataset, the output dataset ends up with a non-empty sample_names field even if the input dataset's sample_names field was empty. For instance, this happens in the preprocessing methods. An alternative would be to make the feature_names and sample_names fields public (as with the weights field) and clone them only when needed.
  • I have added sample names for the example datasets (diabetes...). I wanted to implement test_retain_sample_names for the preprocessing methods. Is that ok?
  • The method with_feature_names does not check whether the given vector has length nfeatures. Should I implement that check in this PR, as I have done for with_sample_names?

gkobeaga avatar Oct 13 '22 23:10 gkobeaga

I don't think having named samples is a good idea, because there are almost always far more samples than features, and dedicating one string per sample can get out of hand very quickly. Also, sample names don't play well with methods that reorder or extract samples from the dataset (which doesn't happen with features).

I do agree about feature_names returning a non-empty vector when there are no names; it should instead return an empty list. Ideally the signature should be fn feature_names(&self) -> &[String] to prevent cloning, but this can be in a separate MR. Also, with_feature_names should definitely check the length of the input vector. The input should be Vec<String> (again, this change can be in another MR).

In #70 it mentions target names, so you could take a crack at that instead.

YuhanLiin avatar Oct 19 '22 03:10 YuhanLiin

I understand your doubts, but I think that we can resolve them.

Regarding the memory requirements, the sample names are optional. If for some use cases these might be memory-consuming, they can just be ignored. We can also set a string capacity for sample names if we anticipate performance issues in large datasets.

Regarding the reordering of the samples, I agree that it should be done carefully when sample names are present. But we already do it for the targets. Anyway, as far as I know, the reordering of samples is only done in cross-validation, and in that case we can ignore the sample names, since the intermediate datasets are not returned by the method.

Thanks for your suggestions for feature_names, I will implement them. (When you say MR, do you mean PR? Sorry, I don't know what MR means.)

gkobeaga avatar Oct 19 '22 23:10 gkobeaga

The memory usage would be too large for most real-world datasets. Capping the string capacity doesn't solve this because the problem is the number of strings, which must match the number of samples, leading to an unacceptable number of allocations. The usefulness of sample names is also limited IMO, since you can easily track down an outlier by finding it in the original dataset. Also, most datasets for ML don't come with sample names that you can just import, so users will need to go through the trouble of creating their own generic names for every sample, making it hard to use.

Sample names also cause trouble with methods that extract subsets of samples from a dataset, such as bootstrap_samples. This happens a lot more with samples than with targets, and the sample names would need to be returned alongside the subset. I don't want to add the maintenance overhead of splitting sample names correctly in all those functions.

Finally, in the scikit-learn dataset API, feature and target names are available but sample names are not. Sample names have no precedent in the most popular ML library, are not easy to use (in fact they are easy to misuse, since they'd consume too much memory on most datasets), and their use case is ultimately too niche to justify the maintenance overhead IMO.

MR is GitLab-speak for PR. I mixed it up since I'm using GitLab at my day job lol.

YuhanLiin avatar Oct 21 '22 04:10 YuhanLiin