Instructive tutorial and discussion of feature type inference, encoding, and missing data
We make many assumptions and have implemented quite a few edge cases, e.g. for encoding and imputation. We should extend the contributor guide by documenting these.
I suggest creating a new tutorial that focuses solely on feature type inference, encoding, and imputation, and shows what effect the different arguments have. For instance, on the diabetes_130_fairlearn dataset, or even just its first 30 samples.
There should also be a mention of what is considered "missing data" by default.
I envision a rough structure like this, which can be adjusted:
- Description of the notebook, load diabetes_130 dataset
- Infer feature types, show both the columns and the inferred feature types, and briefly explain why each column was inferred as it was. Run inference again with a different set of arguments to showcase the difference.
- Encode features, show both the columns and their encodings, and briefly explain why each column was encoded as it was. Show encoding with a different set of arguments to showcase the difference.
- Show what is considered missing data in ehrapy; discuss edge cases, such as the string "nan" vs np.nan values in the dataset
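The outlined steps could be sketched with a toy pandas frame. Note this is a minimal illustration with hypothetical column names loosely mimicking diabetes_130_fairlearn, not ehrapy's actual inference or encoding implementation; in particular it shows the "nan"-string edge case from the last bullet:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [55, 63, np.nan, 71],
    "gender": ["Female", "Male", "Female", "nan"],  # "nan" here is a *string*
    "num_medications": [13, 9, 16, 8],
})

# 1. Naive type inference: numeric dtypes -> numeric, object dtypes -> categorical.
inferred = {
    col: ("numeric" if pd.api.types.is_numeric_dtype(df[col]) else "categorical")
    for col in df.columns
}

# 2. One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["gender"])

# 3. Missingness: pd.isna detects np.nan but NOT the string "nan",
#    so the string silently survives as a regular category.
n_missing_age = df["age"].isna().sum()          # 1 (the np.nan)
n_missing_gender = df["gender"].isna().sum()    # 0 -- "nan" string not flagged
n_nan_strings = (df["gender"] == "nan").sum()   # 1 -- found only by string match
```

This is exactly why the tutorial should spell out the default missingness rules: a string "nan" imported from a CSV behaves completely differently from a true np.nan.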
The supported data types and their limitations should also be covered, gathering the information from #713 in one place:
Description of feature
The discussion on whether we want to distinguish between ordinal and nominal categorical features in ehrapy was raised while calculating feature correlations as part of the new bias detection method (PR https://github.com/theislab/ehrapy/pull/690).
As of now, feature correlations would be the only application in ehrapy that needs the differentiation between nominal and ordinal features. Since detecting this difference automatically is nearly impossible once the data are encoded, we would add considerable effort for the user, who would have to manually declare which features are ordinal and which are nominal just to compute the feature correlations with the optimal method (e.g. Spearman CC vs. Cramér's V). Additionally, computing Spearman/Pearson CC for all features won't show correlations that aren't there; it merely risks not revealing some correlations between categorical features, which should then be detected by the feature importance calculation.
Hence, we decided to stick with Pearson/Spearman CC for all features for now. If differentiating ordinal from nominal categorical features becomes important elsewhere in ehrapy in the future, it would be easy to adapt the bias detection method accordingly.
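For context, the two candidate measures can be contrasted on toy data. This is a hedged sketch with made-up feature names, not the bias detection code from PR #690; Cramér's V is computed manually from the chi-squared statistic:

```python
import numpy as np
from scipy import stats

# Toy integer-coded features, as they might look after encoding.
# "severity" is ordinal (0 < 1 < 2 < 3), "ward" is nominal (no order).
rng = np.random.default_rng(0)
severity = rng.integers(0, 4, size=200)
ward = rng.integers(0, 3, size=200)

# Spearman CC uses rank order -- meaningful only for ordinal features.
rho, _ = stats.spearmanr(severity, ward)

# Cramér's V treats categories as unordered -- suited to nominal features.
table = np.zeros((4, 3))
for s, w in zip(severity, ward):
    table[s, w] += 1
chi2 = stats.chi2_contingency(table)[0]
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
```

The point of the discussion above: applying Spearman CC uniformly (as decided) never fabricates a correlation, it can only miss an association between nominal features that Cramér's V would have picked up.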
@eroell you mixed concerns here.
The original issue was about DEVELOPER documentation; now it's more about user-facing material. I don't know whether we should have a new tutorial for this, but let's keep it separate for now and then see if/how we can merge it.