PipelineDP
sklearn `Pipeline` support and examples
Feature Description
DP is valuable in many "traditional" machine learning pipelines, and sklearn is the largest "traditional" ML ecosystem in Python. Would examples or first-class support for scikit-learn `Pipeline` workflows be worth contributing? We (@licensio) would be happy to contribute this via PR.
Is your feature request related to a problem?
The "framework-free" examples could easily be adapted to sklearn workflows, but substantially more concise usage would be possible with proper `sklearn.pipeline.Pipeline` support.
What alternatives have you considered?
As discussed above, sklearn users could adapt the framework-agnostic examples.
Additional Context
N/A
Thanks for the suggestion, Michael! It sounds interesting. We're open to adding native support for different APIs (though having an example is a good start). Better integration with the Python ecosystem is on our roadmap.
Let's first understand what it might look like. I'm not familiar with scikit-learn `Pipeline` workflows (I've only quickly checked the documentation). Could you please explain your ideas for an example that uses PipelineDP with a scikit-learn `Pipeline`?
Let me work up a few options. I think there might be two distinct use cases - one for unsupervised workflows (e.g., clustering) and one for supervised workflows (e.g., regression).
In the meantime, here are a few more references that might be helpful if you are curious:
- https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/feature_extraction/text.py#L557
- https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_function_transformer.py#L19
- https://scikit-learn.org/stable/modules/preprocessing.html
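For anyone unfamiliar with the sklearn side, here is a minimal sketch of what a custom `Pipeline` step looks like. It does not use PipelineDP at all: `NoisyMeanCenterer` is a hypothetical transformer written only for illustration, and the Laplace noise comes straight from NumPy. It assumes each row is one privacy unit, the row count is public, and the clipping bounds are chosen without looking at the data. Only the fitted column means are noised; the downstream regression is not private, so the pipeline as a whole carries no DP guarantee.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class NoisyMeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical step: center columns on a Laplace-noised mean.

    Illustration only; this is not part of PipelineDP or scikit-learn.
    """

    def __init__(self, epsilon=1.0, lower=0.0, upper=1.0, random_state=None):
        self.epsilon = epsilon
        self.lower = lower
        self.upper = upper
        self.random_state = random_state

    def fit(self, X, y=None):
        # Clip so that every value lies in [lower, upper].
        X = np.clip(np.asarray(X, dtype=float), self.lower, self.upper)
        n, d = X.shape
        rng = np.random.default_rng(self.random_state)
        # Replacing one row moves each column mean by at most
        # (upper - lower) / n, so the L1 sensitivity of the vector of
        # d means is d * (upper - lower) / n (assuming n is public).
        scale = d * (self.upper - self.lower) / (n * self.epsilon)
        self.mean_ = X.mean(axis=0) + rng.laplace(0.0, scale, size=d)
        return self

    def transform(self, X):
        X = np.clip(np.asarray(X, dtype=float), self.lower, self.upper)
        return X - self.mean_


# Synthetic data: features in [0, 1], target is an exact linear function.
data_rng = np.random.default_rng(0)
X = data_rng.uniform(0.0, 1.0, size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0])

pipe = Pipeline([
    ("center", NoisyMeanCenterer(epsilon=1.0, random_state=0)),
    ("model", LinearRegression()),  # not private; shown only for the wiring
])
pipe.fit(X, y)
predictions = pipe.predict(X)
```

The point is the shape of the integration: anything exposing `fit` and `transform` can be dropped into a `Pipeline`, so a PipelineDP-backed step could presumably follow the same pattern. How the privacy budget would be shared across several such steps is left open here.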
Hey Michael:
We've had a couple of internal teams think about this. Would you be open to a 30 min chat on this topic?