[FR] ability to drop columns that get used for split, but not for training
Willingness to contribute
Yes. I would be willing to contribute this feature with guidance from the MLflow community.
Proposal Summary
Add the ability to split the ingested data by groups that don't get included in the training set.
For example, have the option to use sklearn GroupShuffleSplit within the split step of the recipe, without using the split_by_feature column as a feature in the training set:
from sklearn.model_selection import GroupShuffleSplit
GroupShuffleSplit(test_size=0.2, n_splits=2, random_state=2).split(
data, groups=data[split_by_feature]
)
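A minimal sketch of the requested behavior, using plain scikit-learn and pandas rather than the recipe split step itself. The toy data and the column name group_id (standing in for split_by_feature) are assumptions made for illustration only:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data; "group_id" stands in for split_by_feature.
data = pd.DataFrame({
    "group_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "x": list(range(10)),
    "y": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
split_by_feature = "group_id"

# Split so that no group appears in both the training and the test set.
splitter = GroupShuffleSplit(test_size=0.2, n_splits=1, random_state=2)
train_idx, test_idx = next(splitter.split(data, groups=data[split_by_feature]))

train_groups = set(data.iloc[train_idx][split_by_feature])
test_groups = set(data.iloc[test_idx][split_by_feature])

# Drop the grouping column only after the split, so it never reaches training.
train = data.iloc[train_idx].drop(columns=[split_by_feature])
test = data.iloc[test_idx].drop(columns=[split_by_feature])
```

The point of the sketch is the ordering: the grouping column is needed to compute the split, and is removed before any transformation or training sees the data.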
Motivation
What is the use case for this feature?
Modeling using stratified sampling for the training and test sets
Why is this use case valuable to support for MLflow users in general?
Built-in stratified sampling would avoid the need for workarounds to use this method within the split step of recipes.
Why is this use case valuable to support for your project(s) or organization?
All of our models require stratified sampling in order to work as intended
Why is it currently difficult to achieve this use case?
As the code is now, any feature that is ingested and used in a grouped split is fed to the next step for transformation. Since transformations get registered with the model, that means the unused feature stays.
Details
No response
What component(s) does this bug affect?
- [ ] area/artifacts: Artifact stores and artifact logging
- [ ] area/build: Build and test infrastructure for MLflow
- [ ] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
- [ ] area/docs: MLflow documentation pages
- [ ] area/examples: Example code
- [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] area/models: MLmodel format, model serialization/deserialization, flavors
- [X] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] area/projects: MLproject format, project running backends
- [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- [ ] area/server-infra: MLflow Tracking server backend
- [ ] area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] area/windows: Windows support
What language(s) does this bug affect?
- [ ] language/r: R APIs and clients
- [ ] language/java: Java APIs and clients
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/azure: Azure and Azure ML integrations
- [ ] integrations/sagemaker: SageMaker integrations
- [ ] integrations/databricks: Databricks integrations
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
For this, I added a column transformer at the beginning of the "transformation" step, which drops the columns that I don't need anymore.
@e-taghizadeh your column transformer approach requires you to pass the extra columns to the registered model if you want to get predictions from it, is that right?
@bhough199 yes, exactly.
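For reference, the workaround discussed in these comments might look roughly like the sketch below. It uses plain scikit-learn rather than the recipe's own step, and the column names, toy data, and choice of LogisticRegression are assumptions made for illustration. Because the drop happens inside the fitted pipeline, the input passed for predictions here still carries the extra column, which is the limitation raised above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical data; "group_id" was only needed for the grouped split.
X = pd.DataFrame({
    "group_id": [1, 1, 2, 2, 3, 3],
    "x": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],
})
y = [0, 1, 0, 1, 0, 1]

# Drop the split-only column as the first transformation; keep everything else.
drop_split_column = ColumnTransformer(
    transformers=[("drop_group", "drop", ["group_id"])],
    remainder="passthrough",
)

model = Pipeline([
    ("transform", drop_split_column),
    ("clf", LogisticRegression()),
])
model.fit(X, y)

# The estimator itself only ever sees the remaining feature.
n_features_seen = model.named_steps["clf"].n_features_in_

# Predictions are requested with the same columns the pipeline was fitted on,
# including the column that the transformer then discards.
predictions = model.predict(X)
```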