[BUG] performing a custom split persists an additional feature labeled `split`
Issues Policy acknowledgement
- [X] I have read and agree to submit bug reports in accordance with the issues policy
Where did you encounter this bug?
Databricks
Willingness to contribute
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
MLflow version
MLflow 2.12.1
System information
- Databricks: 1 Driver 16 GB Memory, 2 Cores Runtime 13.3.x-cpu-ml-scala2.12 r5d.large 0.45 DBU/h
Describe the problem
When using a custom split instead of the default `split_ratios`, the output DataFrames `training_data`, `validation_data`, and `test_data` of the `split` step contain an extra column labeled `split`.
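The symptom can be seen with pandas alone, outside MLflow. The sketch below uses the same labeling pattern as the `split_fn` in the reproduction code and shows that assigning through `df.loc[..., 'split']` adds the column to the caller's DataFrame in place. Whether MLflow then writes that same mutated DataFrame to the step outputs is an assumption that has not been verified against the MLflow source; the `data` frame and its `feature` column are made up for illustration.

```python
import pandas as pd


def split_fn(df: pd.DataFrame) -> pd.Series:
    # Same labeling pattern as the reproduction: each assignment writes
    # into a "split" column on the DataFrame that was passed in.
    df.loc[0:50, "split"] = "TRAINING"
    df.loc[50:100, "split"] = "TEST"
    df.loc[100:, "split"] = "VALIDATION"
    return pd.Series(df["split"])


# Hypothetical input data, used only for this illustration.
data = pd.DataFrame({"feature": range(150)})

columns_before = list(data.columns)
labels = split_fn(data)
columns_after = list(data.columns)

# The caller's DataFrame now carries the extra column.
print(columns_before)
print(columns_after)
```

If this is what happens inside the `split` step, the extra column would originate in the user-supplied function rather than in the step's own bookkeeping.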
Tracking information
REPLACE_ME
Code to reproduce issue
# in split.py
import pandas as pd
from pandas import DataFrame, Series
def split_fn(df: DataFrame):
    df.loc[0:50, 'split'] = 'TRAINING'
    df.loc[50:100, 'split'] = "TEST"
    df.loc[100::, 'split'] = 'VALIDATION'
    custom_series = pd.Series(df.split)
    return custom_series
## ----------------------------
# in recipe.yaml
split:
  using: "custom"
  split_method: split_fn
## ----------------------------
# checking the outputs
from mlflow.recipes import Recipe
r = Recipe(profile="databricks")
r.get_artifact("training_data") ## this will show the added column
## ----------------------------
# workaround solution in databricks notebook between steps `split` and `transform`
import mlflow
import pandas as pd
from mlflow.recipes.utils import (
    get_recipe_config,
    get_recipe_name,
    get_recipe_root_path,
)
from mlflow.recipes.utils.execution import get_step_output_path
_OUTPUT_TRAIN_FILE_NAME = "train.parquet"
_OUTPUT_VALIDATION_FILE_NAME = "validation.parquet"
_OUTPUT_TEST_FILE_NAME = "test.parquet"
training_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_TRAIN_FILE_NAME)
validation_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_VALIDATION_FILE_NAME)
test_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_TEST_FILE_NAME)
train_df = pd.read_parquet(training_path)
validation_df = pd.read_parquet(validation_path)
test_df = pd.read_parquet(test_path)
train_df = train_df.drop(columns=["split"])
validation_df = validation_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])
train_df.to_parquet(training_path)
validation_df.to_parquet(validation_path)
test_df.to_parquet(test_path)
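A possible alternative to rewriting the parquet files is a split function that never touches the input DataFrame. The sketch below is untested against MLflow Recipes and rests on two assumptions: that the extra column comes from the in-place assignment in `split_fn`, and that the input has a default `RangeIndex`, so positional slicing gives the same effective row ranges as the label-based `.loc` calls above (rows 0-49 training, 50-99 test, the rest validation).

```python
import pandas as pd


def split_fn(df: pd.DataFrame) -> pd.Series:
    # Build the labels in a separate Series aligned to df's index,
    # leaving df itself unchanged.
    labels = pd.Series("VALIDATION", index=df.index)
    labels.iloc[:50] = "TRAINING"
    labels.iloc[50:100] = "TEST"
    return labels


# Hypothetical input data, used only for this illustration.
data = pd.DataFrame({"feature": range(150)})
labels = split_fn(data)

# The input keeps its original columns.
print(list(data.columns))
print(labels.value_counts().to_dict())
```

The label strings are the same ones used in the original function, so the `recipe.yaml` entry would not need to change.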
Stack trace
REPLACE_ME
Other info / logs
# Potential solution: drop the `split` column after the DataFrames are loaded at
# https://github.com/mlflow/mlflow/blob/5cdae7c4321015620032d02a3b84fb6127247392/mlflow/recipes/steps/split.py#L353-L358
# by adding:
train_df = train_df.drop(columns=["split"])
validation_df = validation_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])
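The three `drop` calls above could be wrapped in one small helper. This is only a sketch of the proposed idea in isolation, not a patch against MLflow: the name `drop_split_column` is hypothetical, and `errors="ignore"` is used so the call is a no-op when the column is absent. One caveat is that a dataset with a legitimate feature named `split` would lose it, so a real fix would probably need to distinguish the two cases.

```python
import pandas as pd


def drop_split_column(df: pd.DataFrame, column: str = "split") -> pd.DataFrame:
    # Hypothetical helper: return a copy of df without the bookkeeping column.
    # errors="ignore" keeps this safe when the column is not present.
    return df.drop(columns=[column], errors="ignore")


# Hypothetical frames standing in for the split step's outputs.
with_split = pd.DataFrame({"feature": [1, 2], "split": ["TRAINING", "TEST"]})
without_split = pd.DataFrame({"feature": [3, 4]})

cleaned = drop_split_column(with_split)
unchanged = drop_split_column(without_split)

print(list(cleaned.columns))
print(list(unchanged.columns))
```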
What component(s) does this bug affect?
- [ ] `area/artifacts`: Artifact stores and artifact logging
- [ ] `area/build`: Build and test infrastructure for MLflow
- [ ] `area/deployments`: MLflow Deployments client APIs, server, and third-party Deployments integrations
- [ ] `area/docs`: MLflow documentation pages
- [ ] `area/examples`: Example code
- [ ] `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] `area/models`: MLmodel format, model serialization/deserialization, flavors
- [X] `area/recipes`: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] `area/projects`: MLproject format, project running backends
- [ ] `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- [ ] `area/server-infra`: MLflow Tracking server backend
- [ ] `area/tracking`: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] `area/windows`: Windows support
What language(s) does this bug affect?
- [ ] `language/r`: R APIs and clients
- [ ] `language/java`: Java APIs and clients
- [ ] `language/new`: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] `integrations/azure`: Azure and Azure ML integrations
- [ ] `integrations/sagemaker`: SageMaker integrations
- [ ] `integrations/databricks`: Databricks integrations
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.