[BUG] performing a custom split persists an additional feature labeled `split`

Open hopemiranda opened this issue 9 months ago • 1 comments

Issues Policy acknowledgement

  • [X] I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Databricks

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

MLflow 2.12.1

System information

  • Databricks: 1 Driver 16 GB Memory, 2 Cores Runtime 13.3.x-cpu-ml-scala2.12 r5d.large 0.45 DBU/h

Describe the problem

When using a custom split instead of the default split_ratios, the training_data, validation_data, and test_data DataFrames produced by the split step contain an extra column labeled split.
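
For illustration, the symptom can be reproduced in plain pandas without MLflow. The sketch below assumes (this is not confirmed against the MLflow source) that the split step filters the same DataFrame object it passes to the custom function, so any column the function writes into that frame would carry through to the outputs. `mutating_split_fn` and `non_mutating_split_fn` are hypothetical names used only for this comparison.

```python
import pandas as pd


def mutating_split_fn(df: pd.DataFrame) -> pd.Series:
    # Same pattern as the reported split_fn: labels are written into the
    # caller's frame, so the caller ends up with a new "split" column.
    df["split"] = "VALIDATION"
    df.loc[0:49, "split"] = "TRAINING"
    df.loc[50:99, "split"] = "TEST"
    return df["split"]


def non_mutating_split_fn(df: pd.DataFrame) -> pd.Series:
    # Labels are built in a separate Series aligned to df.index;
    # the input frame is never modified.
    labels = pd.Series("VALIDATION", index=df.index, name="split")
    labels.iloc[:50] = "TRAINING"
    labels.iloc[50:100] = "TEST"
    return labels


frame_a = pd.DataFrame({"x": range(150)})
frame_b = pd.DataFrame({"x": range(150)})

labels_a = mutating_split_fn(frame_a)
labels_b = non_mutating_split_fn(frame_b)

print(list(frame_a.columns))  # the frame passed to the mutating version
print(list(frame_b.columns))  # the frame passed to the non-mutating version
```

If that assumption holds, returning a separately built Series as in `non_mutating_split_fn` might avoid the extra column without any post-processing of the parquet files.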

Tracking information

REPLACE_ME

Code to reproduce issue

# in split.py

import pandas as pd
from pandas import DataFrame, Series

def split_fn(df: DataFrame):
    df.loc[0:50, 'split'] = 'TRAINING'
    df.loc[50:100, 'split'] = "TEST"
    df.loc[100::, 'split'] = 'VALIDATION'
    custom_series = pd.Series(df.split)

    return custom_series

## ----------------------------

# in recipe.yaml
  split:
    using: "custom"
    split_method: split_fn

## ----------------------------

# checking the outputs
from mlflow.recipes import Recipe
r = Recipe(profile="databricks")
r.get_artifact("training_data") ## this will show the added column

## ----------------------------

# workaround solution in databricks notebook between steps `split` and `transform`

import mlflow
import pandas as pd
from mlflow.recipes.utils import (
    get_recipe_config,
    get_recipe_name,
    get_recipe_root_path,
)
from mlflow.recipes.utils.execution import get_step_output_path

_OUTPUT_TRAIN_FILE_NAME = "train.parquet"
_OUTPUT_VALIDATION_FILE_NAME = "validation.parquet"
_OUTPUT_TEST_FILE_NAME = "test.parquet"

training_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_TRAIN_FILE_NAME)
validation_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_VALIDATION_FILE_NAME)
test_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_TEST_FILE_NAME)

train_df = pd.read_parquet(training_path)
validation_df = pd.read_parquet(validation_path)
test_df = pd.read_parquet(test_path)

train_df = train_df.drop(columns=["split"])
validation_df = validation_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])

train_df.to_parquet(training_path)
validation_df.to_parquet(validation_path)
test_df.to_parquet(test_path)

Stack trace

REPLACE_ME

Other info / logs

# potential solution?
# drop split after it gets loaded in lines 
#https://github.com/mlflow/mlflow/blob/5cdae7c4321015620032d02a3b84fb6127247392/mlflow/recipes/steps/split.py#L353-L358
# by adding
train_df = train_df.drop(columns=["split"])
validation_df = validation_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])
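
One caveat with the drop calls above: `DataFrame.drop` raises a `KeyError` when the column is missing, which would presumably be the case when the default split_ratios path is used. A minimal sketch of a more tolerant variant follows; `drop_split_column` and `SPLIT_COLUMN` are hypothetical names, not part of MLflow.

```python
import pandas as pd

SPLIT_COLUMN = "split"  # column name taken from this report


def drop_split_column(df: pd.DataFrame, column: str = SPLIT_COLUMN) -> pd.DataFrame:
    # errors="ignore" turns the drop into a no-op when the column is absent,
    # so the same call is safe for both custom and ratio-based splits.
    return df.drop(columns=[column], errors="ignore")


with_split = pd.DataFrame({"x": [1, 2], "split": ["TRAINING", "TEST"]})
without_split = pd.DataFrame({"x": [1, 2]})

cleaned_a = drop_split_column(with_split)
cleaned_b = drop_split_column(without_split)
```

Note that this would also remove a legitimate input feature that happens to be named `split`, so a real fix would likely need to check whether the column existed before the custom function ran.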

What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [X] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

hopemiranda, May 07 '24 23:05

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

github-actions[bot], May 15 '24 00:05