
[BUG] evaluator_config={"average":None} not working in mlflow.evaluate() for multiclass classification

Open RRRen94 opened this issue 1 year ago • 7 comments

Issues Policy acknowledgement

  • [X] I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Local machine

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

2.9.2

System information

databricks

Describe the problem

For a multiclass classification problem, if I use evaluator_config = {"average": None} to set the averaging method used when computing classification metrics, I get a list of floats such as [0.7, 0.8, 0.6] as the metric value, one entry per class. But this metric list cannot pass the validation check, which expects each metric to be a single numerical value, not a list. https://github.com/mlflow/mlflow/blob/c43823750bffa5b6abcc086683b15a068513b67b/mlflow/utils/validation.py#L137
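
For context, a minimal sketch (not the MLflow internals) of where the mismatch comes from: with average=None, scikit-learn returns one value per class, while MLflow expects each logged metric to be a single double. The example values below are illustrative.

from sklearn.metrics import recall_score

y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 2]

# average=None yields a per-class array, e.g. array([1. , 1. , 0.5]) -- not loggable as a single metric
per_class = recall_score(y_true, y_pred, average=None)

# an averaging method such as "weighted" yields a single float that passes metric validation
weighted = recall_score(y_true, y_pred, average="weighted")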

Tracking information

No response

Code to reproduce issue

import mlflow

with mlflow.start_run() as run:
    # Evaluate the static dataset without providing a model
    result = mlflow.evaluate(
        data=eval_data,
        targets="label",
        predictions="predictions",
        model_type="classifier",
        evaluator_config={"average": None},
    )

Stack trace

Traceback (most recent call last):
...
  File "c:\TEMP\...\evaluate.py", line 80, in task
    mlflow.evaluate(
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\models\evaluation\base.py", line 1878, in evaluate
    evaluate_result = _evaluate(
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\models\evaluation\base.py", line 1120, in _evaluate
    eval_result = evaluator.evaluate(
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\models\evaluation\default_evaluator.py", line 1826, in evaluate
    evaluation_result = self._evaluate(model, is_baseline_model=False)
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\models\evaluation\default_evaluator.py", line 1744, in _evaluate
    self._log_metrics()
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\models\evaluation\default_evaluator.py", line 712, in _log_metrics
    self.client.log_batch(
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\tracking\client.py", line 1086, in log_batch
    return self._tracking_client.log_batch(
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\tracking\_tracking_service\client.py", line 459, in log_batch
    self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\store\tracking\file_store.py", line 1040, in log_batch
    _validate_batch_log_data(metrics, params, tags)
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\utils\validation.py", line 317, in _validate_batch_log_data
    _validate_metric(metric.key, metric.value, metric.timestamp, metric.step)
  File "C:\TEMP\...\.venv\lib\site-packages\mlflow\utils\validation.py", line 146, in _validate_metric
    raise MlflowException(
mlflow.exceptions.MlflowException: Got invalid value [0.78723404 0.13793103 0.08333333 0.        ] for metric 'recall_score' (timestamp=1706264841975). Please specify value as a valid double (64-bit floating point)

Other info / logs

No response

What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [X] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [X] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

RRRen94 avatar Jan 25 '24 14:01 RRRen94

@RRRen94 Could you paste the full stacktrace where it triggers _validate_metric?

serena-ruan avatar Jan 26 '24 02:01 serena-ruan

@serena-ruan Hey, the full stack trace has now been added above.

RRRen94 avatar Jan 26 '24 10:01 RRRen94

@RRRen94 I think the original design is to use "weighted" if the "average" method is not set (or is None), so to fix the bug I would set "average" to "weighted" when it's None. But for your use case, do you want to log the metric value of each class specifically? Otherwise the workaround is just to remove evaluator_config when calling .evaluate.
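
A minimal sketch of that defaulting behavior (names are illustrative, not the actual evaluator internals):

average = evaluator_config.get("average") or "weighted"  # fall back to "weighted" when unset or None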

serena-ruan avatar Jan 29 '24 03:01 serena-ruan

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

github-actions[bot] avatar Feb 02 '24 00:02 github-actions[bot]

@RRRen94 I think the original design is to use "weighted" if the "average" method is not set (or is None), so to fix the bug I would set "average" to "weighted" when it's None. But for your use case, do you want to log the metric value of each class specifically? Otherwise the workaround is just to remove evaluator_config when calling .evaluate.

Yes, I want to see the metric values for each class directly, not weighted or averaged. These per-class values can currently be found in the per_class_metrics.csv artifact. But it would be nice to also have the option to log the non-averaged values as metrics, for example by using evaluator_config={"average": None}.

RRRen94 avatar Feb 02 '24 08:02 RRRen94

Then I think the best approach would be to log each class's metric value separately, since MLflow metrics don't currently support lists.
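
A minimal sketch of that approach from the user side today (eval_data and the column/metric names follow the reproduce snippet above and are illustrative):

import mlflow
from sklearn.metrics import recall_score

# One recall value per class, in sorted label order
per_class_recall = recall_score(eval_data["label"], eval_data["predictions"], average=None)

with mlflow.start_run():
    for class_index, value in enumerate(per_class_recall):
        mlflow.log_metric(f"recall_score_class_{class_index}", float(value))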

serena-ruan avatar Feb 06 '24 07:02 serena-ruan

cc @prithvikannan WDYT?

serena-ruan avatar Feb 06 '24 07:02 serena-ruan