[BUG](Metric) Plots in the UI only show the latest data point

Open pks opened this issue 10 months ago • 6 comments

Issues Policy acknowledgement

  • [X] I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Local machine

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

  • Client: 2.12.1
  • Tracking server: 2.11.3

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 12 (bitnami mlflow helm chart)
  • Python version: 3.11
  • yarn version, if running the dev UI: n/a

Describe the problem

Although my training run (NeMo) logged several values for a metric (val_sacreBLEU/avg in my case), the UI only shows the latest data point, either as a block diagram ("Model metrics" view) or as a chart with a single point ("Charts" view):

[Screenshots: 2024-04-18 at 11:10:37 and 2024-04-18 at 11:11:20]

Querying MLflow directly correctly returns the two data points logged so far:

> cat test.py
from mlflow.tracking import MlflowClient

def export_run_metrics_to_csv(run_id, csv_output_path):
    client = MlflowClient(tracking_uri="https://mlflow.lilt.dev")
    metrics = client.get_run(run_id).data.metrics
    with open(csv_output_path, 'w') as f:
        f.write('metric_name, value\n')
        for key, value in metrics.items():
            print(f"{key = } {value = }")
            f.write(f'{key}, {value}\n')

client = MlflowClient(tracking_uri="https://mlflow.lilt.dev")
metric_history = client.get_metric_history("1e30704d13d94595ba30e8c7205d5ab1", "val_sacreBLEU/avg")
values = [m.value for m in metric_history]
timestamps = [m.timestamp for m in metric_history]
print(f"{list(zip(timestamps, values))=}")

> python test.py
list(zip(timestamps, values))=[(1713390106404, 34.32115879673068), (1713420121935, 37.03728539282564)]
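
The difference between the two client calls in the script can be sketched without a tracking server: `get_run(...).data.metrics` holds only the latest value per metric key, while `get_metric_history` returns every logged point. The sketch below is a minimal illustration, not part of the original report. The helper names `history_to_points` and `latest_value` are hypothetical, the stand-in objects only mimic the `.step`, `.timestamp`, and `.value` attributes of MLflow metric entries, and the step numbers are assumed (only timestamps and values appear in the output above).

```python
from types import SimpleNamespace


def history_to_points(metric_history):
    """Return (step, timestamp, value) tuples sorted by step, then timestamp.

    `metric_history` is any iterable of objects exposing `.step`,
    `.timestamp`, and `.value` (hypothetical helper, not an MLflow API).
    """
    return sorted((m.step, m.timestamp, m.value) for m in metric_history)


def latest_value(metric_history):
    """Return the value of the most recent point, or None if there are none."""
    points = history_to_points(metric_history)
    return points[-1][2] if points else None


# Stand-in objects so the sketch runs without a tracking server.
# Timestamps and values come from the output above; step numbers are assumed.
demo_history = [
    SimpleNamespace(step=1, timestamp=1713420121935, value=37.03728539282564),
    SimpleNamespace(step=0, timestamp=1713390106404, value=34.32115879673068),
]

points = history_to_points(demo_history)
print(f"{len(points)} points in history, latest value {latest_value(demo_history)}")
```

With a real server, the list returned by `client.get_metric_history(run_id, key)` could be passed to `history_to_points` directly. If it yields more than one point while the UI draws only one, the data is stored correctly and the problem lies in how the chart is rendered or sampled.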

Tracking information

System information: Linux #19~22.04.2-Ubuntu SMP Thu Mar 21 16:45:46 UTC 2024
Python version: 3.10.12
MLflow version: 2.12.1
MLflow module location: /home/pks/<redacted>/.venv/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: file:///home/pks/<redacted>/mlruns
Registry URI: file:///home/pks/<redacted>/mlruns
MLflow dependencies:
  Flask: 3.0.2
  Jinja2: 3.1.3
  aiohttp: 3.9.3
  alembic: 1.13.1
  boto3: 1.34.40
  botocore: 1.34.40
  click: 8.1.7
  cloudpickle: 3.0.0
  docker: 7.0.0
  entrypoints: 0.4
  gitpython: 3.1.41
  google-cloud-storage: 2.14.0
  graphene: 3.3
  gunicorn: 21.2.0
  importlib-metadata: 7.1.0
  kubernetes: 18.20.0
  markdown: 3.5.2
  matplotlib: 3.8.2
  numpy: 1.23.5
  packaging: 23.2
  pandas: 2.2.0
  protobuf: 4.25.2
  pyarrow: 15.0.0
  pydantic: 1.10.14
  pytz: 2024.1
  pyyaml: 6.0.1
  querystring-parser: 1.2.4
  requests: 2.31.0
  scikit-learn: 1.4.0
  scipy: 1.12.0
  sqlalchemy: 2.0.29
  sqlparse: 0.5.0

What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [X] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

pks • Apr 18 '24 09:04

Could you share the code used for logging the metrics?

daniellok-db • Apr 18 '24 23:04

We're using NVIDIA's NeMo, specifically the MTEncDec model. The logging call is this:

https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py#L435

pks • Apr 19 '24 07:04

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

github-actions[bot] • Apr 26 '24 00:04

Hi, I am having the same issue with MLflow tracking server and client 2.11.3 (using the PyTorch Lightning MLFlowLogger, logger.log(..., on_step=True, on_epoch=True)). The issue appeared after upgrading MLflow. More precisely, the values logged at each gradient step are shown properly, but the values logged at each epoch (average train loss and validation loss) show as a single point. The logging call referred to in the link above appears to be a PyTorch Lightning logger call, so the root cause is likely the same.

cmantoux • Apr 29 '24 13:04

Does the issue still persist if you upgrade the tracking server to 2.12.1? There was a bug fix to sampling logic that should be shipped in the latest version.

daniellok-db • Apr 29 '24 14:04

We're using Bitnami's helm chart, and they haven't upgraded yet (the server side is still on 2.11.3). I will report back once they do.

pks • Apr 29 '24 14:04

This appears to be fixed with 2.12.1, which is now available. Thanks!

pks • May 06 '24 16:05
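
The resolution in this thread depends on the tracking server version, not the client version. A rough way to check whether a given server is older than the version reported as fixed here (2.12.1) is sketched below. The helpers `parse_version` and `server_predates_fix` are hypothetical and handle only plain numeric `X.Y.Z` strings. The thread does not state which release first contained the sampling fix, so 2.12.1 is used only because it is the version reported to work.

```python
def parse_version(text):
    """Turn a plain 'X.Y.Z' version string into a tuple of ints.

    Hypothetical helper: pre-release suffixes such as 'rc1' are not handled
    and would raise ValueError.
    """
    return tuple(int(part) for part in text.strip().split("."))


def server_predates_fix(server_version, fix_version="2.12.1"):
    """Return True if server_version is older than fix_version."""
    return parse_version(server_version) < parse_version(fix_version)


# Versions taken from the thread: the server was on 2.11.3 when the bug
# appeared, and the problem was gone after moving to 2.12.1.
print(server_predates_fix("2.11.3"))
print(server_predates_fix("2.12.1"))
```

The server version string has to be obtained separately, for example from the Helm chart or container image tag in use. The value of `mlflow.__version__` on the client machine describes only the client.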