[BUG] Metric plots in the UI only show the latest data point
Issues Policy acknowledgement
- [X] I have read and agree to submit bug reports in accordance with the issues policy
Where did you encounter this bug?
Local machine
Willingness to contribute
No. I cannot contribute a bug fix at this time.
MLflow version
- Client: 2.12.1
- Tracking server: 2.11.3
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 12 (bitnami mlflow helm chart)
- Python version: 3.11
- yarn version, if running the dev UI: n/a
Describe the problem
Although my training run (NeMo) logged several values for a metric during the run (val_sacreBLEU/avg in my case), the UI only shows the latest data point, either as a block diagram ("Model metrics" view) or as a chart with a single point ("Charts" view).

Querying MLflow directly correctly returns the two data points logged so far:
```python
# test.py
from mlflow.tracking import MlflowClient

def export_run_metrics_to_csv(run_id, csv_output_path):
    client = MlflowClient(tracking_uri="https://mlflow.lilt.dev")
    metrics = client.get_run(run_id).data.metrics
    with open(csv_output_path, 'w') as f:
        f.write('metric_name, value\n')
        for key, value in metrics.items():
            print(f"{key = } {value = }")
            f.write(f'{key}, {value}\n')

client = MlflowClient(tracking_uri="https://mlflow.lilt.dev")
metric_history = client.get_metric_history("1e30704d13d94595ba30e8c7205d5ab1", "val_sacreBLEU/avg")
values = [m.value for m in metric_history]
timestamps = [m.timestamp for m in metric_history]
print(f"{list(zip(timestamps, values))=}")
```

```
> python test.py
list(zip(timestamps, values))=[(1713390106404, 34.32115879673068), (1713420121935, 37.03728539282564)]
```
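For readability, the epoch-millisecond timestamps in that output can be converted with the standard library alone (the two literals below are copied from the `get_metric_history` output above); they confirm two distinct points roughly eight hours apart, both of which should appear in the chart:

```python
from datetime import datetime, timezone

# (timestamp_ms, value) pairs copied from the get_metric_history output above
history = [
    (1713390106404, 34.32115879673068),
    (1713420121935, 37.03728539282564),
]

for ts_ms, value in history:
    # MLflow metric timestamps are milliseconds since the Unix epoch
    when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    print(f"{when:%Y-%m-%d %H:%M:%S} UTC  val_sacreBLEU/avg = {value:.2f}")
```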
Tracking information
System information: Linux #19~22.04.2-Ubuntu SMP Thu Mar 21 16:45:46 UTC 2024
Python version: 3.10.12
MLflow version: 2.12.1
MLflow module location: /home/pks/<redacted>/.venv/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: file:///home/pks/<redacted>/mlruns
Registry URI: file:///home/pks/<redacted>/mlruns
MLflow dependencies:
Flask: 3.0.2
Jinja2: 3.1.3
aiohttp: 3.9.3
alembic: 1.13.1
boto3: 1.34.40
botocore: 1.34.40
click: 8.1.7
cloudpickle: 3.0.0
docker: 7.0.0
entrypoints: 0.4
gitpython: 3.1.41
google-cloud-storage: 2.14.0
graphene: 3.3
gunicorn: 21.2.0
importlib-metadata: 7.1.0
kubernetes: 18.20.0
markdown: 3.5.2
matplotlib: 3.8.2
numpy: 1.23.5
packaging: 23.2
pandas: 2.2.0
protobuf: 4.25.2
pyarrow: 15.0.0
pydantic: 1.10.14
pytz: 2024.1
pyyaml: 6.0.1
querystring-parser: 1.2.4
requests: 2.31.0
scikit-learn: 1.4.0
scipy: 1.12.0
sqlalchemy: 2.0.29
sqlparse: 0.5.0
What component(s) does this bug affect?
- [ ] `area/artifacts`: Artifact stores and artifact logging
- [ ] `area/build`: Build and test infrastructure for MLflow
- [ ] `area/deployments`: MLflow Deployments client APIs, server, and third-party Deployments integrations
- [ ] `area/docs`: MLflow documentation pages
- [ ] `area/examples`: Example code
- [ ] `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] `area/models`: MLmodel format, model serialization/deserialization, flavors
- [ ] `area/recipes`: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] `area/projects`: MLproject format, project running backends
- [ ] `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- [ ] `area/server-infra`: MLflow Tracking server backend
- [ ] `area/tracking`: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [X] `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] `area/windows`: Windows support
What language(s) does this bug affect?
- [ ] `language/r`: R APIs and clients
- [ ] `language/java`: Java APIs and clients
- [ ] `language/new`: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] `integrations/azure`: Azure and Azure ML integrations
- [ ] `integrations/sagemaker`: SageMaker integrations
- [ ] `integrations/databricks`: Databricks integrations
Could you share the code used for logging the metrics?
We're using NVIDIA's NeMo, specifically the MTEncDec model. The logging call is this:
https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py#L435
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
Hi, I am having the same issue with MLflow tracking server and client 2.11.3 (using the PyTorch Lightning MLFlowLogger with `logger.log(..., on_step=True, on_epoch=True)`). The issue appeared after upgrading MLflow. More precisely, the values logged at each gradient step are shown properly, but the values logged at each epoch (average train loss and validation loss) show as a single point.

The logging call referred to in the link above is also a PyTorch Lightning logger call, so the root cause is likely the same.
Does the issue still persist if you upgrade the tracking server to 2.12.1? There was a bug fix to the metric-sampling logic that ships in the latest version.
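For context on the class of bug being described: the chart view downsamples long metric histories before plotting them. Below is a pure-Python sketch of what correct evenly-spaced downsampling looks like (illustrative only; the `downsample` helper is hypothetical, not MLflow's actual implementation). A fault in this kind of logic can easily collapse a short series to a single visible point:

```python
def downsample(points, max_points=100):
    """Keep at most max_points samples, evenly spaced,
    always retaining the first and last point."""
    if len(points) <= max_points:
        # Short histories (like the two-point val_sacreBLEU/avg series
        # above) should pass through untouched.
        return list(points)
    step = (len(points) - 1) / (max_points - 1)
    return [points[round(i * step)] for i in range(max_points)]

# A two-point history must survive downsampling intact:
history = [(1713390106404, 34.32), (1713420121935, 37.04)]
print(downsample(history))  # both points, not just the last one
```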
We're using Bitnami's Helm chart and they haven't upgraded yet (server side is still on 2.11.3). Will report back once they do.
This appears to be fixed with 2.12.1 which is now available. Thanks!