[FR] evaluate metrics inside a prompt engineering UI run
Willingness to contribute
Yes. I would be willing to contribute this feature with guidance from the MLflow community.
Proposal Summary
I would like to know whether we can have a way to perform metrics-based evaluation on a prompt engineering UI run. I've seen in the documentation that this can be done by calling `mlflow.evaluate()` on data obtained via `mlflow.load_table()` from a specific prompt engineering UI run. It would be useful to be able to choose metrics directly when creating a new prompt engineering UI run and have them evaluated later.
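For reference, a minimal sketch of the current manual workflow described above, assuming the prompt engineering UI run logs its table as `eval_results_table.json` and that the generated outputs live in an `output` column (the run ID and column name are placeholders):

```python
import mlflow

# Load the table that the prompt engineering UI run logged as an artifact.
# The artifact file name and run ID below are placeholders.
df = mlflow.load_table(
    "eval_results_table.json",
    run_ids=["<prompt-engineering-run-id>"],
)

# Evaluate the logged outputs as a static dataset (no model object needed).
# The "output" column name and the model_type are assumptions; adjust them
# to match the columns in your table.
with mlflow.start_run():
    results = mlflow.evaluate(
        data=df,
        predictions="output",
        model_type="text",
    )
    print(results.metrics)
```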
Motivation
What is the use case for this feature?
Why is this use case valuable to support for MLflow users in general?
Why is this use case valuable to support for your project(s) or organization?
Why is it currently difficult to achieve this use case?
Details
No response
What component(s) does this bug affect?
- [ ] `area/artifacts`: Artifact stores and artifact logging
- [ ] `area/build`: Build and test infrastructure for MLflow
- [ ] `area/deployments`: MLflow Deployments client APIs, server, and third-party Deployments integrations
- [ ] `area/docs`: MLflow documentation pages
- [ ] `area/examples`: Example code
- [ ] `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] `area/models`: MLmodel format, model serialization/deserialization, flavors
- [ ] `area/recipes`: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] `area/projects`: MLproject format, project running backends
- [ ] `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- [ ] `area/server-infra`: MLflow Tracking server backend
- [ ] `area/tracking`: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] `area/windows`: Windows support
What language(s) does this bug affect?
- [ ] `language/r`: R APIs and clients
- [ ] `language/java`: Java APIs and clients
- [ ] `language/new`: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] `integrations/azure`: Azure and Azure ML integrations
- [ ] `integrations/sagemaker`: SageMaker integrations
- [ ] `integrations/databricks`: Databricks integrations
This sounds reasonable to me. cc @daniellok-db
Currently we don't evaluate anything; we just combine model inputs, outputs, and some parameters into the table. To support evaluation, we would likely need more inputs than just the metrics, for example a `ground_truth` field.
cc @prithvikannan @hubertzub-db
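For illustration, a hedged sketch of what such an evaluation could look like if the logged table also carried a ground-truth column (the column names and `model_type` here are assumptions, not the current table schema):

```python
import mlflow
import pandas as pd

# Hypothetical shape of an evaluation table extended with ground truth;
# the column names are assumptions.
df = pd.DataFrame(
    {
        "question": ["What is MLflow?"],
        "output": ["MLflow is an open source platform for the ML lifecycle."],
        "ground_truth": ["MLflow is an open source MLOps platform."],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=df,
        predictions="output",
        targets="ground_truth",
        model_type="question-answering",
    )
    print(results.metrics)
```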
> This sounds reasonable to me. cc @daniellok-db
> Currently we don't evaluate anything; we just combine model inputs, outputs, and some parameters into the table. To support evaluation, we would likely need more inputs than just the metrics, for example a `ground_truth` field.
Yes, basically it is `mlflow.evaluate` inside the prompt engineering UI, specifically for LLM evaluation.
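To make that concrete, here is a sketch of the manual LLM-evaluation path such a UI feature would automate. The artifact name, run ID, `output` and `ground_truth` columns, and the OpenAI-backed judge model are all assumptions, and the judge requires the corresponding API credentials to be configured:

```python
import mlflow
from mlflow.metrics.genai import answer_similarity

# Re-load the prompt engineering run's logged table (placeholders below).
df = mlflow.load_table(
    "eval_results_table.json",
    run_ids=["<prompt-engineering-run-id>"],
)

# Score the logged outputs with an LLM-judge metric in addition to the
# built-in question-answering metrics.
with mlflow.start_run():
    results = mlflow.evaluate(
        data=df,
        predictions="output",
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_similarity(model="openai:/gpt-4")],
    )
    print(results.tables["eval_results_table"])
```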
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.