
[FR] Experiment Pagination in UI

Open jmahlik opened this issue 1 year ago • 20 comments

Willingness to contribute

Yes. I would be willing to contribute this feature with guidance from the MLflow community.

Proposal Summary

#7804 went a long way in improving the performance of the UI by virtualizing the experiments list.

Eventually, the UI will stop scaling for large numbers of experiments because all experiments are fetched when loading the HomePage in the UI.

This issue is a follow up on the points from #7174 about batching/paginating the fetch for experiments.

It might be good for someone who is quite familiar with the code base to pick this up. I attempted this in #7174 and ended up having to touch far more components than I initially thought.

There are (at least) two potential approaches:

  1. Decouple experiments in the redux store from components
    • Each component is responsible for fetching any experiments it needs on the fly
    • Might be cleaner than option 2
    • May result in more http requests (possibly worth it)
  2. Find a way to paginate fetching of experiments and keep them in the redux store
    • Components can no longer rely on all experiments being in the store and may have to fetch some on the fly.
    • All fetched experiments probably can't remain in the store indefinitely (1,000,000 experiments would not bode well).
    • Take some of the work from #7174 for this
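To make option 2's "can't keep everything in the store" concern concrete, here is a minimal sketch of a bounded, LRU-style experiment cache. This is illustrative only: the real store would be the redux store in the JS frontend, and the class and method names here are invented.

```python
from collections import OrderedDict


class BoundedExperimentCache:
    """Illustrative LRU cache: evicts the least recently used experiment
    once max_size is exceeded, so the store never grows unbounded
    (the 1,000,000-experiment concern above)."""

    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._entries = OrderedDict()

    def put(self, experiment_id: str, experiment: dict) -> None:
        if experiment_id in self._entries:
            self._entries.move_to_end(experiment_id)
        self._entries[experiment_id] = experiment
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # drop least recently used

    def get(self, experiment_id: str):
        if experiment_id not in self._entries:
            return None  # caller must fetch on the fly, as noted above
        self._entries.move_to_end(experiment_id)
        return self._entries[experiment_id]


cache = BoundedExperimentCache(max_size=2)
cache.put("0", {"name": "exp-0"})
cache.put("1", {"name": "exp-1"})
cache.get("0")                     # touch "0" so "1" is now the oldest
cache.put("2", {"name": "exp-2"})  # evicts "1"
print(cache.get("1"))              # None: must be re-fetched
print(cache.get("0")["name"])      # exp-0
```

Note that eviction is exactly what breaks link sharing: if experiment "1" is shared and then evicted, the receiving component must fall back to a fetch rather than assuming it is in the store.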

Things to watch out for if going the redux store route:

  • Clearing the store
    • Filtering/list ordering is hard without clearing the store; items get weirdly out of order. The JavaScript map reorders the experiments in the store every time because the keys are ints.
    • If the store is cleared, sharing a link to an experiment not in the store will break.
  • Checked keys in the ExperimentListView require some special state handling.
  • Sharing links - all components that involve experiment info in the query parameters would need a close look

Similar request in https://github.com/mlflow/mlflow/issues/4288. That one got closed out as completed (likely by accident; the UI part got missed).

Motivation

What is the use case for this feature?

Allowing the MLflow UI to scale beyond 5-10k experiments.

Why is this use case valuable to support for MLflow users in general?

Users with larger numbers of experiments can avoid client and server-side issues.

Why is this use case valuable to support for your project(s) or organization?

Our use case has a large number of experiments with plans to grow.

Why is it currently difficult to achieve this use case?

All experiments are fetched at once. (and I'm not that great at react 🤣)

Details

For testing, I've included a Python script that seeds a development database and then starts mlflow server against it.

I wonder if it's worth including something like this in the repo itself and the contributing docs? It would allow those contributing to the UI to develop against a more realistic use case. Maybe something like this already exists?

import argparse
import contextlib
import subprocess
import os
from typing import Generator
import uuid

from mlflow.store.tracking import sqlalchemy_store
from mlflow.entities import RunStatus, SourceType
from mlflow.entities.lifecycle_stage import LifecycleStage
from mlflow.store.tracking.dbmodels.models import (
    SqlExperiment,
    SqlRun,
)
from mlflow.utils.uri import append_to_uri_path


@contextlib.contextmanager
def setup_mlflow_database(db_path: str, experiments: int, runs: int) -> Generator[str, None, None]:
    """Use mlflow store to set up a basic database and seed it."""
    if os.path.isfile(db_path):
        os.remove(db_path)
    db_uri = f"sqlite:///{db_path}"
    # Does not use a windows path here because it can't be replaced
    default_artifact_root = "mlruns"

    store = sqlalchemy_store.SqlAlchemyStore(
        db_uri=db_uri, default_artifact_root=default_artifact_root
    )

    store.create_experiment(name="what")
    experiments_to_make = [
        SqlExperiment(name=name, artifact_location=f"{default_artifact_root}/{name}")
        for name in (str(i) for i in range(experiments))
    ]
    with store.ManagedSessionMaker() as session:
        store._save_to_db(objs=experiments_to_make, session=session)

    with store.ManagedSessionMaker() as session:
        experiments_made = session.query(SqlExperiment).all()
        runs_to_make = [
            SqlRun(
                name=f"hello mlflow {r}",
                artifact_uri=append_to_uri_path(
                    experiment.artifact_location,
                    uuid.uuid4().hex,
                    sqlalchemy_store.SqlAlchemyStore.ARTIFACTS_FOLDER_NAME,
                ),
                run_uuid=uuid.uuid4().hex,
                experiment_id=experiment.experiment_id,
                source_type=SourceType.to_string(SourceType.UNKNOWN),
                source_name="",
                entry_point_name="",
                user_id="hello",
                status=RunStatus.to_string(RunStatus.RUNNING),
                source_version="",
                lifecycle_stage=LifecycleStage.ACTIVE,
            )
            for r in range(runs)
            for experiment in experiments_made
        ]
        print(f"runs made {len(runs_to_make)}")
        store._save_to_db(objs=runs_to_make, session=session)
    del store
    try:
        yield db_uri
    finally:
        os.remove(db_path)


def start_server(backend_store_uri: str) -> None:
    _ = subprocess.check_output(
        [
            "mlflow",
            "ui",
            "--backend-store-uri",
            backend_store_uri,
            "--host",
            "localhost",
            "--port",
            "5000",
        ],
    )


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--experiments", type=int, default=5000, help="Number of experiments to create")
    parser.add_argument("-r", "--runs", type=int, default=50, help="Number of runs to create")
    parsed = parser.parse_args()
    with setup_mlflow_database("test-db.db", parsed.experiments, parsed.runs) as db_uri:
        start_server(db_uri)
    return 0


if __name__ == "__main__":
    exit(main())


What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [X] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

jmahlik avatar Mar 10 '23 21:03 jmahlik

Thanks @jmahlik!

@sunishsheth2009 Can you advise here? Option 1 sounds quite appealing :)

dbczumar avatar Mar 13 '23 22:03 dbczumar

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

mlflow-automation avatar Mar 18 '23 00:03 mlflow-automation

Hey @jmahlik thank you for the suggestions.

Each component is responsible for fetching any experiments it needs on the fly

Can you explain a bit more here? Basically, are we just talking about the experiment list on the left side only requiring the name, so we fetch only the name and let the right side fetch everything else? Or am I missing something? I think the right side is already configured to fetch just the one experiment it requires, as far as I remember.

Also, I like the paginated approach of just fetching the name of the experiment. It shouldn't be too bad to store 1M experiment names in the redux store (it's very unlikely that a user creates 1M experiments).

sunishsheth2009 avatar Mar 19 '23 17:03 sunishsheth2009

Can you explain a bit more here? Basically, are we just talking about the experiment list on the left side only requiring the name, so we fetch only the name and let the right side fetch everything else? Or am I missing something? I think the right side is already configured to fetch just the one experiment it requires, as far as I remember.

Not pulling all experiments right away caused a bunch of issues when an experiment wasn't in the store but a user navigated to it. Maybe I'll take another pass at:

  • Disconnecting the list from the store entirely
  • Removing the initial fetch for all experiments

and see what issues that might cause. Then we could talk through them?

Also, I like the paginated approach of just fetching the name of the experiment. It shouldn't be too bad to store 1M experiment names in the redux store (it's very unlikely that a user creates 1M experiments).

I like this idea. Is there an API that returns only experiment names? Or maybe parse the full JSON and only retain the name?

jmahlik avatar Mar 24 '23 21:03 jmahlik

@sunishsheth2009 an example PR in #8180 if you want to take a look.

jmahlik avatar Apr 05 '23 16:04 jmahlik

@jmahlik 🤔 we should make sure that the solution to this issue both works well and is intuitive to use. We will try to engage a UX designer, get some guidelines on how to solve this properly, and then get back to this thread with some ideas

hubertzub-db avatar Apr 12 '23 14:04 hubertzub-db

@jmahlik 🤔 we should make sure that the solution to this issue both works well and is intuitive to use. We will try to engage a UX designer, get some guidelines on how to solve this properly, and then get back to this thread with some ideas

That would be awesome. #8180 is not a good solution IMO, more an example of pitfalls.

jmahlik avatar Apr 12 '23 16:04 jmahlik

Any updates on getting a UX designer involved? Fetching around 5,000 experiments seems to be a tipping point since the payload is pretty large, depending on network latency and the machine the UI runs on (the JS has to parse the response body).

jmahlik avatar Jul 12 '23 18:07 jmahlik

Hi @jmahlik - I'm a UX designer and looking into this. Wondering if a simple pagination control here could help solve this issue?

[Screenshot: pagination control mockup]

We can perhaps have up to, let's say, 100 runs show up per page, so users don't have to keep switching between pages and can still browse through their list of experiments. Thoughts? CC @hubertzub-db

ridhimag11 avatar Jul 13 '23 22:07 ridhimag11

@ridhimag11 thanks! The issue here is that the search experiments API uses cursor-based page tokens, meaning we can get the next/previous page but we don't have an indicator of how many pages/results there are. That being said, I believe we have to either

  • use "Next page" / "Prev page" (just like in model list page)

or alternatively

  • use "Load more" pattern like in experiment runs page

what's the better approach here?
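As a toy illustration of why cursor tokens preclude a numbered pager, here's a small Python simulation. The token format and the in-memory "server" are invented for the example; real page tokens are opaque strings, so the client genuinely cannot compute a total page count.

```python
# Toy simulation of cursor-based pagination (assumed semantics only;
# not MLflow's actual search implementation).

EXPERIMENTS = [f"experiment-{i}" for i in range(7)]
PAGE_SIZE = 3


def search_experiments(page_token=None):
    """Return one page plus an opaque token for the next page (or None).
    Like a cursor API, it never reveals the total number of results."""
    start = int(page_token) if page_token else 0
    page = EXPERIMENTS[start : start + PAGE_SIZE]
    more = start + PAGE_SIZE < len(EXPERIMENTS)
    return page, (str(start + PAGE_SIZE) if more else None)


# "Load more" pattern: remember only the last token and keep appending.
loaded, token = search_experiments()
while token is not None:  # in the UI, each iteration is a button click
    page, token = search_experiments(token)
    loaded.extend(page)

print(loaded)
```

Jumping straight to page N would require walking every intermediate token, which is why "Next/Prev" or "Load more" are the natural UI patterns here.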

hubertzub-db avatar Jul 14 '23 07:07 hubertzub-db

Thanks for taking a look @ridhimag11 :).

I'll add a bit more detail. The main thing left to solve is that on loading the home page, all experiments in the experiment table are fetched and put in the redux store. This gets slow when the payload is large.

The rest of the components expect the experiments to exist in the store (like a local copy of the database table). If all experiments don't exist in the redux store, the other components break. E.g., if only experiments 0-25 are fetched on load (page 1 in your example), then navigating directly to the link for experiment 5000 404s.

I'm pretty indifferent to how pagination in the UI actually happens, but I think we might run into the same ordering issues regardless of the pagination style when everything isn't loaded at once.

jmahlik avatar Jul 14 '23 13:07 jmahlik

Gotcha! Thanks @hubertzub-db and @jmahlik. @hubertzub-db from a UI standpoint, neither of the approaches is ideal here since we'd want users to be able to go to a specific page in the list (e.g. if they want to see the oldest experiments). That being said, for consistency with the runs list on this page, I'm thinking we could go with the "load more" pattern here.

ridhimag11 avatar Jul 14 '23 22:07 ridhimag11

Thanks @ridhimag11! In this case, we just need to remember the last next_page_token and use it when "Load more" is hit, just like with experiment runs. @jmahlik do you think you want to tackle this one?

hubertzub-db avatar Jul 17 '23 08:07 hubertzub-db

Thanks @ridhimag11! In this case, we just need to remember the last next_page_token and use it when "Load more" is hit, just like with experiment runs. @jmahlik do you think you want to tackle this one?

I ran into quite a few problems last time when trying to not pull everything at once. I've been pretty pressed for time lately, so I don't think I'd be able to pick it up. It might be better for someone more familiar with the code base/redux.

jmahlik avatar Jul 19 '23 13:07 jmahlik

@jmahlik Is this still a blocker on your side? Sadly we can't prioritize fixing this issue at the moment, but it will be on our radar.

hubertzub-db avatar Aug 17 '23 13:08 hubertzub-db

It does limit the scalability of the UI generally. It's mostly future-proofing for the number of experiments growing over time, so we're not directly blocked at the moment. Once there are around 10k experiments, that's when the problem starts to show.

jmahlik avatar Aug 17 '23 14:08 jmahlik

Guys, any updates here?

I think a realistic production use must support 1M+ experiments/runs. They may be divided over time into "hot" for the more recent ones and "cold" for the old ones, which will take more time to fetch. But currently the product suffers from a very bad user experience: with only 1K experiments, the user sees very slow response times that can take minutes.

Do you guys have any roadmap for improving performance? If this is interesting to you, I am willing to contribute to the subject.

progovoy avatar Sep 13 '23 13:09 progovoy

Hello @progovoy, thanks for reaching out! Sadly, improvements to the experiment list are not prioritized at the moment - however, it would be wonderful if you could contribute here. We can provide support where necessary.

To recap the necessary steps to improve the performance here, here's a proposal for how to fix it by implementing simple pagination based on @ridhimag11's guidelines above:

  • make the experiment limit of the search API call modifiable, e.g. via a function param (here)
  • fetch fewer experiments on the initial call (e.g. 100)
  • after retrieving the experiment list, check whether a next_page_token field is present in the response; if so, show a "Load more" button at the end of the list
    • (clicking "Load more" should perform a call similar to the initial one, but with next_page_token attached)
  • wire up the filter input box to the request query, i.e. implement a mechanism that adds a filter=name ILIKE "<filter-value>%" query parameter to the search experiments GET API call (according to those docs)
  • make sure that the page token gets reset after the filter query changes
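The steps above can be sketched as client-side state. This is Python pseudocode of the intended frontend behavior; the class, the `fetch_page` callback, and the fake fetcher below are all hypothetical, and the real implementation would live in the JS frontend.

```python
class ExperimentListState:
    """Sketch of the "Load more" + filter state from the steps above."""

    def __init__(self, fetch_page):
        # fetch_page(filter_string, page_token) -> (experiments, next_token)
        self._fetch_page = fetch_page
        self.filter_string = ""
        self.next_token = None
        self.experiments = []

    def set_filter(self, value: str) -> None:
        # Changing the filter must reset the page token and the list;
        # otherwise "Load more" would continue the old result set.
        self.filter_string = f"name ILIKE '{value}%'" if value else ""
        self.next_token = None
        self.experiments = []
        self.load_more()

    def load_more(self) -> None:
        page, self.next_token = self._fetch_page(self.filter_string, self.next_token)
        self.experiments.extend(page)

    @property
    def has_more(self) -> bool:  # drives "Load more" button visibility
        return self.next_token is not None


# Fake fetcher standing in for the search experiments API.
DATA = [f"exp-{i}" for i in range(5)]

def fake_fetch(filter_string, token):
    start = int(token) if token else 0
    more = start + 2 < len(DATA)
    return DATA[start : start + 2], (str(start + 2) if more else None)

state = ExperimentListState(fake_fetch)
state.load_more()
state.load_more()
print(len(state.experiments), state.has_more)  # 4 True
state.set_filter("exp")                        # resets token, refetches page 1
print(len(state.experiments))                  # 2
```

The key invariant is the last bullet: `set_filter` always clears `next_token` before fetching, so a stale cursor can never be combined with a new filter.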

Does this make sense to you @progovoy ?

hubertzub-db avatar Sep 15 '23 11:09 hubertzub-db

Hi everyone! Is there any plan for this feature? I'm looking into MLflow, and IMO the lack of this feature really makes the UI unusable. A standard ML ecosystem deployment will have thousands of experiments, which already overloads the UI. Also, there's no way to see the experiment creation date, sort by date, etc.

asaff1 avatar Mar 26 '24 15:03 asaff1

Hi everyone! Is there any plan for this feature? I'm looking into MLflow, and IMO the lack of this feature really makes the UI unusable. A standard ML ecosystem deployment will have thousands of experiments, which already overloads the UI. Also, there's no way to see the experiment creation date, sort by date, etc.

It requires some pretty heavy refactoring. I haven't had the bandwidth to give it another shot.

jmahlik avatar Apr 15 '24 15:04 jmahlik