[FR] Experiment Pagination in UI
Willingness to contribute
Yes. I would be willing to contribute this feature with guidance from the MLflow community.
Proposal Summary
#7804 went a long way in improving the performance of the UI by virtualizing the experiments list.
Eventually, the UI will stop scaling for large numbers of experiments because all experiments are fetched when loading the `HomePage` in the UI.
This issue is a follow-up on the points from #7174 about batching/paginating the fetch for experiments.
It might be good for someone who is quite familiar with the code base to pick this up. I attempted it in #7174 and ended up having to touch far more components than initially thought.
There are (at least) two potential approaches:
- Decouple experiments in the redux store from components
  - Each component is responsible for fetching any experiments it needs on the fly
  - Might be cleaner than option 2
  - May result in more HTTP requests (possibly worth it)
- Find a way to paginate the fetching of experiments and keep them in the redux store (a rough sketch of this follows below)
  - All components cannot rely on all experiments being in the store and may have to fetch some on the fly.
  - All fetched experiments probably can't remain in the store indefinitely (1,000,000 experiments would not bode well).
  - Take some of the work from #7174 for this
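To make the second option a bit more concrete, here's a minimal sketch of what a paginated, normalized slice could look like. All names here are illustrative, not the actual MLflow store shape:

```ts
// Hypothetical shape for a paginated experiments slice (names are illustrative).
// Keeping an explicit orderedIds array avoids relying on object key order,
// and nextPageToken remembers where the next fetch should resume.
interface Experiment {
  experimentId: string;
  name: string;
}

interface ExperimentsSlice {
  byId: Record<string, Experiment>; // lookup for detail views / deep links
  orderedIds: string[];             // preserves the server-side list order
  nextPageToken?: string;           // cursor for the next page, if any
}
```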
Things to watch out for if going the redux store route:
- Clearing the store
  - Filtering/list ordering is hard without clearing the store; items will get weirdly out of order. JavaScript reorders the experiments in the store every time because the key is an int (see the snippet after this list).
  - If the store is cleared, sharing a link to an experiment not in the store will break.
- Checked keys in the `ExperimentListView` require some special state handling.
- Sharing links: all components that involve experiment info in the query parameters would need a look.
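For anyone unfamiliar with the ordering pitfall above: plain JavaScript objects enumerate integer-like keys in ascending numeric order regardless of insertion order, so a store keyed by numeric experiment IDs silently drops whatever ordering the server returned. A quick demo:

```ts
// Integer-like keys are enumerated in ascending numeric order,
// not in the order they were inserted.
const byId: Record<number, string> = {};
byId[42] = "most recently updated experiment";
byId[7] = "older experiment";
console.log(Object.keys(byId)); // ["7", "42"] -- server ordering is lost
```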
There's a similar request in https://github.com/mlflow/mlflow/issues/4288 that got closed out as completed (likely by accident; the UI part got missed).
Motivation
What is the use case for this feature?
Allowing the mlflow UI to scale beyond 5-10k experiments.
Why is this use case valuable to support for MLflow users in general?
Users with larger numbers of experiments can avoid client- and server-side issues.
Why is this use case valuable to support for your project(s) or organization?
Our use case has a large number of experiments with plans to grow.
Why is it currently difficult to achieve this use case?
All experiments are fetched at once (and I'm not that great at React 🤣).
Details
For testing, I've included a Python script that seeds a development database and then runs `mlflow ui` against it.
I wonder if it's worth including something like this in the repo itself and the contributing docs? It would allow those contributing to the UI to develop against a more realistic use case. Maybe something like this already exists?
```python
import argparse
import contextlib
import os
import subprocess
import uuid
from typing import Generator

from mlflow.entities import RunStatus, SourceType
from mlflow.entities.lifecycle_stage import LifecycleStage
from mlflow.store.tracking import sqlalchemy_store
from mlflow.store.tracking.dbmodels.models import SqlExperiment, SqlRun
from mlflow.utils.uri import append_to_uri_path


@contextlib.contextmanager
def setup_mlflow_database(db_path: str, experiments: int, runs: int) -> Generator[str, None, None]:
    """Use the mlflow store to set up a basic database and seed it."""
    if os.path.isfile(db_path):
        os.remove(db_path)
    db_uri = f"sqlite:///{db_path}"
    # Does not use a Windows path here because it can't be replaced.
    default_artifact_root = "mlruns"
    store = sqlalchemy_store.SqlAlchemyStore(
        db_uri=db_uri, default_artifact_root=default_artifact_root
    )
    store.create_experiment(name="what")
    # Bulk-insert experiments directly through the ORM models for speed.
    experiments_to_make = [
        SqlExperiment(name=name, artifact_location=f"{default_artifact_root}/{name}")
        for name in (str(i) for i in range(experiments))
    ]
    with store.ManagedSessionMaker() as session:
        store._save_to_db(objs=experiments_to_make, session=session)
    with store.ManagedSessionMaker() as session:
        experiments_made = session.query(SqlExperiment).all()
        # Create `runs` runs for every experiment that was just inserted.
        runs_to_make = [
            SqlRun(
                name=f"hello mlflow {r}",
                artifact_uri=append_to_uri_path(
                    experiment.artifact_location,
                    uuid.uuid4().hex,
                    sqlalchemy_store.SqlAlchemyStore.ARTIFACTS_FOLDER_NAME,
                ),
                run_uuid=uuid.uuid4().hex,
                experiment_id=experiment.experiment_id,
                source_type=SourceType.to_string(SourceType.UNKNOWN),
                source_name="",
                entry_point_name="",
                user_id="hello",
                status=RunStatus.to_string(RunStatus.RUNNING),
                source_version="",
                lifecycle_stage=LifecycleStage.ACTIVE,
            )
            for r in range(runs)
            for experiment in experiments_made
        ]
        print(f"runs made {len(runs_to_make)}")
        store._save_to_db(objs=runs_to_make, session=session)
    del store
    try:
        yield db_uri
    finally:
        os.remove(db_path)


def start_server(backend_store_uri: str) -> None:
    _ = subprocess.check_output(
        [
            "mlflow",
            "ui",
            "--backend-store-uri",
            backend_store_uri,
            "--host",
            "localhost",
            "--port",
            "5000",
        ],
    )


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-e", "--experiments", type=int, default=5000, help="Number of experiments to create"
    )
    parser.add_argument("-r", "--runs", type=int, default=50, help="Number of runs to create")
    parsed = parser.parse_args()
    with setup_mlflow_database("test-db.db", parsed.experiments, parsed.runs) as db_uri:
        start_server(db_uri)
    return 0


if __name__ == "__main__":
    exit(main())
```
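If the script is saved as, say, `seed_db.py` (a hypothetical name), then `python seed_db.py -e 5000 -r 10` seeds a throwaway SQLite database and serves the UI at http://localhost:5000; the database file is deleted when the process exits.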
What component(s) does this bug affect?
- [ ] `area/artifacts`: Artifact stores and artifact logging
- [ ] `area/build`: Build and test infrastructure for MLflow
- [ ] `area/docs`: MLflow documentation pages
- [ ] `area/examples`: Example code
- [ ] `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] `area/models`: MLmodel format, model serialization/deserialization, flavors
- [ ] `area/recipes`: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] `area/projects`: MLproject format, project running backends
- [ ] `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- [ ] `area/server-infra`: MLflow Tracking server backend
- [ ] `area/tracking`: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [X] `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] `area/windows`: Windows support
What language(s) does this bug affect?
- [ ] `language/r`: R APIs and clients
- [ ] `language/java`: Java APIs and clients
- [ ] `language/new`: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] `integrations/azure`: Azure and Azure ML integrations
- [ ] `integrations/sagemaker`: SageMaker integrations
- [ ] `integrations/databricks`: Databricks integrations
Thanks @jmahlik!
@sunishsheth2009 Can you advise here? Option 1 sounds quite appealing :)
@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.
Hey @jmahlik thank you for the suggestions.
> Each component is responsible for fetching any experiments it needs on the fly
Can you explain a bit more here? Basically, are we just talking about the experiment list on the left side, where it only requires the name, so we'd only fetch the name and let the right side fetch everything else? Or am I missing something? I think the right side is already configured to fetch just the 1 experiment it requires, as far as I remember.
Also, I like the paginated approach of just fetching the name of the experiment. It shouldn't be as bad to store 1M experiment names in the redux store (and it's super unlikely that a user creates 1M experiments).
> Can you explain a bit more here? Basically, are we just talking about the experiment list on the left side, where it only requires the name, so we'd only fetch the name and let the right side fetch everything else? Or am I missing something? I think the right side is already configured to fetch just the 1 experiment it requires, as far as I remember.
When not pulling all experiments right away, it caused a bunch of issues if an experiment wasn't in the store but a user navigated to it. Maybe I'll take another go at:
- Disconnecting the list from the store entirely
- Removing the initial fetch for all experiments

And see what issues that might cause. Then we could talk through them?
> Also, I like the paginated approach of just fetching the name of the experiment. It shouldn't be as bad to store 1M experiment names in the redux store (and it's super unlikely that a user creates 1M experiments).
I like this idea. Is there an API that returns only experiment names? Or maybe parse the full JSON and retain only the name?
@sunishsheth2009 there's an example PR in #8180 if you want to take a look.
@jmahlik 🤔 we should make sure that the solution to this issue will both work well and be intuitive to use. We will try to engage a UX designer and get some guidelines on how to solve this properly, then get back to this thread with some ideas.
> @jmahlik 🤔 we should make sure that the solution to this issue will both work well and be intuitive to use. We will try to engage a UX designer and get some guidelines on how to solve this properly, then get back to this thread with some ideas.
That would be awesome. #8180 is not a good solution IMO, more an example of pitfalls.
Any updates on getting a UX designer involved? Fetching around 5,000 experiments seems to be a tipping point, since the payload is pretty large, depending on network latency and the PC the UI runs on (for the JS parsing of the response body).
Hi @jmahlik - I'm a UX designer and looking into this. Wondering if a simple pagination control here could help solve this issue?
We can perhaps have up to, let's say, 100 runs show up per page, so users don't have to keep switching between pages and can still browse through their list of experiments. Thoughts? CC @hubertzub-db
@ridhimag11 thanks! The issue here is that the search experiments API uses cursor-based page tokens, meaning we can get the next/previous page but we don't have an indicator of how many pages/results there are. That being said, I believe we have to either:
- use "Next page" / "Prev page" (just like on the model list page)

or alternatively

- use the "Load more" pattern like on the experiment runs page

What's the better approach here?
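To illustrate, a single page fetch against a cursor-paged search API looks roughly like this. The `max_results`/`page_token` parameters and the `next_page_token` response field match the search experiments API discussed above; the `/ajax-api` URL prefix and the function name are assumptions, not the UI's actual code:

```ts
// Sketch of one page fetch against the cursor-paged search API (names assumed).
async function fetchExperimentsPage(pageToken?: string) {
  const params = new URLSearchParams({ max_results: "100" });
  if (pageToken) params.set("page_token", pageToken);
  const response = await fetch(`/ajax-api/2.0/mlflow/experiments/search?${params}`);
  const body = await response.json();
  // next_page_token is only present when there are more results, which is
  // exactly why "how many pages are left" is unknowable with this API.
  return { experiments: body.experiments, nextPageToken: body.next_page_token };
}
```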
Thanks for taking a look @ridhimag11 :).
I'll add a bit more detail. The main thing left to solve is that on loading the home page, all experiments in the experiment table are fetched and put in the redux store. This gets slow when the payload is large.
The rest of the components expect the experiments to exist in the store (like a local copy of the database table). If all experiments don't exist in the redux store, the other components break. E.g. if only experiments 0-25 are fetched on load and in the redux store (so, in your example, page 1), then navigating directly to the link for experiment 5000 404s.
I'm pretty indifferent to how the pagination in the UI actually happens, but I think we might run into the same ordering issues regardless of the pagination style when everything isn't loaded at once.
Gotcha! Thanks @hubertzub-db and @jmahlik. @hubertzub-db, from a UI standpoint, neither of the approaches is ideal here, since we'd want users to be able to go to a specific page in the list (e.g. if they want to see the oldest experiments). That being said, I'm thinking that for consistency purposes on this page (with the runs list), we could go with the "load more" pattern here.
Thanks @ridhimag11! In this case, we just need to memorize the last `next_page_token` and use it on hitting "Load more", just like with experiment runs. @jmahlik do you think you want to tackle this one?
> Thanks @ridhimag11! In this case, we just need to memorize the last `next_page_token` and use it on hitting "Load more", just like with experiment runs. @jmahlik do you think you want to tackle this one?
Ran into quite a few problems last time when trying not to pull everything at once. I've been pretty pressed for time lately, so I don't think I'd be able to pick it up. It might be better for someone more familiar with the code base/redux.
@jmahlik Is this still a blocker on your side? Sadly we can't prioritize fixing this issue at the moment, but it will be on our radar.
It does limit the scalability of the UI generally. It's mostly future-proofing for the number of experiments growing over time, so it's not directly blocking at the moment. Once there are around 10k experiments, that's where it starts presenting.
Guys, any updates here?
I think realistic production use must support 1M+ experiments/runs. They may be divided over time into "hot" for more recent ones and "cold" for the old ones, which will take more time to fetch. But currently the product suffers from a very bad user experience, where with only 1K experiments the user sees very slow response times that can take minutes.
Do you have any roadmap for improving performance? If this is interesting to you, I am willing to contribute on the subject.
Hello @progovoy, thanks for reaching out! Sadly, improvements to the experiment list are not prioritized at the moment; however, it would be wonderful if you could contribute here. We can provide support where necessary.
To recap the necessary steps to improve performance here, this is the proposal for fixing it by implementing simple pagination based on @ridhimag11's guidelines above:
- make the search API call experiment limit modifiable, e.g. by function param (here)
- fetch fewer experiments on the initial call (e.g. 100)
- after retrieving the experiment list, check if there's a `next_page_token` field present in the response and, if so, show a "Load more" button at the end of the list
  - clicking "Load more" should perform a call similar to the initial one, but with `next_page_token` attached
- wire up the filter input box to the request query, i.e. implement a mechanism that will add a `filter=name ILIKE "<filter-value>%"` query parameter to the search experiments GET API call (according to those docs)
- make sure that the page token gets reset after changing the filter query
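To make those steps concrete, here's a rough sketch of the state handling they imply. All names are hypothetical, not the actual component code:

```ts
// Hypothetical "Load more" + filter handling (a sketch, not the UI's code).
interface ListState {
  experiments: unknown[];
  nextPageToken?: string; // memorized cursor from the last response
  filter: string;
}

async function loadMore(state: ListState): Promise<ListState> {
  const params = new URLSearchParams({ max_results: "100" });
  if (state.nextPageToken) params.set("page_token", state.nextPageToken);
  if (state.filter) params.set("filter", `name ILIKE "${state.filter}%"`);
  const body = await (await fetch(`/ajax-api/2.0/mlflow/experiments/search?${params}`)).json();
  return {
    ...state,
    experiments: [...state.experiments, ...(body.experiments ?? [])],
    // Absent on the last page, which is the cue to hide the "Load more" button.
    nextPageToken: body.next_page_token,
  };
}

function onFilterChange(state: ListState, filter: string): ListState {
  // Changing the filter invalidates the old cursor: reset the token and the list.
  return { ...state, filter, experiments: [], nextPageToken: undefined };
}
```

Resetting the token on filter change is the piece that's easy to miss; pairing a stale cursor with a new filter would return inconsistent pages.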
Does this make sense to you @progovoy ?
Hi everyone! Is there any plan for this feature? I'm looking into MLflow, and IMO this missing feature really makes the UI unusable. A standard ML ecosystem deployment will have thousands of experiments, which will already overload the UI. Also, there's no way to see experiment creation dates, sort by date, etc.
> Hi everyone! Is there any plan for this feature? I'm looking into MLflow, and IMO this missing feature really makes the UI unusable. A standard ML ecosystem deployment will have thousands of experiments, which will already overload the UI. Also, there's no way to see experiment creation dates, sort by date, etc.
It requires some pretty heavy refactoring. I haven't had the bandwidth to give it another shot.