MLflow worker timeout when opening UI
System information
- Have I written custom code (as opposed to using a stock example script provided in MLflow): no
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.5
- MLflow installed from (source or binary): pip install mlflow
- MLflow version (run mlflow --version): mlflow, version 0.8.2
- Python version: Python 3.6.6 :: Anaconda, Inc.
- npm version (if running the dev UI):
- Exact command to reproduce: mlflow server --file-store /bigdata/mlflow --host 0.0.0.0
Describe the problem
The MLflow UI shows the Niagara Falls error page with "Oops! Something went wrong" every time I try opening it. I've been using it for two months, but recently it started crashing, and as of today I cannot get the UI to open at all anymore.
Logs
Server logs after a fresh restart:
```
[2019-02-26 12:34:36 +0000] [9] [INFO] Starting gunicorn 19.9.0
[2019-02-26 12:34:36 +0000] [9] [INFO] Listening at: http://0.0.0.0:5000 (9)
[2019-02-26 12:34:36 +0000] [9] [INFO] Using worker: sync
[2019-02-26 12:34:36 +0000] [12] [INFO] Booting worker with pid: 12
[2019-02-26 12:34:36 +0000] [14] [INFO] Booting worker with pid: 14
[2019-02-26 12:34:36 +0000] [15] [INFO] Booting worker with pid: 15
[2019-02-26 12:34:36 +0000] [18] [INFO] Booting worker with pid: 18
[2019-02-26 12:35:30 +0000] [9] [CRITICAL] WORKER TIMEOUT (pid:14)
[2019-02-26 12:35:30 +0000] [14] [INFO] Worker exiting (pid: 14)
[2019-02-26 12:35:30 +0000] [28] [INFO] Booting worker with pid: 28
```
Browser console logs when opening the UI:
```
setupAjaxHeaders.js:22
{_xsrf: "2|a583f945|b32757069a3ea1c54e37f87dba1c1428|1549020795"}
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
at service-worker.js:1
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
at service-worker.js:1
jquery.js:9355 POST http://localhost:5000/ajax-api/2.0/preview/mlflow/runs/search net::ERR_EMPTY_RESPONSE
Actions.js:155 XHR failed
{readyState: 0, getResponseHeader: ƒ, getAllResponseHeaders: ƒ, setRequestHeader: ƒ, overrideMimeType: ƒ, …}
react-dom.production.min.js:151 TypeError: Cannot read property 'getErrorCode' of undefined
at errorRenderFunc (ExperimentPage.js:122)
at e.value (RequestStateWrapper.js:51)
at f (react-dom.production.min.js:131)
at beginWork (react-dom.production.min.js:138)
at o (react-dom.production.min.js:176)
at a (react-dom.production.min.js:176)
at x (react-dom.production.min.js:182)
at y (react-dom.production.min.js:181)
at v (react-dom.production.min.js:181)
at d (react-dom.production.min.js:180)
AppErrorBoundary.js:19 TypeError: Cannot read property 'getErrorCode' of undefined
at errorRenderFunc (ExperimentPage.js:122)
at e.value (RequestStateWrapper.js:51)
at f (react-dom.production.min.js:131)
at beginWork (react-dom.production.min.js:138)
at o (react-dom.production.min.js:176)
at a (react-dom.production.min.js:176)
at x (react-dom.production.min.js:182)
at y (react-dom.production.min.js:181)
at v (react-dom.production.min.js:181)
at d (react-dom.production.min.js:180)
:5000/#/experiments/1:1 Uncaught (in promise)
t {xhr: {…}}
```
Same problem here! I started a parameter search before the weekend and have therefore run far more experiments than before. Now I can't start the UI anymore. Where do I start troubleshooting?
It turns out that the UI simply doesn't handle too many runs (in my case it starts struggling when mlruns contains more than circa 1000 experiments). Around this threshold the UI becomes unstable (sometimes it crashes, sometimes it works, but it's never quick and responsive), and eventually there are too many runs and it won't load at all.
This goes a bit against the philosophy of being able to track all your experiments.
Would using a local DB instead of file storage help? Hosting externally is not an option for me.
As a side note: during troubleshooting I discovered that when you move runs around into different folders, it's important to update the artifact_location parameter in the main meta.yaml, otherwise you'll experience a different type of crash without a clear warning.
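For anyone who has moved their store and hit that second crash, below is a minimal sketch of the kind of fixup meant here, assuming the file-store layout where each experiment directory under the mlruns root holds a meta.yaml with an artifact_location key; the old and new path prefixes are placeholders.
```python
from pathlib import Path

import yaml  # PyYAML

MLRUNS = Path("/bigdata/mlflow")     # root of the (moved) file store
OLD_PREFIX = "/old/location/mlflow"  # hypothetical previous location
NEW_PREFIX = str(MLRUNS)

# Each experiment directory keeps a meta.yaml whose artifact_location may
# still point at the old path after a move; rewrite it to the new prefix.
for meta_path in MLRUNS.glob("*/meta.yaml"):
    meta = yaml.safe_load(meta_path.read_text())
    location = meta.get("artifact_location", "")
    if location.startswith(OLD_PREFIX):
        meta["artifact_location"] = NEW_PREFIX + location[len(OLD_PREFIX):]
        meta_path.write_text(yaml.safe_dump(meta, default_flow_style=False))
        print(f"updated {meta_path}")
```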
Same here. The UI either times out or crashes with about 650 runs... sometimes it works, mostly it doesn't.
In MLflow 1.1 the runs listing has been changed to show the first 100 runs plus a "Load more" button if you have more than 100. Could you please try it and see if that improves your situation?
We checked that out a few days ago and got exactly the same error.
So I just started running into this issue after upgrading to 1.0.0+ and noticed that the URL for static files is incorrect, causing the worker to block. Basically, if I switch http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css to http://localhost:5000/static-files/static/css/main.fbf8a477.css, these assets load fine. Anyone have a patch for this?
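A quick way to check whether the doubled /static-files prefix is what is hanging your workers is to request both URL variants directly. A minimal sketch, assuming the server is reachable at localhost:5000 and reusing the asset name from the console log above:
```python
import requests

BASE = "http://localhost:5000"          # placeholder server address
ASSET = "static/css/main.fbf8a477.css"  # asset name taken from the console log

# Compare the doubled prefix the browser requests with the corrected one.
for path in (f"/static-files/static-files/{ASSET}", f"/static-files/{ASSET}"):
    try:
        response = requests.get(BASE + path, timeout=10)
        print(response.status_code, path)
    except requests.RequestException as exc:
        print("request failed:", path, exc)
```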
Any update on this issue? It still exists on v1.2.0.0
This is unexpectedly unpleasant. I did a number of runs with the idea of sorting metrics best-to-worst afterwards, but the UI indeed crashes after more than ~1000 runs...
Moreover, it should only have to load the first 100 runs and show "Load more" afterwards.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has not been interacted with for 60 days since it was marked as stale! As such, the issue is now closed. If you have this issue or more information, please re-open the issue!
Still getting this issue in MLflow 1.4. Is the situation improved by using a database backend?
Getting the issue with 1.7 and a Postgres backend.
It is the same for 1.7 without a Postgres backend. Could this issue be re-opened?
I have just upgraded the server to version 1.9.0 (without a Postgres backend) and nothing has changed.
Adding --gunicorn-opts "--timeout 180" has helped somewhat, but the number of our experiments is constantly growing, so even 180 seconds will soon not be sufficient. And waiting that long for the results of some simple queries is rather annoying.
Could you please look into this issue?
Same issue; the number of runs is ~50. In my case the machine also runs TensorBoard, which uses a lot of RAM - it looks like less available RAM makes this issue more severe.
I face the same issue with version 1.10.0 and the file system backend. All files are generated as expected, but the same "WORKER TIMEOUT" message returns when I try to access individual records (i.e., clicking the dates' hyperlinks).
Reopening this issue as per community request and reassigning priority to get it into the queue.
@gkonstanty: where are you adding the --gunicorn-opts "--timeout 180" option?
@spott @gkonstanty: where are you adding the --gunicorn-opts "--timeout 180" option?
I couldn't get mlflow ui --gunicorn-opts "--timeout 180" to work either (error: no such option --gunicorn-opts).
But the following worked for me:
GUNICORN_CMD_ARGS="--timeout 180" mlflow ui
Sorry, @spott, I missed your message. I'm adding it to mlflow server:
mlflow server --host 0.0.0.0 -p 5000 --backend-store-uri /mlflow/data/ --default-artifact-root /mlflow/artifacts/ --gunicorn-opts "--timeout 180"
I have the same issue (version 1.14.1), and I use Postgres as the backend database. When I access the UI, the list of experiments doesn't load and the spinner just keeps spinning. To me it feels like it cannot query the database to display the data.
It happens very often (> 50% of the time when I try to access the UI, especially after a run has finished), and I don't even have a lot of experiments (<10 per collection, 2 collections in total).
MLflow 1.18, year 2021, and the issue is still here...
MLflow 1.21, same issue...
I got somewhat better performance with this:
mlflow server --backend-store-uri=postgresql://postgres:${RDS_PASSWORD}@${RDS_HOST}:5432/mlflow --default-artifact-root=${ARTIFACT_STORE} --host 0.0.0.0 --port 5000 --gunicorn-opts "--worker-class gevent --threads 3 --workers 3 --timeout 300 --keep-alive 300 --log-level INFO"
MLflow 1.26.1, still the same issue! Can you please provide a workaround or something? Even setting --gunicorn-opts is not helping.
It seems like limiting the number of experiments initially displayed and having a "Load more" button, similar to how the runs pages work, might help.
The issue indeed presents only when there are about 1000+ experiments. The Python APIs continue to function fine with a large number of experiments; it's only the UI/JS side that gets bogged down.
I tracked down how loading more runs was implemented in https://github.com/mlflow/mlflow/pull/1564. Maybe some of that implementation could be borrowed?
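To illustrate the point that the Python API stays responsive, here is a minimal sketch that pages through runs 100 at a time with the client API, mirroring what a "Load more" button would do; the tracking URI and experiment name are placeholders:
```python
from mlflow.tracking import MlflowClient

# Point the client at the same tracking server the UI talks to (placeholder URI).
client = MlflowClient(tracking_uri="http://localhost:5000")

experiment = client.get_experiment_by_name("my-parameter-search")  # placeholder name
assert experiment is not None, "experiment not found"

runs, page_token = [], None
while True:
    # Fetch runs 100 at a time instead of asking for everything at once.
    batch = client.search_runs(
        [experiment.experiment_id], max_results=100, page_token=page_token
    )
    runs.extend(batch)
    page_token = batch.token
    if not page_token:
        break

print(f"fetched {len(runs)} runs without the request timing out")
```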
I only have 3 to 5 experiments but still get this error. I just have a lot of metrics and parameters in each experiment, so the "Load more" idea is not possible.
Well, I found one workaround here. Try mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME --gunicorn-opts "--timeout 0"
This will wait until the data transfer finishes and the page loads. I had to delete a few of my experiments from my S3 bucket to make it load a little faster, which I don't think is a very welcome workaround.
We'll have to wait for a permanent fix for this.
With a large timeout set, our problem seems to be only with the UI generating the experiments table with 1000+ experiments. I wonder if defaulting the experiments sidebar to hidden would help as a short-term fix? Collapsing the sidebar seems to fix the problem (after waiting a few minutes for it to load). Yes, it would break almost immediately when someone clicks to expand it, but a user could still use the UI if they knew the experiment ID beforehand (see the sketch below for looking one up via the API).
Maybe these are really two separate issues: one for large numbers of runs and one for the experiments on the home page?
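As a stopgap along those lines, the experiment ID can be looked up through the tracking API and the UI opened directly at /#/experiments/<id>, skipping the sidebar. A minimal sketch, with the tracking URI and experiment name as placeholders:
```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")  # placeholder URI

# Resolve the experiment ID through the API so the UI can be opened directly
# at its experiment page instead of waiting for the experiments sidebar.
experiment = client.get_experiment_by_name("my-parameter-search")  # placeholder name
if experiment is not None:
    print(f"open http://localhost:5000/#/experiments/{experiment.experiment_id}")
else:
    print("experiment not found")
```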
The problem is not only a large number of experiments but also a large number of logged experiment metrics, as in my case. Thus the issue is due to bad UI design, and I suppose it can only be fixed with a big refactoring.
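For reference, this is roughly what a metric-heavy run looks like: a single run logging one value per step already produces thousands of metric points that the UI has to fetch and render. A minimal sketch, with the tracking URI and experiment name as placeholders:
```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # placeholder URI
mlflow.set_experiment("metric-heavy-demo")        # placeholder experiment name

# One metric logged per step yields 10,000 points for the run page to render,
# even though there is only a single run in the experiment.
with mlflow.start_run():
    for step in range(10_000):
        mlflow.log_metric("loss", 1.0 / (step + 1), step=step)
```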