MLflow worker timeout when opening UI
System information
- Have I written custom code (as opposed to using a stock example script provided in MLflow): no
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.5
- MLflow installed from (source or binary): pip install mlflow
- MLflow version (run mlflow --version): mlflow, version 0.8.2
- Python version: Python 3.6.6 :: Anaconda, Inc.
- npm version (if running the dev UI):
- Exact command to reproduce: mlflow server --file-store /bigdata/mlflow --host 0.0.0.0
Describe the problem
The MLflow UI shows the Niagara Falls error page with "Oops! Something went wrong" every time I try opening it. I've been using it for two months, but recently it started crashing, and as of today I cannot get the UI to open at all anymore.
Logs
Server logs after a fresh restart:
```
[2019-02-26 12:34:36 +0000] [9] [INFO] Starting gunicorn 19.9.0
[2019-02-26 12:34:36 +0000] [9] [INFO] Listening at: http://0.0.0.0:5000 (9)
[2019-02-26 12:34:36 +0000] [9] [INFO] Using worker: sync
[2019-02-26 12:34:36 +0000] [12] [INFO] Booting worker with pid: 12
[2019-02-26 12:34:36 +0000] [14] [INFO] Booting worker with pid: 14
[2019-02-26 12:34:36 +0000] [15] [INFO] Booting worker with pid: 15
[2019-02-26 12:34:36 +0000] [18] [INFO] Booting worker with pid: 18
[2019-02-26 12:35:30 +0000] [9] [CRITICAL] WORKER TIMEOUT (pid:14)
[2019-02-26 12:35:30 +0000] [14] [INFO] Worker exiting (pid: 14)
[2019-02-26 12:35:30 +0000] [28] [INFO] Booting worker with pid: 28
```
Browser console logs when opening the UI:
```
setupAjaxHeaders.js:22
{_xsrf: "2|a583f945|b32757069a3ea1c54e37f87dba1c1428|1549020795"}
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
at service-worker.js:1
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
at service-worker.js:1
jquery.js:9355 POST http://localhost:5000/ajax-api/2.0/preview/mlflow/runs/search net::ERR_EMPTY_RESPONSE
Actions.js:155 XHR failed
{readyState: 0, getResponseHeader: ƒ, getAllResponseHeaders: ƒ, setRequestHeader: ƒ, overrideMimeType: ƒ, …}
react-dom.production.min.js:151 TypeError: Cannot read property 'getErrorCode' of undefined
at errorRenderFunc (ExperimentPage.js:122)
at e.value (RequestStateWrapper.js:51)
at f (react-dom.production.min.js:131)
at beginWork (react-dom.production.min.js:138)
at o (react-dom.production.min.js:176)
at a (react-dom.production.min.js:176)
at x (react-dom.production.min.js:182)
at y (react-dom.production.min.js:181)
at v (react-dom.production.min.js:181)
at d (react-dom.production.min.js:180)
AppErrorBoundary.js:19 TypeError: Cannot read property 'getErrorCode' of undefined
at errorRenderFunc (ExperimentPage.js:122)
at e.value (RequestStateWrapper.js:51)
at f (react-dom.production.min.js:131)
at beginWork (react-dom.production.min.js:138)
at o (react-dom.production.min.js:176)
at a (react-dom.production.min.js:176)
at x (react-dom.production.min.js:182)
at y (react-dom.production.min.js:181)
at v (react-dom.production.min.js:181)
at d (react-dom.production.min.js:180)
:5000/#/experiments/1:1 Uncaught (in promise)
t {xhr: {…}}
```
Same problem here! I started a parameter search before the weekend and have therefore run far more experiments than before. Now I can't start the UI anymore. Where do I start troubleshooting?
It turns out that the UI simply doesn't handle too many runs (in my case it starts struggling when mlruns contains more than circa 1000 experiments). Around this threshold the UI becomes unstable (sometimes it crashes, sometimes it works, but it's never quick and responsive), and eventually there are too many runs and it won't load at all.
This goes a bit against the philosophy of being able to track all your experiments.
Would using a local DB instead of file storage help? Hosting externally is not an option for me.
As a side note: during troubleshooting I discovered that when you move runs around into different folders, it's important to update the artifact_location parameter in the main meta.yaml, otherwise you'll experience a different type of crash without a clear warning.
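For anyone who has moved their store and hit that second crash, below is a minimal sketch of the kind of fixup meant here, assuming the file-store layout where each experiment directory under the mlruns root holds a meta.yaml with an artifact_location key; the old and new path prefixes are placeholders.
```python
from pathlib import Path

import yaml  # PyYAML

MLRUNS = Path("/bigdata/mlflow")     # root of the (moved) file store
OLD_PREFIX = "/old/location/mlflow"  # hypothetical previous location
NEW_PREFIX = str(MLRUNS)

# Each experiment directory keeps a meta.yaml whose artifact_location may
# still point at the old path after a move; rewrite it to the new prefix.
for meta_path in MLRUNS.glob("*/meta.yaml"):
    meta = yaml.safe_load(meta_path.read_text())
    location = meta.get("artifact_location", "")
    if location.startswith(OLD_PREFIX):
        meta["artifact_location"] = NEW_PREFIX + location[len(OLD_PREFIX):]
        meta_path.write_text(yaml.safe_dump(meta, default_flow_style=False))
        print(f"updated {meta_path}")
```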
Same here. The UI either times out or crashes with about 650 runs... sometimes it works, mostly it doesn't.
In MLflow 1.1 the runs listing has been changed to show the first 100 runs plus a "Load more" button if you have more than 100. Could you please try it and see if that improves your situation?
We checked that out a few days ago and got exactly the same error.
So I just started running into this issue after upgrading to 1.0.0+ and noticed that the URL for static files is incorrect, causing the worker to block. Basically, if I switch http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css to http://localhost:5000/static-files/static/css/main.fbf8a477.css, these assets load fine. Anyone have a patch for this?
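A quick way to check whether the doubled /static-files prefix is what is hanging your workers is to request both URL variants directly. A minimal sketch, assuming the server is reachable at localhost:5000 and reusing the asset name from the console log above:
```python
import requests

BASE = "http://localhost:5000"          # placeholder server address
ASSET = "static/css/main.fbf8a477.css"  # asset name taken from the console log

# Compare the doubled prefix the browser requests with the corrected one.
for path in (f"/static-files/static-files/{ASSET}", f"/static-files/{ASSET}"):
    try:
        response = requests.get(BASE + path, timeout=10)
        print(response.status_code, path)
    except requests.RequestException as exc:
        print("request failed:", path, exc)
```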
Any update on this issue? It still exists on v1.2.0.0
This is unexpectedly unpleasant. I did a number of runs with the idea of sorting metrics best-to-worst afterwards, but the UI indeed crashes after more than ~1000 runs...
Moreover, it should only have to load the first 100 runs and show "Load more" afterwards.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has not been interacted with for 60 days since it was marked as stale! As such, the issue is now closed. If you have this issue or more information, please re-open the issue!
Still getting this issue in MLflow 1.4. Is the situation improved by using a database backend?
Getting the issue with 1.7 and a Postgres backend.
It is the same for 1.7 without a Postgres backend. Could this issue be re-opened?
I have just upgraded the server to version 1.9.0 (without a Postgres backend) and nothing has changed.
Adding --gunicorn-opts "--timeout 180" has helped somewhat, but the number of our experiments is constantly growing, so even 180 seconds will soon not be sufficient. And waiting that long for the results of some simple queries is rather annoying.
Could you please look into this issue?
Same issue; the number of runs is ~50. In my case the machine also runs TensorBoard, which uses a lot of RAM - it looks like less available RAM makes this issue more severe.
I face the same issue with version 1.10.0 and the file system backend. All files are generated as expected, but the same "WORKER TIMEOUT" message returns when I try to access individual records (i.e., clicking the dates' hyperlinks).
Reopening this issue as per community request and reassigning priority to get it into the queue.
@gkonstanty: where are you adding the --gunicorn-opts "--timeout 180" option?
@spott @gkonstanty: where are you adding the --gunicorn-opts "--timeout 180" option?
I couldn't get mlflow ui --gunicorn-opts "--timeout 180" to work either (error: no such option --gunicorn-opts).
But the following worked for me:
GUNICORN_CMD_ARGS="--timeout 180" mlflow ui
Sorry, @spott, I missed your message. I'm adding it to mlflow server:
mlflow server --host 0.0.0.0 -p 5000 --backend-store-uri /mlflow/data/ --default-artifact-root /mlflow/artifacts/ --gunicorn-opts "--timeout 180"
I have the same issue (version 1.14.1), and I use Postgres as the backend database. When I access the UI, the list of experiments doesn't load and the spinner just keeps spinning. To me it feels like it cannot query the database to display the data.
It happens very often (> 50% of the time when I try to access the UI, especially after a run has finished), and I don't even have a lot of experiments (<10 per collection, 2 collections in total).
MLflow 1.18, year 2021, and the issue is still here...
MLflow 1.21, same issue...
I got somewhat better performance with this:
mlflow server --backend-store-uri=postgresql://postgres:${RDS_PASSWORD}@${RDS_HOST}:5432/mlflow --default-artifact-root=${ARTIFACT_STORE} --host 0.0.0.0 --port 5000 --gunicorn-opts "--worker-class gevent --threads 3 --workers 3 --timeout 300 --keep-alive 300 --log-level INFO"
MLflow 1.26.1, still the same issue! Can you please provide a workaround or something? Even setting --gunicorn-opts is not helping.
It seems like limiting the number of experiments initially displayed and having a "Load more" button, similar to how the runs pages work, might help.
The issue indeed presents only when there are about 1000+ experiments. The Python APIs continue to function fine with a large number of experiments; it's only the UI/JS side that gets bogged down.
I tracked down how loading more runs was implemented in https://github.com/mlflow/mlflow/pull/1564. Maybe some of that implementation could be borrowed?
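To illustrate the point that the Python API stays responsive, here is a minimal sketch that pages through runs 100 at a time with the client API, mirroring what a "Load more" button would do; the tracking URI and experiment name are placeholders:
```python
from mlflow.tracking import MlflowClient

# Point the client at the same tracking server the UI talks to (placeholder URI).
client = MlflowClient(tracking_uri="http://localhost:5000")

experiment = client.get_experiment_by_name("my-parameter-search")  # placeholder name
assert experiment is not None, "experiment not found"

runs, page_token = [], None
while True:
    # Fetch runs 100 at a time instead of asking for everything at once.
    batch = client.search_runs(
        [experiment.experiment_id], max_results=100, page_token=page_token
    )
    runs.extend(batch)
    page_token = batch.token
    if not page_token:
        break

print(f"fetched {len(runs)} runs without the request timing out")
```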
I only have 3 to 5 experiments but still get this error. I just have a lot of metrics and parameters in each experiment, so the "Load more" idea is not possible.
Well, I found one workaround here. Try mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME --gunicorn-opts "--timeout 0"
This will wait until the data transfer finishes and the page loads. I had to delete a few of my experiments from my S3 bucket to make it load a little faster, which I don't think is a very welcome workaround.
We'll have to wait for a permanent fix for this.
With a large timeout set, our problem seems to be only with the UI generating the experiments table with 1000+ experiments. I wonder if defaulting the experiments sidebar to hidden would help as a short-term fix? Collapsing the sidebar seems to fix the problem (after waiting a few minutes for it to load). Yes, it would break almost immediately when someone clicks to expand it, but a user could still use the UI if they knew the experiment ID beforehand (see the sketch below for looking one up via the API).
Maybe these are really two separate issues: one for large numbers of runs and one for the experiments on the home page?
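As a stopgap along those lines, the experiment ID can be looked up through the tracking API and the UI opened directly at /#/experiments/<id>, skipping the sidebar. A minimal sketch, with the tracking URI and experiment name as placeholders:
```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")  # placeholder URI

# Resolve the experiment ID through the API so the UI can be opened directly
# at its experiment page instead of waiting for the experiments sidebar.
experiment = client.get_experiment_by_name("my-parameter-search")  # placeholder name
if experiment is not None:
    print(f"open http://localhost:5000/#/experiments/{experiment.experiment_id}")
else:
    print("experiment not found")
```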
The problem is not only a large number of experiments but also a large number of logged experiment metrics, as in my case. Thus the issue is due to bad UI design, and I suppose it can only be fixed with a big refactoring.
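For reference, this is roughly what a metric-heavy run looks like: a single run logging one value per step already produces thousands of metric points that the UI has to fetch and render. A minimal sketch, with the tracking URI and experiment name as placeholders:
```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # placeholder URI
mlflow.set_experiment("metric-heavy-demo")        # placeholder experiment name

# One metric logged per step yields 10,000 points for the run page to render,
# even though there is only a single run in the experiment.
with mlflow.start_run():
    for step in range(10_000):
        mlflow.log_metric("loss", 1.0 / (step + 1), step=step)
```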