[BUG] MLflow returns error 504 after uploading large files (800 MB+) with mlflow.log_artifact()
Issues Policy acknowledgement
- [X] I have read and agree to submit bug reports in accordance with the issues policy
Willingness to contribute
Yes. I can contribute a fix for this bug independently.
MLflow version
- Client: 2.0.1
- Tracking server: 2.0.1
System information
OS: Red Hat Enterprise Linux release 8.6 (Ootpa)
Python: 3.10.8
Describe the problem
After uploading files larger than roughly 800 MB, the client receives a 504 error.
mlflow.log_artifact() raises an error on the client side when the file size exceeds about 800 MB. The file itself is uploaded successfully and the run finishes, but the client then fails with a 504 error, as shown in the Stack trace section below.
Tracking information
python run.py
MLflow version: 2.0.1
Tracking URI: https://mlflow-new-deploy.internal.org.com/
experiment_id 17
System information: Linux #1 SMP Mon Jul 18 11:14:02 EDT 2022
Python version: 3.10.8
MLflow version: 2.0.1
MLflow module location: /home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: https://mlflow-new-deploy.internal.org.com/
Registry URI: https://mlflow-new-deploy.internal.org.com/
Active experiment ID: 17
Active run ID: 3a80930872d24ad0b0f245bc66d039ff
Active run artifact URI: mlflow-artifacts:/17/3a80930872d24ad0b0f245bc66d039ff/artifacts
MLflow environment variables: {
"MLFLOW_TRACKING_INSECURE_TLS": "True"
}
MLflow dependencies: {
"click": "8.1.3",
"cloudpickle": "2.2.0",
"databricks-cli": "0.17.4",
"entrypoints": "0.4",
"gitpython": "3.1.29",
"pyyaml": "6.0",
"protobuf": "4.21.11",
"pytz": "2022.7",
"requests": "2.28.1",
"packaging": "21.3",
"importlib-metadata": "5.2.0",
"sqlparse": "0.4.3",
"alembic": "1.9.0",
"docker": "6.0.0",
"Flask": "2.2.2",
"numpy": "1.23.5",
"scipy": "1.9.3",
"pandas": "1.5.2",
"querystring-parser": "1.2.4",
"sqlalchemy": "1.4.45",
"scikit-learn": "1.2.0",
"pyarrow": "10.0.1",
"shap": "0.41.0",
"markdown": "3.4.1",
"matplotlib": "3.6.2",
"gunicorn": "20.1.0",
"Jinja2": "3.1.2"
}
write output
For larger files, the output freezes at this point and later shows the 504 error.
==================================
Command:
mlflow db upgrade "${BACKEND_URI}"; mlflow server --host 0.0.0.0
--backend-store-uri "${BACKEND_URI}" --artifacts-destination
"${ARTIFACT_ROOT}/mlartifacts/" --serve-artifacts --gunicorn-opts
"--log-level debug --timeout 8000 --graceful-timeout 75
--keep-alive 3600" --expose-prometheus "/mlflow/metrics"
The keep-alive and timeout options were added as part of troubleshooting.
Code to reproduce issue
Code:
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri('https://mlflow-new-deploy.internal.org.cloud/')
client = MlflowClient()
experiment_name = 'mlflow_2.0.1_testing'
print("MLflow version:", mlflow.__version__)
mlflow.set_experiment(experiment_name)
print("Tracking URI:", mlflow.get_tracking_uri())
experiment_id = client.get_experiment_by_name(experiment_name).experiment_id
print("experiment_id", experiment_id)
experiment = mlflow.get_experiment(experiment_id)
mlflow.start_run()
mlflow.doctor()
mlflow.log_metric("foo", 2)
mlflow.log_metric("a", 4)
print("write output")
# Logging a file larger than ~800 MB triggers the 504 error on the client
mlflow.log_artifact("largefile_latest")
print("Artifact URI:", mlflow.get_artifact_uri())
print("Artifact Location: {}".format(experiment.artifact_location))
artifact_uri = mlflow.get_artifact_uri()
mlflow.end_run()
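For anyone trying to reproduce this without an existing large file, a minimal sketch for generating a dummy file of the reported size before calling mlflow.log_artifact(); the file name largefile_latest and the ~1.8 GB size are placeholders taken from the report, not part of the original script:

import os

def make_dummy_file(path="largefile_latest", size_gb=1.8, chunk_mb=64):
    # Write zero-filled chunks until the file reaches roughly the target size.
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    target = int(size_gb * 1024 ** 3)
    written = 0
    with open(path, "wb") as f:
        while written < target:
            f.write(chunk)
            written += len(chunk)

make_dummy_file()  # assumes enough free disk space in the working directory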
Stack trace
(mlflow_env) [userid@server-dl-login4:~/mlflow-testing] $ python run.py
MLflow version: 2.0.1
2022/12/19 18:03:09 INFO mlflow.tracking.fluent: Experiment with name 'mlflow_2_testing' does not exist. Creating a new experiment.
Tracking URI: https://mlflow-new-deploy.internal.org.cloud/
experiment_id 16
Active run_id: d18f0bd33976488d8fa34bc283c8e2a2
write output
Traceback (most recent call last):
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
[Previous line repeated 2 more times]
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 868, in urlopen
retries = retries.increment(method, url, response=response, _pool=self)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/16/d18f0bd33976488d8fa34bc283c8e2a2/artifacts/1.8gbfile (Caused by ResponseError('too many 504 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 166, in http_request
return _get_http_response_with_retries(
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 97, in _get_http_response_with_retries
return session.request(method, url, **kwargs)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 556, in send
raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/16/d18f0bd33976488d8fa34bc283c8e2a2/artifacts/1.8gbfile (Caused by ResponseError('too many 504 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/userid/mlflow-testing/run.py", line 87, in <module>
mlflow.log_artifact("1.8gbfile")
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 778, in log_artifact
MlflowClient().log_artifact(run_id, local_path, artifact_path)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/client.py", line 1002, in log_artifact
self._tracking_client.log_artifact(run_id, local_path, artifact_path)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 416, in log_artifact
artifact_repo.log_artifact(local_path, artifact_path)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/store/artifact/http_artifact_repo.py", line 25, in log_artifact
resp = http_request(self._host_creds, endpoint, "PUT", data=f)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 184, in http_request
raise MlflowException("API request to %s failed with exception %s" % (url, e))
mlflow.exceptions.MlflowException: API request to https://mlflow-new-deploy.internal.org.cloud/api/2.0/mlflow-artifacts/artifacts/16/d18f0bd33976488d8fa34bc283c8e2a2/artifacts/1.8gbfile failed with exception HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/16/d18f0bd33976488d8fa34bc283c8e2a2/artifacts/1.8gbfile (Caused by ResponseError('too many 504 error responses'))
Other info / logs
2022/12/20 05:49:18 INFO mlflow.store.db.utils: Updating database tables
INFO [alembic.runtime.migration] Context impl MSSQLImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
[2022-12-20 05:49:22 +0000] [73] [DEBUG] Current configuration:
config: ./gunicorn.conf.py
wsgi_app: None
bind: ['0.0.0.0:5000']
backlog: 2048
workers: 4
worker_class: sync
threads: 1
worker_connections: 1000
max_requests: 0
max_requests_jitter: 0
timeout: 8000
graceful_timeout: 75
keepalive: 3600
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
print_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: None
user: 1002930000
group: 0
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: None
disable_redirect_access_to_syslog: False
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
dogstatsd_tags:
statsd_prefix:
proc_name: None
default_proc_name: mlflow.server:app
pythonpath: None
paste: None
on_starting: <function OnStarting.on_starting at 0x7f7e3a55a680>
on_reload: <function OnReload.on_reload at 0x7f7e3a55a7a0>
when_ready: <function WhenReady.when_ready at 0x7f7e3a55a8c0>
pre_fork: <function Prefork.pre_fork at 0x7f7e3a55a9e0>
post_fork: <function Postfork.post_fork at 0x7f7e3a55ab00>
post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f7e3a55ac20>
worker_int: <function WorkerInt.worker_int at 0x7f7e3a55ad40>
worker_abort: <function WorkerAbort.worker_abort at 0x7f7e3a55ae60>
pre_exec: <function PreExec.pre_exec at 0x7f7e3a55af80>
pre_request: <function PreRequest.pre_request at 0x7f7e3a55b0a0>
post_request: <function PostRequest.post_request at 0x7f7e3a55b130>
child_exit: <function ChildExit.child_exit at 0x7f7e3a55b250>
worker_exit: <function WorkerExit.worker_exit at 0x7f7e3a55b370>
nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f7e3a55b490>
on_exit: <function OnExit.on_exit at 0x7f7e3a55b5b0>
proxy_protocol: False
proxy_allow_ips: ['127.0.0.1']
keyfile: None
certfile: None
ssl_version: 2
cert_reqs: 0
ca_certs: None
suppress_ragged_eofs: True
do_handshake_on_connect: False
ciphers: None
raw_paste_global_conf: []
strip_header_spaces: False
[2022-12-20 05:49:22 +0000] [73] [INFO] Starting gunicorn 20.1.0
[2022-12-20 05:49:22 +0000] [73] [DEBUG] Arbiter booted
[2022-12-20 05:49:22 +0000] [73] [INFO] Listening at: http://0.0.0.0:5000 (73)
[2022-12-20 05:49:22 +0000] [73] [INFO] Using worker: sync
[2022-12-20 05:49:22 +0000] [74] [INFO] Booting worker with pid: 74
[2022-12-20 05:49:22 +0000] [75] [INFO] Booting worker with pid: 75
[2022-12-20 05:49:22 +0000] [76] [INFO] Booting worker with pid: 76
[2022-12-20 05:49:22 +0000] [77] [INFO] Booting worker with pid: 77
[2022-12-20 05:49:22 +0000] [73] [DEBUG] 4 workers
[2022-12-20 05:49:33 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:57:30 +0000] [77] [DEBUG] GET /api/2.0/mlflow/experiments/get-by-name
[2022-12-20 05:57:30 +0000] [77] [DEBUG] GET /api/2.0/mlflow/experiments/get-by-name
[2022-12-20 05:57:30 +0000] [77] [DEBUG] GET /api/2.0/mlflow/experiments/get
[2022-12-20 05:57:30 +0000] [74] [DEBUG] POST /api/2.0/mlflow/runs/create
[2022-12-20 05:57:30 +0000] [74] [DEBUG] POST /api/2.0/mlflow/runs/log-metric
[2022-12-20 05:57:30 +0000] [74] [DEBUG] POST /api/2.0/mlflow/runs/log-metric
[2022-12-20 05:57:30 +0000] [74] [DEBUG] GET /api/2.0/mlflow/runs/get
[2022-12-20 05:57:30 +0000] [74] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 05:57:38 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:57:47 +0000] [77] [DEBUG] GET /static-files/static/media/fontawesome-webfont.20fd1704ea223900efa9.woff2
[2022-12-20 05:57:49 +0000] [76] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 05:57:49 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:57:49 +0000] [77] [DEBUG] GET /static-files/static/js/547.a604119a.chunk.js
[2022-12-20 05:57:57 +0000] [76] [DEBUG] GET /static-files/static/media/laptop.f3a6b3016fbf319305f629fcbcf937a9.svg
[2022-12-20 05:58:16 +0000] [76] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 05:58:24 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:58:45 +0000] [74] [DEBUG] Ignoring connection reset
[2022-12-20 05:59:25 +0000] [75] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 05:59:28 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:59:38 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:59:40 +0000] [76] [DEBUG] Ignoring connection reset
[2022-12-20 06:00:28 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:00:28 +0000] [76] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 06:00:38 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:00:41 +0000] [75] [DEBUG] Ignoring connection reset
[2022-12-20 06:00:42 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:01:40 +0000] [76] [DEBUG] Ignoring connection reset
[2022-12-20 06:01:40 +0000] [77] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 06:01:42 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:02:46 +0000] [77] [DEBUG] Ignoring connection reset
[2022-12-20 06:02:49 +0000] [77] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/search
[2022-12-20 06:02:51 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:02:51 +0000] [75] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:02:51 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:02:52 +0000] [77] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:02:53 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:02:54 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:03:00 +0000] [75] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:03:00 +0000] [76] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 06:03:01 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:03:49 +0000] [77] [DEBUG] POST /api/2.0/mlflow/runs/update
[2022-12-20 06:03:54 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:04:02 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:04:09 +0000] [77] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:04:12 +0000] [76] [DEBUG] Ignoring connection reset
[2022-12-20 06:04:16 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:08:26 +0000] [77] [DEBUG] GET //
[2022-12-20 06:08:26 +0000] [75] [DEBUG] GET //static-files/static/css/main.3b6f4584.css
[2022-12-20 06:08:26 +0000] [77] [DEBUG] GET //static-files/static/js/main.6125589f.js
[2022-12-20 06:08:34 +0000] [77] [DEBUG] GET //ajax-api/2.0/mlflow/experiments/search
[2022-12-20 06:08:34 +0000] [74] [DEBUG] GET //static-files/static/media/home-logo.b14e3dd7dc63ea1769c6.png
[2022-12-20 06:08:34 +0000] [75] [DEBUG] GET //static-files/static/js/714.c7ed3611.chunk.js
[2022-12-20 06:08:35 +0000] [75] [DEBUG] GET //static-files/favicon.ico
[2022-12-20 06:08:35 +0000] [77] [DEBUG] POST //ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:08:35 +0000] [76] [DEBUG] GET //ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:08:35 +0000] [74] [DEBUG] GET //static-files/static/css/547.f3323e81.chunk.css
[2022-12-20 06:08:35 +0000] [76] [DEBUG] POST //ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:08:35 +0000] [76] [DEBUG] GET //static-files/favicon.ico
[2022-12-20 06:08:36 +0000] [74] [DEBUG] GET //static-files/favicon.ico
[2022-12-20 06:08:36 +0000] [74] [DEBUG] GET //static-files/static/js/547.a604119a.chunk.js
[2022-12-20 06:08:36 +0000] [76] [DEBUG] GET //static-files/favicon.ico
[2022-12-20 06:08:36 +0000] [77] [DEBUG] GET //static-files/static/js/869.aae22f22.chunk.js
[2022-12-20 06:08:39 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:10:46 +0000] [76] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:10:47 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
What component(s) does this bug affect?
- [X] area/artifacts: Artifact stores and artifact logging
- [X] area/build: Build and test infrastructure for MLflow
- [ ] area/docs: MLflow documentation pages
- [ ] area/examples: Example code
- [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] area/models: MLmodel format, model serialization/deserialization, flavors
- [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] area/projects: MLproject format, project running backends
- [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- [ ] area/server-infra: MLflow Tracking server backend
- [X] area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] area/windows: Windows support
What language(s) does this bug affect?
- [ ] language/r: R APIs and clients
- [ ] language/java: Java APIs and clients
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/azure: Azure and Azure ML integrations
- [ ] integrations/sagemaker: SageMaker integrations
- [ ] integrations/databricks: Databricks integrations
@Neethugkp Can you clean up the issue description? Please wrap code and logs with a code block.
@harupy: Updated. Hope it's readable.
@Neethugkp Can you upload a small file?
@harupy: Uploads of files smaller than 750 MB work without any error. Larger files also get uploaded to S3, but after the upload succeeds, code execution abruptly terminates with a 504 error (the lines after mlflow.log_artifact(largefile) are not executed).
MLflow server logs for a smaller file upload:
[2022-12-22 17:03:04 +0000] [74] [INFO] Starting gunicorn 20.1.0
[2022-12-22 17:03:04 +0000] [74] [DEBUG] Arbiter booted
[2022-12-22 17:03:04 +0000] [74] [INFO] Listening at: http://0.0.0.0:5000 (74)
[2022-12-22 17:03:04 +0000] [74] [INFO] Using worker: sync
[2022-12-22 17:03:04 +0000] [75] [INFO] Booting worker with pid: 75
[2022-12-22 17:03:04 +0000] [76] [INFO] Booting worker with pid: 76
[2022-12-22 17:03:04 +0000] [77] [INFO] Booting worker with pid: 77
[2022-12-22 17:03:04 +0000] [74] [DEBUG] 4 workers
[2022-12-22 17:03:04 +0000] [78] [INFO] Booting worker with pid: 78
[2022-12-22 17:07:11 +0000] [76] [DEBUG] GET /api/2.0/mlflow/experiments/get-by-name
[2022-12-22 17:07:13 +0000] [78] [DEBUG] POST /api/2.0/mlflow/experiments/create
[2022-12-22 17:07:15 +0000] [75] [DEBUG] GET /api/2.0/mlflow/experiments/get
[2022-12-22 17:07:18 +0000] [76] [DEBUG] GET /api/2.0/mlflow/experiments/get-by-name
[2022-12-22 17:07:21 +0000] [75] [DEBUG] GET /api/2.0/mlflow/experiments/get
[2022-12-22 17:07:25 +0000] [76] [DEBUG] POST /api/2.0/mlflow/runs/create
[2022-12-22 17:07:29 +0000] [75] [DEBUG] POST /api/2.0/mlflow/runs/log-metric
[2022-12-22 17:07:31 +0000] [77] [DEBUG] POST /api/2.0/mlflow/runs/log-metric
[2022-12-22 17:07:34 +0000] [75] [DEBUG] GET /api/2.0/mlflow/runs/get
[2022-12-22 17:07:36 +0000] [78] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/18/edec56b50a5b47f194b2cbc8333f2625/artifacts/output.txt
[2022-12-22 17:07:39 +0000] [75] [DEBUG] POST /api/2.0/mlflow/runs/update
@Neethugkp Increasing MLFLOW_HTTP_REQUEST_TIMEOUT (environment variable, default value: 120) might help:
MLFLOW_HTTP_REQUEST_TIMEOUT=360 python run.py
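If prefixing the command line is inconvenient, the same idea can be sketched in Python. This assumes the client reads MLFLOW_HTTP_REQUEST_TIMEOUT from the environment at request time; 360 is simply the value suggested above:

import os

# Assumption: MLflow reads MLFLOW_HTTP_REQUEST_TIMEOUT lazily when it issues
# HTTP requests, so setting it before any tracking calls has the same effect
# as prefixing the command line. 360 is the value suggested above.
os.environ["MLFLOW_HTTP_REQUEST_TIMEOUT"] = "360"

import mlflow  # imported after the variable is set, to be safe
mlflow.set_tracking_uri("https://mlflow-new-deploy.internal.org.cloud/")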
@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.
> @Neethugkp Increasing MLFLOW_HTTP_REQUEST_TIMEOUT (environment variable, default value: 120) might help: MLFLOW_HTTP_REQUEST_TIMEOUT=360 python run.py
@harupy The MLFLOW_HTTP_REQUEST_TIMEOUT=360 environment variable is set.
There is no change for large-file uploads: after the call to mlflow.log_artifact(largefile), execution still terminates with a 504 error.
Traceback (most recent call last):
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
[Previous line repeated 2 more times]
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 868, in urlopen
retries = retries.increment(method, url, response=response, _pool=self)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/17/9aa5d915e62b4298b43b78bc7d41ec54/artifacts/largefile_latest (Caused by ResponseError('too many 504 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 166, in http_request
return _get_http_response_with_retries(
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 97, in _get_http_response_with_retries
return session.request(method, url, **kwargs)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 556, in send
raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/17/9aa5d915e62b4298b43b78bc7d41ec54/artifacts/largefile_latest (Caused by ResponseError('too many 504 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/userid/mlflow-testing/run.py", line 88, in <module>
mlflow.log_artifact("largefile_latest")
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 778, in log_artifact
MlflowClient().log_artifact(run_id, local_path, artifact_path)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/client.py", line 1002, in log_artifact
self._tracking_client.log_artifact(run_id, local_path, artifact_path)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 416, in log_artifact
artifact_repo.log_artifact(local_path, artifact_path)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/store/artifact/http_artifact_repo.py", line 25, in log_artifact
resp = http_request(self._host_creds, endpoint, "PUT", data=f)
File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 184, in http_request
raise MlflowException("API request to %s failed with exception %s" % (url, e))
mlflow.exceptions.MlflowException: API request to https://mlflow-new-deploy.internal.org.cloud/api/2.0/mlflow-artifacts/artifacts/17/9aa5d915e62b4298b43b78bc7d41ec54/artifacts/largefile_latest failed with exception HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/17/9aa5d915e62b4298b43b78bc7d41ec54/artifacts/largefile_latest (Caused by ResponseError('too many 504 error responses'))
@BenWilson2 @dbczumar @harupy @WeichenXu123 We have tested the latest version, 2.2.1.
Large files are getting uploaded (tested up to logging 3 GB files), but at the end of execution the client still throws a 504 error.
MLflow version: 2.2.1
Tracking URI: https://mlflow-rare23.internal.cloud/
experiment_id 3
System information: Linux #1 SMP Wed Dec 14 16:00:01 EST 2022
Python version: 3.10.8
MLflow version: 2.2.1
MLflow module location: /home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: https://mlflow-rare23.internal.cloud/
Registry URI: https://mlflow-rare23.internal.cloud/
Active experiment ID: 3
Active run ID: 0ddf506217ca487c9a6493de3939c992
Active run artifact URI: mlflow-artifacts:/3/0ddf506217ca487c9a6493de3939c992/artifacts
MLflow environment variables:
MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT: 1200
MLFLOW_TRACKING_INSECURE_TLS: True
MLflow dependencies:
Flask: 2.2.2
Jinja2: 3.1.2
alembic: 1.9.1
click: 8.1.3
cloudpickle: 2.2.1
databricks-cli: 0.17.4
docker: 6.0.0
entrypoints: 0.4
gitpython: 3.1.30
gunicorn: 20.1.0
importlib-metadata: 4.11.4
markdown: 3.4.1
matplotlib: 3.6.3
numpy: 1.23.5
packaging: 22.0
pandas: 1.5.3
protobuf: 4.21.12
pyarrow: 10.0.1
pytz: 2022.7.1
pyyaml: 6.0
querystring-parser: 1.2.4
requests: 2.28.2
scikit-learn: 1.2.1
scipy: 1.10.0
shap: 0.41.0
sqlalchemy: 1.4.46
sqlparse: 0.4.3
write output
Traceback (most recent call last):
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
[Previous line repeated 2 more times]
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 868, in urlopen
retries = retries.increment(method, url, response=response, _pool=self)
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mlflow-rare23.internal.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/3/0ddf506217ca487c9a6493de3939c992/artifacts/largefile (Caused by ResponseError('too many 504 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 167, in http_request
return _get_http_response_with_retries(
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 98, in _get_http_response_with_retries
return session.request(method, url, **kwargs)
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 556, in send
raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='mlflow-rare23.internal.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/3/0ddf506217ca487c9a6493de3939c992/artifacts/largefile (Caused by ResponseError('too many 504 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/<userid>/mlflow-testing/run.py", line 98, in <module>
mlflow.log_artifact("largefile")
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 783, in log_artifact
MlflowClient().log_artifact(run_id, local_path, artifact_path)
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/client.py", line 1023, in log_artifact
self._tracking_client.log_artifact(run_id, local_path, artifact_path)
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 439, in log_artifact
artifact_repo.log_artifact(local_path, artifact_path)
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/store/artifact/http_artifact_repo.py", line 25, in log_artifact
resp = http_request(self._host_creds, endpoint, "PUT", data=f)
File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 185, in http_request
raise MlflowException(f"API request to {url} failed with exception {e}")
mlflow.exceptions.MlflowException: API request to https://mlflow-rare23.internal.cloud/api/2.0/mlflow-artifacts/artifacts/3/0ddf506217ca487c9a6493de3939c992/artifacts/largefile failed with exception HTTPSConnectionPool(host='mlflow-rare23.internal.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/3/0ddf506217ca487c9a6493de3939c992/artifacts/largefile (Caused by ResponseError('too many 504 error responses'))
Same here with mlflow 2.4.0 (client) and mlflow 2.4.1 (server) and a file of 830 MB. Any insights or workarounds?
> Same here with mlflow 2.4.0 (client) and mlflow 2.4.1 (server) and a file of 830 MB. Any insights or workarounds?
You can work around it by replacing the call to mlflow.log_artifacts() with a sequence of calls to mlflow.log_artifact() for all files in the local directory recursively, while catching and ignoring the exception for large files. Not exactly pretty but seems to do the job for now. For reference:
import os
import mlflow

def log_artifacts(local_dir: str, artifact_path: str):
    # Walk the local directory and upload each file individually via mlflow.log_artifact().
    for root, _, files in os.walk(local_dir):
        for file in files:
            upload_path = artifact_path
            if root != local_dir:
                rel_path = os.path.relpath(root, local_dir)
                upload_path = os.path.join(upload_path, rel_path)
            try:
                mlflow.log_artifact(local_path=os.path.join(root, file), artifact_path=upload_path)
            except mlflow.exceptions.MlflowException:
                if os.path.getsize(os.path.join(root, file)) > 750 * pow(1000, 2):
                    # Workaround for https://github.com/mlflow/mlflow/issues/7564:
                    # ignore the spurious 504 for large files, which are uploaded anyway.
                    pass
                else:
                    raise
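For reference, a usage sketch of the helper above inside an active run; the directory name model_output and the artifact path outputs are placeholder names, not from the original comment:

with mlflow.start_run():
    # Uploads every file under ./model_output individually, ignoring the
    # spurious 504 for files above the ~750 MB threshold.
    log_artifacts("model_output", "outputs")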
@Earthwings The workaround that worked for us was increasing the resource limits (CPU/memory) of the tracking server. From version 2.2.0 onwards it works.
@harupy We would like to reopen this issue, as uploads still fail for files larger than 3 GB.
In older versions, uploads failed for files larger than 800 MB. As a workaround, resource limits were increased, which fixed uploads up to about 2.5 GB.
Please consider a permanent fix for this issue. Could you incorporate the workaround provided by Earthwings in https://github.com/mlflow/mlflow/issues/7564#issuecomment-1717949904?
I'm experiencing a similar issue while trying to log a model of size 3.7 GB using MLflow version 2.7.1. If you could assist, I'd greatly appreciate it.
Did you try the workaround from https://github.com/mlflow/mlflow/issues/7564#issuecomment-1717949904 yet?
My issue got fixed by running the server with --gunicorn-opts="--timeout 900", following the comment given in the link.
> My issue got fixed by running the server with --gunicorn-opts="--timeout 900", following the comment given in the link.
This worked for me too. Thank you for sharing!