[BUG] MLflow returns error 504 after uploading large files (800 MB+). Error with mlflow.log_artifact()

Open Neethugkp opened this issue 2 years ago • 17 comments

Issues Policy acknowledgement

  • [X] I have read and agree to submit bug reports in accordance with the issues policy

Willingness to contribute

Yes. I can contribute a fix for this bug independently.

MLflow version

  • Client: 2.0.1
  • Tracking server: 2.0.1

System information

OS: Red Hat Enterprise Linux release 8.6 (Ootpa)
Python: 3.10.8

Describe the problem

After uploading files of 800 MB or more, the client receives a 504 error.

mlflow.log_artifact() throws an error on the client when the file size is over 800 MB. The file itself is uploaded successfully and the run finishes, but on the client side the call fails with a 504 error.

(The client-side console output and full stack trace for this failure are reproduced in the "Stack trace" section below.)

Tracking information

python run.py
MLflow version: 2.0.1
Tracking URI: https://mlflow-new-deploy.internal.org.com/
experiment_id 17
System information: Linux #1 SMP Mon Jul 18 11:14:02 EDT 2022
Python version: 3.10.8
MLflow version: 2.0.1
MLflow module location: /home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: https://mlflow-new-deploy.internal.org.com/
Registry URI: https://mlflow-new-deploy.internal.org.com/
Active experiment ID: 17
Active run ID: 3a80930872d24ad0b0f245bc66d039ff
Active run artifact URI: mlflow-artifacts:/17/3a80930872d24ad0b0f245bc66d039ff/artifacts
MLflow environment variables: {
    "MLFLOW_TRACKING_INSECURE_TLS": "True"
}
MLflow dependencies: {
    "click": "8.1.3",
    "cloudpickle": "2.2.0",
    "databricks-cli": "0.17.4",
    "entrypoints": "0.4",
    "gitpython": "3.1.29",
    "pyyaml": "6.0",
    "protobuf": "4.21.11",
    "pytz": "2022.7",
    "requests": "2.28.1",
    "packaging": "21.3",
    "importlib-metadata": "5.2.0",
    "sqlparse": "0.4.3",
    "alembic": "1.9.0",
    "docker": "6.0.0",
    "Flask": "2.2.2",
    "numpy": "1.23.5",
    "scipy": "1.9.3",
    "pandas": "1.5.2",
    "querystring-parser": "1.2.4",
    "sqlalchemy": "1.4.45",
    "scikit-learn": "1.2.0",
    "pyarrow": "10.0.1",
    "shap": "0.41.0",
    "markdown": "3.4.1",
    "matplotlib": "3.6.2",
    "gunicorn": "20.1.0",
    "Jinja2": "3.1.2"
}
write output

With larger files the output freezes at this point and later shows the 504 error

==================================

Command:

mlflow db upgrade "${BACKEND_URI}"
mlflow server --host 0.0.0.0 \
  --backend-store-uri "${BACKEND_URI}" \
  --artifacts-destination "${ARTIFACT_ROOT}/mlartifacts/" \
  --serve-artifacts \
  --gunicorn-opts "--log-level debug --timeout 8000 --graceful-timeout 75 --keep-alive 3600" \
  --expose-prometheus "/mlflow/metrics"

The keep-alive and timeout options were added as part of troubleshooting.

Code to reproduce issue

Code:

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri('https://mlflow-new-deploy.internal.org.cloud/')
client = MlflowClient()
experiment_name = 'mlflow_2.0.1_testing'

print("MLflow version:", mlflow.__version__)

mlflow.set_experiment(experiment_name)

print("Tracking URI:", mlflow.get_tracking_uri())

experiment_id = client.get_experiment_by_name(experiment_name).experiment_id
print("experiment_id", experiment_id)
experiment = mlflow.get_experiment(experiment_id)

mlflow.start_run()
mlflow.doctor()
mlflow.log_metric("foo", 2)
mlflow.log_metric("a", 4)

print("write output")
# Uploading this large (~1.8 GB) file is what triggers the 504 on the client.
mlflow.log_artifact("largefile_latest")

print("Artifact URI:", mlflow.get_artifact_uri())
print("Artifact Location: {}".format(experiment.artifact_location))

artifact_uri = mlflow.get_artifact_uri()
mlflow.end_run()
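
The script assumes a large local file named largefile_latest already exists. A minimal sketch for generating a placeholder of roughly that size (the name and exact size are only illustrative; any comparably large file reproduces the behaviour):

import os

size_bytes = 1_800_000_000  # roughly 1.8 GB
with open("largefile_latest", "wb") as f:
    # Sparse on most filesystems, but read back as 1.8 GB of zeros,
    # so the client still streams the full size to the tracking server.
    f.seek(size_bytes - 1)
    f.write(b"\0")
print("created largefile_latest:", os.path.getsize("largefile_latest"), "bytes")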


Stack trace

(mlflow_env) [userid@server-dl-login4:~/mlflow-testing] $ python run.py
MLflow version: 2.0.1
2022/12/19 18:03:09 INFO mlflow.tracking.fluent: Experiment with name 'mlflow_2_testing' does not exist. Creating a new experiment.
Tracking URI: https://mlflow-new-deploy.internal.org.cloud/
experiment_id 16
Active run_id: d18f0bd33976488d8fa34bc283c8e2a2
write output
Traceback (most recent call last):
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  [Previous line repeated 2 more times]
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 868, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/16/d18f0bd33976488d8fa34bc283c8e2a2/artifacts/1.8gbfile (Caused by ResponseError('too many 504 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 166, in http_request
    return _get_http_response_with_retries(
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 97, in _get_http_response_with_retries
    return session.request(method, url, **kwargs)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 556, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/16/d18f0bd33976488d8fa34bc283c8e2a2/artifacts/1.8gbfile (Caused by ResponseError('too many 504 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/userid/mlflow-testing/run.py", line 87, in <module>
    mlflow.log_artifact("1.8gbfile")
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 778, in log_artifact
    MlflowClient().log_artifact(run_id, local_path, artifact_path)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/client.py", line 1002, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 416, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/store/artifact/http_artifact_repo.py", line 25, in log_artifact
    resp = http_request(self._host_creds, endpoint, "PUT", data=f)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 184, in http_request
    raise MlflowException("API request to %s failed with exception %s" % (url, e))
mlflow.exceptions.MlflowException: API request to https://mlflow-new-deploy.internal.org.cloud/api/2.0/mlflow-artifacts/artifacts/16/d18f0bd33976488d8fa34bc283c8e2a2/artifacts/1.8gbfile failed with exception HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/16/d18f0bd33976488d8fa34bc283c8e2a2/artifacts/1.8gbfile (Caused by ResponseError('too many 504 error responses'))

Other info / logs

2022/12/20 05:49:18 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl MSSQLImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
[2022-12-20 05:49:22 +0000] [73] [DEBUG] Current configuration:
  config: ./gunicorn.conf.py
  wsgi_app: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 4
  worker_class: sync
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 8000
  graceful_timeout: 75
  keepalive: 3600
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  print_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1002930000
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: mlflow.server:app
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f7e3a55a680>
  on_reload: <function OnReload.on_reload at 0x7f7e3a55a7a0>
  when_ready: <function WhenReady.when_ready at 0x7f7e3a55a8c0>
  pre_fork: <function Prefork.pre_fork at 0x7f7e3a55a9e0>
  post_fork: <function Postfork.post_fork at 0x7f7e3a55ab00>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f7e3a55ac20>
  worker_int: <function WorkerInt.worker_int at 0x7f7e3a55ad40>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f7e3a55ae60>
  pre_exec: <function PreExec.pre_exec at 0x7f7e3a55af80>
  pre_request: <function PreRequest.pre_request at 0x7f7e3a55b0a0>
  post_request: <function PostRequest.post_request at 0x7f7e3a55b130>
  child_exit: <function ChildExit.child_exit at 0x7f7e3a55b250>
  worker_exit: <function WorkerExit.worker_exit at 0x7f7e3a55b370>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f7e3a55b490>
  on_exit: <function OnExit.on_exit at 0x7f7e3a55b5b0>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2022-12-20 05:49:22 +0000] [73] [INFO] Starting gunicorn 20.1.0
[2022-12-20 05:49:22 +0000] [73] [DEBUG] Arbiter booted
[2022-12-20 05:49:22 +0000] [73] [INFO] Listening at: http://0.0.0.0:5000 (73)
[2022-12-20 05:49:22 +0000] [73] [INFO] Using worker: sync
[2022-12-20 05:49:22 +0000] [74] [INFO] Booting worker with pid: 74
[2022-12-20 05:49:22 +0000] [75] [INFO] Booting worker with pid: 75
[2022-12-20 05:49:22 +0000] [76] [INFO] Booting worker with pid: 76
[2022-12-20 05:49:22 +0000] [77] [INFO] Booting worker with pid: 77
[2022-12-20 05:49:22 +0000] [73] [DEBUG] 4 workers
[2022-12-20 05:49:33 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:57:30 +0000] [77] [DEBUG] GET /api/2.0/mlflow/experiments/get-by-name
[2022-12-20 05:57:30 +0000] [77] [DEBUG] GET /api/2.0/mlflow/experiments/get-by-name
[2022-12-20 05:57:30 +0000] [77] [DEBUG] GET /api/2.0/mlflow/experiments/get
[2022-12-20 05:57:30 +0000] [74] [DEBUG] POST /api/2.0/mlflow/runs/create
[2022-12-20 05:57:30 +0000] [74] [DEBUG] POST /api/2.0/mlflow/runs/log-metric
[2022-12-20 05:57:30 +0000] [74] [DEBUG] POST /api/2.0/mlflow/runs/log-metric
[2022-12-20 05:57:30 +0000] [74] [DEBUG] GET /api/2.0/mlflow/runs/get
[2022-12-20 05:57:30 +0000] [74] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 05:57:38 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:57:47 +0000] [77] [DEBUG] GET /static-files/static/media/fontawesome-webfont.20fd1704ea223900efa9.woff2
[2022-12-20 05:57:49 +0000] [76] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 05:57:49 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:57:49 +0000] [77] [DEBUG] GET /static-files/static/js/547.a604119a.chunk.js
[2022-12-20 05:57:57 +0000] [76] [DEBUG] GET /static-files/static/media/laptop.f3a6b3016fbf319305f629fcbcf937a9.svg
[2022-12-20 05:58:16 +0000] [76] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 05:58:24 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:58:45 +0000] [74] [DEBUG] Ignoring connection reset
[2022-12-20 05:59:25 +0000] [75] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 05:59:28 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:59:38 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 05:59:40 +0000] [76] [DEBUG] Ignoring connection reset
[2022-12-20 06:00:28 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:00:28 +0000] [76] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 06:00:38 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:00:41 +0000] [75] [DEBUG] Ignoring connection reset
[2022-12-20 06:00:42 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:01:40 +0000] [76] [DEBUG] Ignoring connection reset
[2022-12-20 06:01:40 +0000] [77] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 06:01:42 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:02:46 +0000] [77] [DEBUG] Ignoring connection reset
[2022-12-20 06:02:49 +0000] [77] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/search
[2022-12-20 06:02:51 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:02:51 +0000] [75] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:02:51 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:02:52 +0000] [77] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:02:53 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:02:54 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:03:00 +0000] [75] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:03:00 +0000] [76] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/17/acbc4f3ff7ec4fd3a1fc35c0f91d317c/artifacts/1.8gbfile
[2022-12-20 06:03:01 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:03:49 +0000] [77] [DEBUG] POST /api/2.0/mlflow/runs/update
[2022-12-20 06:03:54 +0000] [74] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:04:02 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:04:09 +0000] [77] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:04:12 +0000] [76] [DEBUG] Ignoring connection reset
[2022-12-20 06:04:16 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:08:26 +0000] [77] [DEBUG] GET //
[2022-12-20 06:08:26 +0000] [75] [DEBUG] GET //static-files/static/css/main.3b6f4584.css
[2022-12-20 06:08:26 +0000] [77] [DEBUG] GET //static-files/static/js/main.6125589f.js
[2022-12-20 06:08:34 +0000] [77] [DEBUG] GET //ajax-api/2.0/mlflow/experiments/search
[2022-12-20 06:08:34 +0000] [74] [DEBUG] GET //static-files/static/media/home-logo.b14e3dd7dc63ea1769c6.png
[2022-12-20 06:08:34 +0000] [75] [DEBUG] GET //static-files/static/js/714.c7ed3611.chunk.js
[2022-12-20 06:08:35 +0000] [75] [DEBUG] GET //static-files/favicon.ico
[2022-12-20 06:08:35 +0000] [77] [DEBUG] POST //ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:08:35 +0000] [76] [DEBUG] GET //ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:08:35 +0000] [74] [DEBUG] GET //static-files/static/css/547.f3323e81.chunk.css
[2022-12-20 06:08:35 +0000] [76] [DEBUG] POST //ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:08:35 +0000] [76] [DEBUG] GET //static-files/favicon.ico
[2022-12-20 06:08:36 +0000] [74] [DEBUG] GET //static-files/favicon.ico
[2022-12-20 06:08:36 +0000] [74] [DEBUG] GET //static-files/static/js/547.a604119a.chunk.js
[2022-12-20 06:08:36 +0000] [76] [DEBUG] GET //static-files/favicon.ico
[2022-12-20 06:08:36 +0000] [77] [DEBUG] GET //static-files/static/js/869.aae22f22.chunk.js
[2022-12-20 06:08:39 +0000] [75] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search
[2022-12-20 06:10:46 +0000] [76] [DEBUG] GET /ajax-api/2.0/mlflow/experiments/get
[2022-12-20 06:10:47 +0000] [76] [DEBUG] POST /ajax-api/2.0/mlflow/runs/search

What component(s) does this bug affect?

  • [X] area/artifacts: Artifact stores and artifact logging
  • [X] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [X] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

Neethugkp avatar Dec 20 '22 11:12 Neethugkp

@Neethugkp Can you clean up the issue description? Please wrap code and logs with a code block.

harupy avatar Dec 22 '22 11:12 harupy

@harupy :- updated. Hope it's readable.

Neethugkp avatar Dec 22 '22 15:12 Neethugkp

@Neethugkp Can you upload a small file?

harupy avatar Dec 22 '22 15:12 harupy

@harupy Uploads of files smaller than 750 MB work without any error. Larger files also get uploaded to S3, but after the upload completes the code execution abruptly terminates with a 504 error (lines after mlflow.log_artifact(largefile) are not executed).
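
Since the file itself does reach the artifact store, one stopgap to keep the rest of the script running is to catch the exception around the call; a rough sketch (this does not address the underlying timeout):

import mlflow

try:
    mlflow.log_artifact("largefile_latest")
except mlflow.exceptions.MlflowException as err:
    # The upload typically completes server-side; the 504 comes back after a gateway timeout,
    # so log the error and continue with the remaining steps.
    print("log_artifact failed on the client:", err)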

Neethugkp avatar Dec 22 '22 16:12 Neethugkp

MLflow logs of a smaller file upload

[2022-12-22 17:03:04 +0000] [74] [INFO] Starting gunicorn 20.1.0
[2022-12-22 17:03:04 +0000] [74] [DEBUG] Arbiter booted
[2022-12-22 17:03:04 +0000] [74] [INFO] Listening at: http://0.0.0.0:5000 (74)
[2022-12-22 17:03:04 +0000] [74] [INFO] Using worker: sync
[2022-12-22 17:03:04 +0000] [75] [INFO] Booting worker with pid: 75
[2022-12-22 17:03:04 +0000] [76] [INFO] Booting worker with pid: 76
[2022-12-22 17:03:04 +0000] [77] [INFO] Booting worker with pid: 77
[2022-12-22 17:03:04 +0000] [74] [DEBUG] 4 workers
[2022-12-22 17:03:04 +0000] [78] [INFO] Booting worker with pid: 78
[2022-12-22 17:07:11 +0000] [76] [DEBUG] GET /api/2.0/mlflow/experiments/get-by-name
[2022-12-22 17:07:13 +0000] [78] [DEBUG] POST /api/2.0/mlflow/experiments/create
[2022-12-22 17:07:15 +0000] [75] [DEBUG] GET /api/2.0/mlflow/experiments/get
[2022-12-22 17:07:18 +0000] [76] [DEBUG] GET /api/2.0/mlflow/experiments/get-by-name
[2022-12-22 17:07:21 +0000] [75] [DEBUG] GET /api/2.0/mlflow/experiments/get
[2022-12-22 17:07:25 +0000] [76] [DEBUG] POST /api/2.0/mlflow/runs/create
[2022-12-22 17:07:29 +0000] [75] [DEBUG] POST /api/2.0/mlflow/runs/log-metric
[2022-12-22 17:07:31 +0000] [77] [DEBUG] POST /api/2.0/mlflow/runs/log-metric
[2022-12-22 17:07:34 +0000] [75] [DEBUG] GET /api/2.0/mlflow/runs/get
[2022-12-22 17:07:36 +0000] [78] [DEBUG] PUT /api/2.0/mlflow-artifacts/artifacts/18/edec56b50a5b47f194b2cbc8333f2625/artifacts/output.txt
[2022-12-22 17:07:39 +0000] [75] [DEBUG] POST /api/2.0/mlflow/runs/update

Neethugkp avatar Dec 22 '22 17:12 Neethugkp

@Neethugkp Increasing MLFLOW_HTTP_REQUEST_TIMEOUT (environment variable, default value: 120) might help.

MLFLOW_HTTP_REQUEST_TIMEOUT=360 python run.py
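
The same value can also be set from inside the script; a sketch, equivalent to the shell form above as long as it runs before any MLflow call:

import os

os.environ["MLFLOW_HTTP_REQUEST_TIMEOUT"] = "360"  # seconds, same value as above

import mlflow
# ... rest of run.py unchanged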

harupy avatar Dec 26 '22 02:12 harupy

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

mlflow-automation avatar Dec 28 '22 00:12 mlflow-automation

@Neethugkp Increasing MLFLOW_HTTP_REQUEST_TIMEOUT (environment variable, default value: 120) might help.

MLFLOW_HTTP_REQUEST_TIMEOUT=360 python run.py

@harupy The MLFLOW_HTTP_REQUEST_TIMEOUT=360 environment variable is set.

There is no change with uploads of large files. After the call to mlflow.log_artifact(largefile), execution terminates with a 504 error.



Traceback (most recent call last):
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  [Previous line repeated 2 more times]
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 868, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/17/9aa5d915e62b4298b43b78bc7d41ec54/artifacts/largefile_latest (Caused by ResponseError('too many 504 error responses'))

 

During handling of the above exception, another exception occurred:

 

Traceback (most recent call last):
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 166, in http_request
    return _get_http_response_with_retries(
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 97, in _get_http_response_with_retries
    return session.request(method, url, **kwargs)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 556, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/17/9aa5d915e62b4298b43b78bc7d41ec54/artifacts/largefile_latest (Caused by ResponseError('too many 504 error responses'))

 

During handling of the above exception, another exception occurred:

 

Traceback (most recent call last):
  File "/home/userid/mlflow-testing/run.py", line 88, in <module>
    mlflow.log_artifact("largefile_latest")
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 778, in log_artifact
    MlflowClient().log_artifact(run_id, local_path, artifact_path)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/client.py", line 1002, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 416, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/store/artifact/http_artifact_repo.py", line 25, in log_artifact
    resp = http_request(self._host_creds, endpoint, "PUT", data=f)
  File "/home/userid/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 184, in http_request
    raise MlflowException("API request to %s failed with exception %s" % (url, e))
mlflow.exceptions.MlflowException: API request to https://mlflow-new-deploy.internal.org.cloud/api/2.0/mlflow-artifacts/artifacts/17/9aa5d915e62b4298b43b78bc7d41ec54/artifacts/largefile_latest failed with exception HTTPSConnectionPool(host='mlflow-new-deploy.internal.org.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/17/9aa5d915e62b4298b43b78bc7d41ec54/artifacts/largefile_latest (Caused by ResponseError('too many 504 error responses'))

Neethugkp avatar Jan 02 '23 17:01 Neethugkp

@BenWilson2 @dbczumar @harupy @WeichenXu123 We have tested the latest version 2.2.1.

Large files are getting uploaded (tested up to logging 3 GB files), but at the end of execution the client still gets a 504 error.

MLflow version: 2.2.1
Tracking URI: https://mlflow-rare23.internal.cloud/
experiment_id 3
System information: Linux #1 SMP Wed Dec 14 16:00:01 EST 2022
Python version: 3.10.8
MLflow version: 2.2.1
MLflow module location: /home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: https://mlflow-rare23.internal.cloud/
Registry URI: https://mlflow-rare23.internal.cloud/
Active experiment ID: 3
Active run ID: 0ddf506217ca487c9a6493de3939c992
Active run artifact URI: mlflow-artifacts:/3/0ddf506217ca487c9a6493de3939c992/artifacts
MLflow environment variables:
  MLFLOW_ARTIFACT_UPLOAD_DOWNLOAD_TIMEOUT: 1200
  MLFLOW_TRACKING_INSECURE_TLS: True
MLflow dependencies:
  Flask: 2.2.2
  Jinja2: 3.1.2
  alembic: 1.9.1
  click: 8.1.3
  cloudpickle: 2.2.1
  databricks-cli: 0.17.4
  docker: 6.0.0
  entrypoints: 0.4
  gitpython: 3.1.30
  gunicorn: 20.1.0
  importlib-metadata: 4.11.4
  markdown: 3.4.1
  matplotlib: 3.6.3
  numpy: 1.23.5
  packaging: 22.0
  pandas: 1.5.3
  protobuf: 4.21.12
  pyarrow: 10.0.1
  pytz: 2022.7.1
  pyyaml: 6.0
  querystring-parser: 1.2.4
  requests: 2.28.2
  scikit-learn: 1.2.1
  scipy: 1.10.0
  shap: 0.41.0
  sqlalchemy: 1.4.46
  sqlparse: 0.4.3
write output





Traceback (most recent call last):
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  [Previous line repeated 2 more times]
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/connectionpool.py", line 868, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mlflow-rare23.internal.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/3/0ddf506217ca487c9a6493de3939c992/artifacts/largefile (Caused by ResponseError('too many 504 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 167, in http_request
    return _get_http_response_with_retries(
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 98, in _get_http_response_with_retries
    return session.request(method, url, **kwargs)
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/requests/adapters.py", line 556, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='mlflow-rare23.internal.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/3/0ddf506217ca487c9a6493de3939c992/artifacts/largefile (Caused by ResponseError('too many 504 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/<userid>/mlflow-testing/run.py", line 98, in <module>
    mlflow.log_artifact("largefile")
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 783, in log_artifact
    MlflowClient().log_artifact(run_id, local_path, artifact_path)
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/client.py", line 1023, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 439, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/store/artifact/http_artifact_repo.py", line 25, in log_artifact
    resp = http_request(self._host_creds, endpoint, "PUT", data=f)
  File "/home/<userid>/.conda/envs/mlflow_env/lib/python3.10/site-packages/mlflow/utils/rest_utils.py", line 185, in http_request
    raise MlflowException(f"API request to {url} failed with exception {e}")
mlflow.exceptions.MlflowException: API request to https://mlflow-rare23.internal.cloud/api/2.0/mlflow-artifacts/artifacts/3/0ddf506217ca487c9a6493de3939c992/artifacts/largefile failed with exception HTTPSConnectionPool(host='mlflow-rare23.internal.cloud', port=443): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/3/0ddf506217ca487c9a6493de3939c992/artifacts/largefile (Caused by ResponseError('too many 504 error responses'))

Neethugkp avatar Mar 14 '23 14:03 Neethugkp

Same here with mlflow 2.4.0 (client) and mlflow 2.4.1 (server) and a file of 830 MB. Any insights or workarounds?

Earthwings avatar Aug 22 '23 09:08 Earthwings

Same here with mlflow 2.4.0 (client) and mlflow 2.4.1 (server) and a file of 830 MB. Any insights or workarounds?

You can work around it by replacing the call to mlflow.log_artifacts() with a sequence of calls to mlflow.log_artifact() for all files in the local directory recursively, while catching and ignoring the exception for large files. Not exactly pretty but seems to do the job for now. For reference:

import os

import mlflow


def log_artifacts(local_dir: str, artifact_path: str):
    """Recursively log every file under local_dir, ignoring upload failures on very large files."""
    for root, _, files in os.walk(local_dir):
        for file in files:
            upload_path = artifact_path
            if root != local_dir:
                rel_path = os.path.relpath(root, local_dir)
                upload_path = os.path.join(upload_path, rel_path)
            try:
                mlflow.log_artifact(local_path=os.path.join(root, file), artifact_path=upload_path)
            except mlflow.exceptions.MlflowException as ex:
                if os.path.getsize(os.path.join(root, file)) > 750 * pow(1000, 2):
                    # Workaround for https://github.com/mlflow/mlflow/issues/7564: just ignore it
                    pass
                else:
                    raise ex
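
For illustration, a hypothetical call to the helper above inside a run (the directory and artifact path are made-up names, and the helper is assumed to be defined in the same module):

import mlflow

with mlflow.start_run():
    # Uploads each file under ./outputs one by one; oversized files that fail with a 504 are skipped.
    log_artifacts("outputs", "outputs")

Because the helper is a plain module-level function, it does not interfere with mlflow.log_artifacts, which stays accessible through the mlflow namespace.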

Earthwings avatar Sep 13 '23 16:09 Earthwings

@Earthwings The workaround that worked for us was increasing the resource limits (CPU/memory). From version 2.2.0 onwards it works.

Neethugkp avatar Sep 15 '23 16:09 Neethugkp

@harupy I would like to reopen the issue, as uploads now fail for files larger than 3 GB.

In older versions uploads failed for files larger than 800 MB. As a workaround the resource limits were increased, which fixed the problem for file uploads up to about 2.5 GB.

Please consider a permanent fix for this issue. Could you incorporate the workaround provided by Earthwings in https://github.com/mlflow/mlflow/issues/7564#issuecomment-1717949904?

Neethugkp avatar Nov 15 '23 13:11 Neethugkp

I'm experiencing a similar issue while trying to log a model of size 3.7 GB using MLflow version 2.7.1. If you could assist, I'd greatly appreciate it.

anjalyv avatar Nov 24 '23 06:11 anjalyv

Did you try the workaround from https://github.com/mlflow/mlflow/issues/7564#issuecomment-1717949904 yet?

Earthwings avatar Nov 24 '23 07:11 Earthwings

My issue got fixed by running the server with --gunicorn-opts="--timeout 900" following the comment given in the link

anjalyv avatar Nov 24 '23 08:11 anjalyv

My issue got fixed by running the server with --gunicorn-opts="--timeout 900" following the comment given in the link

This worked for me too. Thank you for sharing!

rfmac-perceptus avatar Feb 03 '24 23:02 rfmac-perceptus