
[BUG] Require validation for `MlflowClient.log_batch` when updating params

Open Mathanraj-Sharma opened this issue 2 years ago • 8 comments

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

1.27.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Big Sur 11.4 (MacBook)
  • Python version: Python 3.8.13
  • yarn version, if running the dev UI:

Describe the problem

I am using mlflow.tracking.MlflowClient for tracking purposes. If I am not mistaken, run params are immutable in MLflow tracking in order to ensure the reproducibility of an experiment run.

Calling MlflowClient.log_batch with params whose keys were already logged for the run (but with different values) does not produce a proper error message.

It should produce

MlflowException: INVALID_PARAMETER_VALUE: Changing param values is not allowed. Param with key='test_param' was already logged with value='101' for run ID='4b4bc4bdf2ab4c0992fd1879e8580d29'. Attempted logging new value '100'.

Instead, it produces

MlflowException: API request to http://localhost:5000/api/2.0/mlflow/runs/log-batch failed with exception HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/log-batch (Caused by ResponseError('too many 500 error responses'))
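For reference, the sketch below (not part of the original report; the run ID is a placeholder) shows how a caller could react to the failure if the server returned the expected validation error instead of exhausting retries on 500 responses:

from mlflow.entities import Param
from mlflow.exceptions import MlflowException
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")
run_id = "4b4bc4bdf2ab4c0992fd1879e8580d29"  # placeholder: a run that already has test_param='101'

try:
    client.log_batch(run_id=run_id, params=[Param("test_param", "100")])
except MlflowException as e:
    # With proper validation this would report INVALID_PARAMETER_VALUE;
    # with the current behavior the client only sees the wrapped "Max retries exceeded" error.
    print(e.error_code, e.message)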

Tracking information

mlflow server --backend-store-uri expstore --default-artifact-root expstore --host localhost

Code to reproduce issue

import mlflow

from mlflow.tracking import MlflowClient
from mlflow.entities import Param

# create mlflow client
client = MlflowClient(tracking_uri="http://localhost:5000")

# create mlflow experiment
exp = client.create_experiment("test_exp")

# create run in the new experiment
run = client.create_run(experiment_id=exp)

# first attempt to log params
params = {
    "item_1": 55,
    "test_param": 101
}

params_arr = [Param(key, str(value)) for key, value in params.items()]

client.log_batch(
    run_id=run.info.run_id,
    params=params_arr
)


# second attempt to log params with modified value for test_param
params = {
    "item_1": 55,
    "test_param": 100,
    "new": 5
}

params_arr = [Param(key, str(value)) for key, value in params.items()]

client.log_batch(
    run_id=run.info.run_id,
    params=params_arr
)
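
Until validation is in place, a possible client-side workaround (a sketch only; dedupe_params is a hypothetical helper, not an MLflow API) is to compare against the run's already-logged params and drop conflicting keys before calling log_batch:

# Hypothetical workaround: skip params whose values would conflict with values
# already logged for the run, so log_batch never hits the failing code path.
def dedupe_params(client, run_id, new_params):
    existing = client.get_run(run_id).data.params  # dict of already-logged params
    safe, conflicting = [], []
    for p in new_params:
        if p.key in existing and existing[p.key] != p.value:
            conflicting.append(p)  # would change an immutable param value
        else:
            safe.append(p)  # new key, or same value as before
    return safe, conflicting

safe_params, conflicting_params = dedupe_params(client, run.info.run_id, params_arr)
if conflicting_params:
    print("Skipping conflicting params:", [p.key for p in conflicting_params])

client.log_batch(run_id=run.info.run_id, params=safe_params)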

Other info / logs

---------------------------------------------------------------------------
MaxRetryError                             Traceback (most recent call last)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/requests/adapters.py:489, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    488 if not chunked:
--> 489     resp = conn.urlopen(
    490         method=request.method,
    491         url=url,
    492         body=request.body,
    493         headers=request.headers,
    494         redirect=False,
    495         assert_same_host=False,
    496         preload_content=False,
    497         decode_content=False,
    498         retries=self.max_retries,
    499         timeout=timeout,
    500     )
    502 # Send the request.
    503 else:

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/connectionpool.py:876, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    875     log.debug("Retry: %s", url)
--> 876     return self.urlopen(
    877         method,
    878         url,
    879         body,
    880         headers,
    881         retries=retries,
    882         redirect=redirect,
    883         assert_same_host=assert_same_host,
    884         timeout=timeout,
    885         pool_timeout=pool_timeout,
    886         release_conn=release_conn,
    887         chunked=chunked,
    888         body_pos=body_pos,
    889         **response_kw
    890     )
    892 return response

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/connectionpool.py:876, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    875     log.debug("Retry: %s", url)
--> 876     return self.urlopen(
    877         method,
    878         url,
    879         body,
    880         headers,
    881         retries=retries,
    882         redirect=redirect,
    883         assert_same_host=assert_same_host,
    884         timeout=timeout,
    885         pool_timeout=pool_timeout,
    886         release_conn=release_conn,
    887         chunked=chunked,
    888         body_pos=body_pos,
    889         **response_kw
    890     )
    892 return response

    [... skipping similar frames: HTTPConnectionPool.urlopen at line 876 (2 times)]

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/connectionpool.py:876, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    875     log.debug("Retry: %s", url)
--> 876     return self.urlopen(
    877         method,
    878         url,
    879         body,
    880         headers,
    881         retries=retries,
    882         redirect=redirect,
    883         assert_same_host=assert_same_host,
    884         timeout=timeout,
    885         pool_timeout=pool_timeout,
    886         release_conn=release_conn,
    887         chunked=chunked,
    888         body_pos=body_pos,
    889         **response_kw
    890     )
    892 return response

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/connectionpool.py:866, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    865 try:
--> 866     retries = retries.increment(method, url, response=response, _pool=self)
    867 except MaxRetryError:

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/util/retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    591 if new_retry.is_exhausted():
--> 592     raise MaxRetryError(_pool, url, error or ResponseError(cause))
    594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/log-batch (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

RetryError                                Traceback (most recent call last)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/utils/rest_utils.py:151, in http_request(host_creds, endpoint, method, max_retries, backoff_factor, retry_codes, timeout, **kwargs)
    150 try:
--> 151     return _get_http_response_with_retries(
    152         method,
    153         url,
    154         max_retries,
    155         backoff_factor,
    156         retry_codes,
    157         headers=headers,
    158         verify=verify,
    159         timeout=timeout,
    160         **kwargs,
    161     )
    162 except Exception as e:

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/utils/rest_utils.py:91, in _get_http_response_with_retries(method, url, max_retries, backoff_factor, retry_codes, **kwargs)
     90 session = _get_request_session(max_retries, backoff_factor, retry_codes)
---> 91 return session.request(method, url, **kwargs)

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/requests/sessions.py:587, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    586 send_kwargs.update(settings)
--> 587 resp = self.send(prep, **send_kwargs)
    589 return resp

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/requests/sessions.py:701, in Session.send(self, request, **kwargs)
    700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
    703 # Total elapsed time of the request (approximately)

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/requests/adapters.py:556, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    555 if isinstance(e.reason, ResponseError):
--> 556     raise RetryError(e, request=request)
    558 if isinstance(e.reason, _ProxyError):

RetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/log-batch (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

MlflowException                           Traceback (most recent call last)
Input In [13], in <cell line: 9>()
      1 params = {
      2     "new_value": 55,
      3     "test_param": 100,
      4     "new": 5
      5 }
      7 params_arr = [Param(key, str(value)) for key, value in params.items()]
----> 9 client.log_batch(
     10     run_id=run.info.run_id,
     11     params=params_arr
     12 )

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/tracking/client.py:918, in MlflowClient.log_batch(self, run_id, metrics, params, tags)
    861 def log_batch(
    862     self,
    863     run_id: str,
   (...)
    866     tags: Sequence[RunTag] = (),
    867 ) -> None:
    868     """
    869     Log multiple metrics, params, and/or tags.
    870 
   (...)
    916         status: FINISHED
    917     """
--> 918     self._tracking_client.log_batch(run_id, metrics, params, tags)

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py:315, in TrackingServiceClient.log_batch(self, run_id, metrics, params, tags)
    312     metrics_batch = metrics[:metrics_batch_size]
    313     metrics = metrics[metrics_batch_size:]
--> 315     self.store.log_batch(
    316         run_id=run_id, metrics=metrics_batch, params=params_batch, tags=tags_batch
    317     )
    319 for metrics_batch in chunk_list(metrics, chunk_size=MAX_METRICS_PER_BATCH):
    320     self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py:309, in RestStore.log_batch(self, run_id, metrics, params, tags)
    305 tag_protos = [tag.to_proto() for tag in tags]
    306 req_body = message_to_json(
    307     LogBatch(metrics=metric_protos, params=param_protos, tags=tag_protos, run_id=run_id)
    308 )
--> 309 self._call_endpoint(LogBatch, req_body)

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
     54 endpoint, method = _METHOD_TO_INFO[api]
     55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/utils/rest_utils.py:253, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
    249     response = http_request(
    250         host_creds=host_creds, endpoint=endpoint, method=method, params=json_body
    251     )
    252 else:
--> 253     response = http_request(
    254         host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
    255     )
    256 response = verify_rest_response(response, endpoint)
    257 js_dict = json.loads(response.text)

File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/utils/rest_utils.py:163, in http_request(host_creds, endpoint, method, max_retries, backoff_factor, retry_codes, timeout, **kwargs)
    151     return _get_http_response_with_retries(
    152         method,
    153         url,
   (...)
    160         **kwargs,
    161     )
    162 except Exception as e:
--> 163     raise MlflowException("API request to %s failed with exception %s" % (url, e))

MlflowException: API request to http://localhost:5000/api/2.0/mlflow/runs/log-batch failed with exception HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/log-batch (Caused by ResponseError('too many 500 error responses'))


What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [ ] area/docs: MLflow documentation pages
  • [ ] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
  • [ ] area/projects: MLproject format, project running backends
  • [ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [X] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

Mathanraj-Sharma avatar Jun 30 '22 12:06 Mathanraj-Sharma

Makes sense! @Mathanraj-Sharma Would you create a PR to fix it? I will help review.

WeichenXu123 avatar Jun 30 '22 13:06 WeichenXu123

@WeichenXu123 Yes, I can create a PR with a fix. Please assign it to me.

Mathanraj-Sharma avatar Jun 30 '22 13:06 Mathanraj-Sharma

@WeichenXu123 I am not able to push my branch. Could you please give me permission?

$ git push origin mathan/fix/issue-616

remote: Permission to mlflow/mlflow.git denied to Mathanraj-Sharma.
fatal: unable to access 'https://github.com/mlflow/mlflow.git/': The requested URL returned error: 403

Mathanraj-Sharma avatar Jul 01 '22 11:07 Mathanraj-Sharma

@Mathanraj-Sharma You cannot push to mlflow/mlflow. Can you create a fork, push commits there, and create a PR?

harupy avatar Jul 01 '22 11:07 harupy

@harupy thanks will do it

Mathanraj-Sharma avatar Jul 01 '22 11:07 Mathanraj-Sharma

@WeichenXu123 I have created PR #6184 with an implementation to fix this bug.

I am not able to add you as a reviewer on my PR. Do I need any special permission to do that?

Mathanraj-Sharma avatar Jul 03 '22 13:07 Mathanraj-Sharma

@Mathanraj-Sharma @WeichenXu123 @harupy I've closed https://github.com/mlflow/mlflow/pull/6184 because I think we need a different approach. Instead of adding client-side validation, can we trace the source of the 500-level error being returned by mlflow server and see if we can change the exception to a 400-level error?

dbczumar avatar Jul 05 '22 20:07 dbczumar

@dbczumar could you please point out references to start fixing this from the server side?

Mathanraj-Sharma avatar Jul 06 '22 12:07 Mathanraj-Sharma
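
Not the actual fix, but a sketch of the pattern the suggestion points at: MLflow's tracking server maps MlflowException error codes to HTTP statuses, so raising the exception with INVALID_PARAMETER_VALUE (instead of letting the failure bubble up as a 500) would surface to clients as a non-retryable 400. The helper below is hypothetical; the real change would live wherever the server-side store handles the log-batch request.

from mlflow.exceptions import MlflowException
from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE

# Hypothetical helper illustrating the pattern: raise MlflowException with a
# 400-level error code so the tracking server returns a client error instead
# of a retryable 500.
def check_param_value_unchanged(key, old_value, new_value, run_id):
    if old_value is not None and old_value != new_value:
        raise MlflowException(
            f"Changing param values is not allowed. Param with key='{key}' was already "
            f"logged with value='{old_value}' for run ID='{run_id}'. "
            f"Attempted logging new value '{new_value}'.",
            error_code=INVALID_PARAMETER_VALUE,
        )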

This also seems to apply to other validation problems, e.g. if a parameter value is longer than 250 characters. But this only becomes apparent if you switch from the server to a local backend:

mlflow.exceptions.MlflowException: Param value [MASKED] had length 301, which exceeded length limit of 250

maximilianreimer avatar Dec 09 '22 14:12 maximilianreimer
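
A minimal snippet to reproduce the length-limit case described above (assuming a file-backed store; per this report, routing the same over-long value through mlflow server ends in the retried-500 failure instead of the clear MlflowException):

from mlflow.tracking import MlflowClient

# Against a local file-backed store the over-long value fails fast with the
# "exceeded length limit of 250" MlflowException quoted above.
client = MlflowClient(tracking_uri="file:./mlruns")
exp_id = client.create_experiment("param_length_check")
run = client.create_run(experiment_id=exp_id)
client.log_param(run.info.run_id, "long_param", "x" * 301)  # exceeds the 250-character limit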