[BUG] Require validation for `MlflowClient.log_batch` when updating params
Willingness to contribute
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
MLflow version
1.27.0
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacBook, macOS Big Sur (11.4)
- Python version: Python 3.8.13
- yarn version, if running the dev UI:
Describe the problem
I am using `mlflow.tracking.MlflowClient` for tracking purposes. If I am not wrong, run params are immutable in MLflow tracking in order to ensure the reproducibility of an experiment run.
Calling `MlflowClient.log_batch` to update params (with duplicate keys) does not produce a proper error message.
It should produce
MlflowException: INVALID_PARAMETER_VALUE: Changing param values is not allowed. Param with key='test_param' was already logged with value='101' for run ID='4b4bc4bdf2ab4c0992fd1879e8580d29'. Attempted logging new value '100'.
Instead, it produces
MlflowException: API request to http://localhost:5000/api/2.0/mlflow/runs/log-batch failed with exception HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/log-batch (Caused by ResponseError('too many 500 error responses'))
Tracking information
mlflow server --backend-store-uri expstore --default-artifact-root expstore --host localhost
Code to reproduce issue
import mlflow
from mlflow.tracking import MlflowClient
from mlflow.entities import Param

# create mlflow client
client = MlflowClient(tracking_uri="http://localhost:5000")

# create mlflow experiment (create_experiment returns the new experiment's ID)
exp = client.create_experiment("test_exp")

# create run in the experiment created above
run = client.create_run(experiment_id=exp)

# first attempt to log params
params = {
    "item_1": 55,
    "test_param": 101
}
params_arr = [Param(key, str(value)) for key, value in params.items()]
client.log_batch(
    run_id=run.info.run_id,
    params=params_arr
)

# second attempt to log params with modified value for test_param
params = {
    "item_1": 55,
    "test_param": 100,
    "new": 5
}
params_arr = [Param(key, str(value)) for key, value in params.items()]
client.log_batch(
    run_id=run.info.run_id,
    params=params_arr
)
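As a stopgap in user code, the conflict can be detected before the second `log_batch` call. The following is a minimal sketch, not part of MLflow: `find_param_conflicts` is a hypothetical helper, and the two dictionaries simply reuse the values from the reproduction. In a real script the already logged values would presumably come from `client.get_run(run_id).data.params`.

```python
from typing import Dict, List, Tuple


def find_param_conflicts(
    existing: Dict[str, str], incoming: Dict[str, object]
) -> List[Tuple[str, str, str]]:
    """Return (key, logged_value, new_value) for every key whose value would change.

    Hypothetical helper for illustration only; it is not an MLflow API.
    """
    conflicts = []
    for key, value in incoming.items():
        new_value = str(value)  # params are logged as strings, as in the repro
        if key in existing and existing[key] != new_value:
            conflicts.append((key, existing[key], new_value))
    return conflicts


# Values taken from the reproduction: the first batch is already logged.
already_logged = {"item_1": "55", "test_param": "101"}
second_batch = {"item_1": 55, "test_param": 100, "new": 5}

conflicts = find_param_conflicts(already_logged, second_batch)
print(conflicts)  # [('test_param', '101', '100')]
```

Re-logging `item_1` with the same value is not reported, and `new` is not reported because it has not been logged yet; only `test_param` changes value.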
Other info / logs
---------------------------------------------------------------------------
MaxRetryError Traceback (most recent call last)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/requests/adapters.py:489, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
488 if not chunked:
--> 489 resp = conn.urlopen(
490 method=request.method,
491 url=url,
492 body=request.body,
493 headers=request.headers,
494 redirect=False,
495 assert_same_host=False,
496 preload_content=False,
497 decode_content=False,
498 retries=self.max_retries,
499 timeout=timeout,
500 )
502 # Send the request.
503 else:
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/connectionpool.py:876, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
875 log.debug("Retry: %s", url)
--> 876 return self.urlopen(
877 method,
878 url,
879 body,
880 headers,
881 retries=retries,
882 redirect=redirect,
883 assert_same_host=assert_same_host,
884 timeout=timeout,
885 pool_timeout=pool_timeout,
886 release_conn=release_conn,
887 chunked=chunked,
888 body_pos=body_pos,
889 **response_kw
890 )
892 return response
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/connectionpool.py:876, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
875 log.debug("Retry: %s", url)
--> 876 return self.urlopen(
877 method,
878 url,
879 body,
880 headers,
881 retries=retries,
882 redirect=redirect,
883 assert_same_host=assert_same_host,
884 timeout=timeout,
885 pool_timeout=pool_timeout,
886 release_conn=release_conn,
887 chunked=chunked,
888 body_pos=body_pos,
889 **response_kw
890 )
892 return response
[... skipping similar frames: HTTPConnectionPool.urlopen at line 876 (2 times)]
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/connectionpool.py:876, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
875 log.debug("Retry: %s", url)
--> 876 return self.urlopen(
877 method,
878 url,
879 body,
880 headers,
881 retries=retries,
882 redirect=redirect,
883 assert_same_host=assert_same_host,
884 timeout=timeout,
885 pool_timeout=pool_timeout,
886 release_conn=release_conn,
887 chunked=chunked,
888 body_pos=body_pos,
889 **response_kw
890 )
892 return response
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/connectionpool.py:866, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
865 try:
--> 866 retries = retries.increment(method, url, response=response, _pool=self)
867 except MaxRetryError:
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/urllib3/util/retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
591 if new_retry.is_exhausted():
--> 592 raise MaxRetryError(_pool, url, error or ResponseError(cause))
594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
MaxRetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/log-batch (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
RetryError Traceback (most recent call last)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/utils/rest_utils.py:151, in http_request(host_creds, endpoint, method, max_retries, backoff_factor, retry_codes, timeout, **kwargs)
150 try:
--> 151 return _get_http_response_with_retries(
152 method,
153 url,
154 max_retries,
155 backoff_factor,
156 retry_codes,
157 headers=headers,
158 verify=verify,
159 timeout=timeout,
160 **kwargs,
161 )
162 except Exception as e:
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/utils/rest_utils.py:91, in _get_http_response_with_retries(method, url, max_retries, backoff_factor, retry_codes, **kwargs)
90 session = _get_request_session(max_retries, backoff_factor, retry_codes)
---> 91 return session.request(method, url, **kwargs)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/requests/sessions.py:587, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
586 send_kwargs.update(settings)
--> 587 resp = self.send(prep, **send_kwargs)
589 return resp
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/requests/sessions.py:701, in Session.send(self, request, **kwargs)
700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
703 # Total elapsed time of the request (approximately)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/requests/adapters.py:556, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
555 if isinstance(e.reason, ResponseError):
--> 556 raise RetryError(e, request=request)
558 if isinstance(e.reason, _ProxyError):
RetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/log-batch (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
MlflowException Traceback (most recent call last)
Input In [13], in <cell line: 9>()
1 params = {
2 "new_value": 55,
3 "test_param": 100,
4 "new": 5
5 }
7 params_arr = [Param(key, str(value)) for key, value in params.items()]
----> 9 client.log_batch(
10 run_id=run.info.run_id,
11 params=params_arr
12 )
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/tracking/client.py:918, in MlflowClient.log_batch(self, run_id, metrics, params, tags)
861 def log_batch(
862 self,
863 run_id: str,
(...)
866 tags: Sequence[RunTag] = (),
867 ) -> None:
868 """
869 Log multiple metrics, params, and/or tags.
870
(...)
916 status: FINISHED
917 """
--> 918 self._tracking_client.log_batch(run_id, metrics, params, tags)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py:315, in TrackingServiceClient.log_batch(self, run_id, metrics, params, tags)
312 metrics_batch = metrics[:metrics_batch_size]
313 metrics = metrics[metrics_batch_size:]
--> 315 self.store.log_batch(
316 run_id=run_id, metrics=metrics_batch, params=params_batch, tags=tags_batch
317 )
319 for metrics_batch in chunk_list(metrics, chunk_size=MAX_METRICS_PER_BATCH):
320 self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py:309, in RestStore.log_batch(self, run_id, metrics, params, tags)
305 tag_protos = [tag.to_proto() for tag in tags]
306 req_body = message_to_json(
307 LogBatch(metrics=metric_protos, params=param_protos, tags=tag_protos, run_id=run_id)
308 )
--> 309 self._call_endpoint(LogBatch, req_body)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
54 endpoint, method = _METHOD_TO_INFO[api]
55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/utils/rest_utils.py:253, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
249 response = http_request(
250 host_creds=host_creds, endpoint=endpoint, method=method, params=json_body
251 )
252 else:
--> 253 response = http_request(
254 host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
255 )
256 response = verify_rest_response(response, endpoint)
257 js_dict = json.loads(response.text)
File ~/miniconda3-intel/envs/mlflow/lib/python3.8/site-packages/mlflow/utils/rest_utils.py:163, in http_request(host_creds, endpoint, method, max_retries, backoff_factor, retry_codes, timeout, **kwargs)
151 return _get_http_response_with_retries(
152 method,
153 url,
(...)
160 **kwargs,
161 )
162 except Exception as e:
--> 163 raise MlflowException("API request to %s failed with exception %s" % (url, e))
MlflowException: API request to http://localhost:5000/api/2.0/mlflow/runs/log-batch failed with exception HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/2.0/mlflow/runs/log-batch (Caused by ResponseError('too many 500 error responses'))
What component(s) does this bug affect?
- [ ] `area/artifacts`: Artifact stores and artifact logging
- [ ] `area/build`: Build and test infrastructure for MLflow
- [ ] `area/docs`: MLflow documentation pages
- [ ] `area/examples`: Example code
- [ ] `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] `area/models`: MLmodel format, model serialization/deserialization, flavors
- [ ] `area/pipelines`: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
- [ ] `area/projects`: MLproject format, project running backends
- [ ] `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- [ ] `area/server-infra`: MLflow Tracking server backend
- [X] `area/tracking`: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] `area/windows`: Windows support
What language(s) does this bug affect?
- [ ] `language/r`: R APIs and clients
- [ ] `language/java`: Java APIs and clients
- [ ] `language/new`: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] `integrations/azure`: Azure and Azure ML integrations
- [ ] `integrations/sagemaker`: SageMaker integrations
- [ ] `integrations/databricks`: Databricks integrations
Makes sense! @Mathanraj-Sharma Would you create a PR to fix it? I will help review.
@WeichenXu123 yes I can create a PR with a fix, please assign it to me
@WeichenXu123 I am not able to push my branch. Could you please give me permission?
$ git push origin mathan/fix/issue-616
remote: Permission to mlflow/mlflow.git denied to Mathanraj-Sharma.
fatal: unable to access 'https://github.com/mlflow/mlflow.git/': The requested URL returned error: 403
@Mathanraj-Sharma You cannot push to `mlflow/mlflow`. Can you create a fork, push commits there, and create a PR?
@harupy thanks will do it
@WeichenXu123 I have created PR #6184 with implementation to fix this bug.
I am not able to add you as a reviewer in my PR, do I need any special permission to do it?
@Mathanraj-Sharma @WeichenXu123 @harupy I've closed https://github.com/mlflow/mlflow/pull/6184 because I think we need a different approach. Instead of adding client-side validation, can we trace the source of the 500-level error being returned by `mlflow server` and see if we can change the exception to a 400-level error?
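To illustrate the general idea of returning a 400-level error instead of a 500, here is a self-contained toy sketch. It does not use or describe MLflow's actual server code: `ValidationError`, `log_params`, and `handle_log_batch` are hypothetical names, and the store is a plain dictionary. The point is only that an error caused by bad client input is caught in the handler and mapped to a 400 response carrying the message, so the client does not retry and does not lose the explanation.

```python
import json
from http import HTTPStatus


class ValidationError(Exception):
    """Hypothetical stand-in for an error caused by bad client input."""


def log_params(store: dict, params: dict) -> None:
    """Toy store: reject any attempt to change an already logged param."""
    # Validate the whole batch first so a rejected batch changes nothing.
    for key, value in params.items():
        if key in store and store[key] != value:
            raise ValidationError(
                f"Changing param values is not allowed. Param with key='{key}' "
                f"was already logged with value='{store[key]}'. "
                f"Attempted logging new value '{value}'."
            )
    store.update(params)


def handle_log_batch(store: dict, params: dict):
    """Return (status_code, json_body) instead of letting the error become a 500."""
    try:
        log_params(store, params)
    except ValidationError as exc:
        body = {"error_code": "INVALID_PARAMETER_VALUE", "message": str(exc)}
        return HTTPStatus.BAD_REQUEST.value, json.dumps(body)
    return HTTPStatus.OK.value, json.dumps({})


store = {}
first = handle_log_batch(store, {"item_1": "55", "test_param": "101"})
second = handle_log_batch(store, {"item_1": "55", "test_param": "100", "new": "5"})
print(first[0], second[0])
print(second[1])
```

In this sketch the second call yields status 400 with the "Changing param values is not allowed" message in the body, and the rejected batch leaves the store untouched.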
@dbczumar could you please point out references to start fixing this from the server side?
This also seems to apply to other validation problems, e.g. when a parameter value is longer than 250 characters. But this only becomes apparent if you change the server to local:
mlflow.exceptions.MlflowException: Param value [MASKED] had length 301, which exceeded length limit of 250
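Until the server reports this clearly, a caller could check value lengths before logging. This is a rough sketch under assumptions: `too_long_params` is a hypothetical helper, not an MLflow function, and the limit of 250 is taken only from the error message above, so it may differ in other MLflow versions.

```python
MAX_PARAM_VALUE_LENGTH = 250  # limit quoted in the error message above (assumed)


def too_long_params(params: dict, limit: int = MAX_PARAM_VALUE_LENGTH) -> dict:
    """Map each offending key to the length of its stringified value."""
    return {
        key: len(str(value))
        for key, value in params.items()
        if len(str(value)) > limit
    }


params = {"short": "ok", "long": "x" * 301}
print(too_long_params(params))  # {'long': 301}
```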