lithops icon indicating copy to clipboard operation
lithops copied to clipboard

Catch HTTP 409 error for Code Engine

Open sergii-mamedov opened this issue 1 year ago • 4 comments

Recently, I have been seeing frequent HTTP 409 errors when processing data in bulk. I spoke with the IBM customer support to clarify where the source of the problem is and how it can be fixed. So far, all I've gotten from them is this response:

This looks like that you have some application code which is interacting with a Code Engine project using the Kubernetes API endpoint. 
There is a quite famous bug in Kubernetes that concurrent create operations can lead to 409 errors of this kind. 
You will have to retry them in your application code. A retry would look like this in pseudo code:

const MAX_RETRY = 5

  main() {
    myObject = { ... }
    createWithRetry(myObject)
  }

  createWithRetry(object) {
    for (i = 0; i < MAX_RETRY; i ++) {
       sleepSeconds(i)
       try {
          return kubeClient.create(object)
       } catch (e) {
          if (!isRetryable(e) || i == MAX_RETRY - 1) {
            logError(e)
            throw e
          }
          logWarning("Retrying due to", e)
       }
    }
  }

  isRetryable(exception) {
    if (exception is ConflictError) {
      return true
    }

    // It makes sense to retry more like network failures

    return false
  }

The original stacktrace I get:

Traceback (most recent call last):
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/daemons/lithops.py", line 81, in _callback
    self._manager.annotate_lithops(ds=ds, del_first=msg.get('del_first', False))
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/daemons/dataset_manager.py", line 114, in annotate_lithops
    ServerAnnotationJob(executor, ds, perf).run()
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/annotation_lithops/annotation_job.py", line 321, in run
    self.db_formula_image_ids = self.pipe.store_images_to_s3(self.ds.id)
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/annotation_lithops/pipeline.py", line 199, in store_images_to_s3
    return store_images_to_s3(
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/annotation_lithops/store_images.py", line 67, in store_images_to_s3
    results = executor.map(_upload_png_batch, [(cobj,) for cobj in png_cobjs], runtime_memory=512)
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/annotation_lithops/executor.py", line 292, in map
    raise exc
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/annotation_lithops/executor.py", line 325, in run
    futures = executor.map(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/executors.py", line 288, in map
    futures = self.invoker.run_job(job)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 266, in run_job
    futures = self._run_job(job)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 205, in _run_job
    raise e
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'audit-id': 'a082f7b4-70d3-4708-a29e-8ec22f965220', 'cache-control': 'no-cache, private', 'content-length': '326', 'content-type': 'application/json', 'date': 'Sat, 03 Sep 2022 03:19:13 GMT', 'x-kubernetes-pf-flowschema-uid': '05dc5f92-6c95-416b-b707-da00a57e0adc', 'x-kubernetes-pf-prioritylevel-uid': '8d40cc3d-a5c4-47ee-8978-ec5f7a197920'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"5s4f5qcqf4f\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"5s4f5qcqf4f","kind":"resourcequotas"},"code":409}

Is it possible to add handling of such exceptions to lithops? I use lithops 2.7.0.

sergii-mamedov avatar Sep 07 '22 11:09 sergii-mamedov

@kpavel can you handle it? :)

gilv avatar Sep 27 '22 19:09 gilv

As far as I understand, the code responsible for the retry goes here and I need to explicitly insert the connection_retries value into the code engine config.

sergii-mamedov avatar Sep 27 '22 21:09 sergii-mamedov

Hello @sergii-mamedov, is it a full stacktrace? No other traces from code_engine level?

regarding connection_retries - yes, you need to specify in the config file retries number, e.g. 10.

the issue here is that we are currently handling connection retries when 409 exception happens with the reason "AlreadyExists", while the one you get has a reason "Conflict". We can add another reason in the filtering to let it retry.

kpavel avatar Sep 28 '22 12:09 kpavel

@kpavel

Traceback (most recent call last):
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/annotation_lithops/executor.py", line 325, in run
    futures = executor.map(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/executors.py", line 277, in map
    futures = self.invoker.run_job(job)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 269, in run_job
    futures = self._run_job(job)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 208, in _run_job
    raise e
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 205, in _run_job
    self._invoke_job(job)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 252, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/serverless.py", line 59, in invoke
    return self.backend.invoke(runtime_name, runtime_memory, job_payload)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/backends/code_engine/code_engine.py", line 406, in invoke
    self._run_job(jobrun_res)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/backends/code_engine/code_engine.py", line 48, in decorated_func
    return func(*args, **kwargs)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/backends/code_engine/code_engine.py", line 414, in _run_job
    self.custom_api.create_namespaced_custom_object(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 344, in create_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/rest.py", line 274, in POST
    return self.request("POST", url,
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'audit-id': 'a082f7b4-70d3-4708-a29e-8ec22f965220', 'cache-control': 'no-cache, private', 'content-length': '326', 'content-type': 'application/json', 'date': 'Sat, 03 Sep 2022 03:19:13 GMT', 'x-kubernetes-pf-flowschema-uid': '05dc5f92-6c95-416b-b707-da00a57e0adc', 'x-kubernetes-pf-prioritylevel-uid': '8d40cc3d-a5c4-47ee-8978-ec5f7a197920'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"5s4f5qcqf4f\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"5s4f5qcqf4f","kind":"resourcequotas"},"code":409}


sergii-mamedov avatar Sep 28 '22 21:09 sergii-mamedov

@sergii-mamedov Does #995 solve your issue?

JosepSampe avatar Oct 03 '22 12:10 JosepSampe

@JosepSampe Sorry for a long answer. I deployed this changes on production. I will get back to you in a couple of days with an answer.

sergii-mamedov avatar Oct 10 '22 08:10 sergii-mamedov

@kpavel @JosepSampe Unfortunately, the patch didn't change the situation (which is a bit strange for me). I added some logging, stacktrace below:

2022-10-13 18:36:00,920 - INFO - lithops.invokers[Thread-1-ex] - invokers.py:172 - ExecutorID 77ece4-14 | JobID M005 - Starting function invocation: process_centr_segment() - Total: 277 activations
2022-10-13 18:36:01,360 - INFO - lithops.serverless.backends.code_engine.code_engine[Thread-1-ex] - code_engine.py:54 - connection_retries=10
2022-10-13 18:36:01,361 - INFO - lithops.serverless.backends.code_engine.code_engine[Thread-1-ex] - code_engine.py:58 - HTTP409
2022-10-13 18:36:11,376 - ERROR - engine.lithops-wrapper[Thread-1] - executor.py:286 - process_centr_segment raised an exception. Failed activation(s): [] ID(s): []
Traceback (most recent call last):
  File "/opt/dev/metaspace/metaspace/engine/sm/engine/annotation_lithops/executor.py", line 325, in run
    futures = executor.map(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/executors.py", line 288, in map
    futures = self.invoker.run_job(job)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 266, in run_job
    futures = self._run_job(job)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 205, in _run_job
    raise e
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 202, in _run_job
    self._invoke_job(job)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/invokers.py", line 249, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/serverless.py", line 59, in invoke
    return self.backend.invoke(runtime_name, runtime_memory, job_payload)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/backends/code_engine/code_engine.py", line 396, in invoke
    self._run_job(jobrun_res)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/backends/code_engine/code_engine.py", line 67, in decorated_func
    raise e
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/backends/code_engine/code_engine.py", line 52, in decorated_func
    return func(*args, **kwargs)
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/lithops/serverless/backends/code_engine/code_engine.py", line 404, in _run_job
    self.custom_api.create_namespaced_custom_object(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 344, in create_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/rest.py", line 274, in POST
    return self.request("POST", url,
  File "/opt/dev/miniconda3/envs/sm38/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'audit-id': 'd4735826-f7ab-4244-91ec-43b59d77b169', 'cache-control': 'no-cache, private', 'content-length': '326', 'content-type': 'application/json', 'date': 'Thu, 13 Oct 2022 16:36:01 GMT', 'x-kubernetes-pf-flowschema-uid': '05dc5f92-6c95-4$
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"5s4f5qcqf4f\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","det$

sergii-mamedov avatar Oct 14 '22 08:10 sergii-mamedov

@sergii-mamedov Can you please either change lithops logging level to DEBUG or modify this line in your code to write warning level logs.

https://github.com/kpavel/lithops/blob/fa8067d5dbd1429eebedd58513f34b14cfd4b14e/lithops/serverless/backends/code_engine/code_engine.py#L57

E.g.

logger.warning("Encountered conflict error {}, ignoring".format(body.get('message')))

Could be we are actually running out of retries, but don't notice because this log level is debug. I'll update this log later in the code as well to warning.

kpavel avatar Oct 14 '22 15:10 kpavel

@kpavel Did it, I will let you know when this problem occurs again.

sergii-mamedov avatar Oct 17 '22 08:10 sergii-mamedov

@kpavel I understood what the problem was. There is no condition in the code to handle the block for HTTP 409. You should either add `continue' to this block

if e.status == 409:
    some_logic
    continue
if e.status == 500:
    some_logic
else:
    raise e

or perhaps it is better to make a solid if - elif - else block:

if e.status == 409:
    some_logic
elif e.status == 500:
    some_logic
else:
    raise e

I tested first option - works well.

sergii-mamedov avatar Oct 18 '22 08:10 sergii-mamedov

@sergii-mamedov Fixed

JosepSampe avatar Oct 19 '22 10:10 JosepSampe

@JosepSampe Thanks, I will be grateful for the release of a new version when possible.

sergii-mamedov avatar Oct 19 '22 10:10 sergii-mamedov

@sergii-mamedov @JosepSampe I will draft new relase

gilv avatar Oct 19 '22 22:10 gilv