
Add exponential backoff and retry functionality to rest_client

Open jp-harvey opened this issue 2 years ago • 19 comments

While working with the Atlassian Cloud API and trying to optimize the time it takes to perform many operations, it's common to hit the Atlassian API rate limits, resulting in failed requests and annoying workaround engineering.

This PR adds a convenience option to the underlying REST client, turned off by default, to retry using exponential backoff.
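For context, here is a minimal usage sketch of the proposed option. The parameter names (backoff_and_retry, max_backoff_retries, max_backoff_seconds) follow the diff discussed later in this thread, but passing them straight through a product constructor like Jira is my assumption and may differ in the final form:

    from atlassian import Jira

    # Hypothetical usage of the options this PR adds to the REST client;
    # backoff/retry stays disabled unless explicitly turned on.
    jira = Jira(
        url="https://example.atlassian.net",
        username="user@example.com",
        password="api-token",
        backoff_and_retry=True,   # off by default
        max_backoff_retries=10,   # stop retrying after this many attempts
        max_backoff_seconds=60,   # cap on the exponentially growing delay
    )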

jp-harvey avatar Dec 24 '21 21:12 jp-harvey

@gonchik I haven't tested this quite enough yet to drop the [WIP] label; I'll update once I have.

jp-harvey avatar Dec 24 '21 21:12 jp-harvey

Codecov Report

Merging #904 (c25ea52) into master (4d90a20) will increase coverage by 0.03%. The diff coverage is 47.82%.


@@            Coverage Diff             @@
##           master     #904      +/-   ##
==========================================
+ Coverage   36.19%   36.22%   +0.03%     
==========================================
  Files          32       32              
  Lines        6335     6357      +22     
  Branches      978      983       +5     
==========================================
+ Hits         2293     2303      +10     
- Misses       3934     3945      +11     
- Partials      108      109       +1     
Impacted Files             Coverage Δ
atlassian/rest_client.py   65.68% <47.82%> (-3.03%) ↓

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4d90a20...c25ea52.

codecov-commenter avatar Dec 25 '21 16:12 codecov-commenter

@jp-harvey I checked a few options; this is a great step for the rest client module :) Thank you for your initiative, I hope it will be done soon.

gonchik avatar Dec 25 '21 18:12 gonchik

Another change I made was to add these new options to BitbucketBase._new_session_args. At least my assumption was that if you set backoff_and_retry=True when creating the Cloud object you would expect all child objects to inherit the setting.

flichtenheld avatar Jan 19 '22 14:01 flichtenheld

The actual functionality looks good to me. I have a Jenkins job that runs into the rate limiting, and this patch together with my modifications seems to solve the problem.

flichtenheld avatar Jan 19 '22 14:01 flichtenheld

Great suggestion @flichtenheld , I've been working with Jira and not with Confluence so thanks for adding that in, glad it helped you out :-)

jp-harvey avatar Jan 20 '22 16:01 jp-harvey

> Another change I made was to add these new options to BitbucketBase._new_session_args. At least my assumption was that if you set backoff_and_retry=True when creating the Cloud object you would expect all child objects to inherit the setting.

@flichtenheld do you want to add those as a review to this PR as well?

jp-harvey avatar Jan 20 '22 16:01 jp-harvey

> Another change I made was to add these new options to BitbucketBase._new_session_args. At least my assumption was that if you set backoff_and_retry=True when creating the Cloud object you would expect all child objects to inherit the setting.
>
> @flichtenheld do you want to add those as a review to this PR as well?

Hmm, not sure what would be the best way to do that given that they are in a completely different file that is not touched by the PR. But you can see them here: https://github.com/flichtenheld/atlassian-python-api/commit/f94cef60fd313bfb6721e921c0f4609b71114046
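A simplified sketch of what that commit does, under my assumption about the shape of BitbucketBase._new_session_args (the exact keys are in the linked commit; the retry attribute names mirror this PR):

    # Hypothetical, simplified sketch: BitbucketBase forwards the retry
    # options through _new_session_args, so child objects created from a
    # Cloud instance inherit the setting.
    @property
    def _new_session_args(self):
        return dict(
            session=self._session,
            backoff_and_retry=self.backoff_and_retry,
            max_backoff_retries=self.max_backoff_retries,
            max_backoff_seconds=self.max_backoff_seconds,
        )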

flichtenheld avatar Jan 20 '22 18:01 flichtenheld

Hi everyone, what are the requirements for this pull request to be merged? Do you have an ETA?

mykytoleg avatar May 02 '22 13:05 mykytoleg

@mykytoleg please check the review remarks.

Spacetown avatar May 02 '22 13:05 Spacetown

@gonchik there has been no additional feedback from reviewers for some time following the changes and responses, and I think this one is probably good to go?

jp-harvey avatar Aug 01 '22 16:08 jp-harvey

@jp-harvey @flichtenheld can we update the PR and merge it?

gonchik avatar Aug 23 '22 08:08 gonchik

> @jp-harvey @flichtenheld can we update the PR and merge it?

@gonchik I've made some of the changes; the others I'm not comfortable making without testing, and I don't have a good way to test at the moment. Functionally it works as-is and has been tested many times in real-life scenarios; it's just the code structure that is the concern now.

EDIT: looks like there's a linting issue as well which should be easy enough to sort out.

jp-harvey avatar Aug 24 '22 15:08 jp-harvey

> @jp-harvey @flichtenheld can we update the PR and merge it?
>
> @gonchik I've made some of the changes; the others I'm not comfortable making without testing, and I don't have a good way to test at the moment. Functionally it works as-is and has been tested many times in real-life scenarios; it's just the code structure that is the concern now.
>
> EDIT: looks like there's a linting issue as well which should be easy enough to sort out.

It works, but it's weird and not easy to understand. If the max backoff retry limit is reached but the response was a success, a warning is logged.

        backoff = 1
        retries = 0
        responseloop = True
        while responseloop:
            response = self._session.request(
                method=method,
                url=url,
                headers=headers,
                data=data,
                json=json,
                timeout=self.timeout,
                verify=self.verify_ssl,
                files=files,
                proxies=self.proxies,
            )
            if self.backoff_and_retry:
                for em in self.retry_error_matches:
                    if response.status_code == em[0] and response.reason == em[1]:
                        if retries > self.max_backoff_retries:
                            log.warning("Hit max backoff retry limit of {0}, no more retries.".format(self.max_backoff_retries))
                            responseloop = False
                        else:
                            log.warning('Backing off due to error "{0}: {1}" for {2}s'.format(em[0], em[1], backoff))
                            time.sleep(backoff)
                            backoff = backoff * 2 if backoff * 2 < self.max_backoff_seconds else self.max_backoff_seconds
                            retries += 1
                        break  # for loop
            else:
                responseloop = False

Spacetown avatar Sep 02 '22 17:09 Spacetown

The for loop and the if can also be combined using any() together with tuple comparison.

            if self.backoff_and_retry:
                # Note: this fragment also needs `import random` at module level.
                # Compare (status_code, reason) tuples instead of an explicit for loop.
                if any((response.status_code, response.reason) == em for em in self.retry_error_matches):
                    if retries > self.max_backoff_retries:
                        log.warning("Hit max backoff retry limit of {0}, no more retries.".format(self.max_backoff_retries))
                        responseloop = False
                    else:
                        # Add up to 10% jitter so parallel clients don't retry in lockstep.
                        current_backoff = backoff + (random.random() * backoff / 10)
                        log.warning('Backing off due to error "{0}: {1}" for {2}s'.format(response.status_code, response.reason, current_backoff))
                        time.sleep(current_backoff)
                        backoff = min(backoff * 2, self.max_backoff_seconds)
                        retries += 1
                else:
                    # No retryable error matched; exit the retry loop and return this response.
                    responseloop = False
            else:
                responseloop = False

Spacetown avatar Sep 02 '22 17:09 Spacetown

Ok, now I see that there are two nested loops...

Spacetown avatar Jul 28 '23 21:07 Spacetown

@jp-harvey can we finalize it?

gonchik avatar Aug 18 '23 22:08 gonchik

> @jp-harvey can we finalize it?

Hi @gonchik I'm at a point where it's no longer 100% clear to me what the issues are or what to change. I also don't have a good way to test at the moment, so making changes and testing is going to be high-friction.

I've also discovered since then that the requests library has backoff and retry capability built into sessions, so that's probably a better option all round.

I think we have the following options:

  1. Abandon this PR
  2. Merge it (mostly?) as-is
  3. Someone other than me modifies the code, tests, and once approved it can be merged
  4. I modify the existing code, test, and once approved it gets merged
  5. I resubmit the PR using the retry capability of sessions

Unfortunately I won't have the bandwidth to do this for a while and don't have an easy way to test it at the moment, so options 4 and 5 won't happen soon.

Sorry about this. I know the code works as it is now despite not being perfect, and I respect that it's considered not suitable for merging. By the time the PR was reviewed, the window in which I was able to keep working on it had passed, and I don't want to make blind changes now that may break it. I may be able to in the future, though, and would probably do option 5 before 4 because it looks like the more elegant option. I'm also fine with options 1-3 if that's the best thing for this project. Let me know what your preference is.

jp-harvey avatar Aug 19 '23 22:08 jp-harvey

I will take a stab at resurrecting this mechanism, since I'm currently working on related code anyway. I have been using this code for years at this point and it works fine in production (the original code, not the current state of this PR, which I'm pretty sure has been patched wrongly). Still, I will try to change the mechanism to use urllib3.util.Retry as suggested, to reduce unnecessary code.
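For reference, a minimal sketch of that approach using only standard requests/urllib3 APIs (the specific numbers are illustrative, not what any eventual PR used):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util import Retry

    # Exponential backoff handled entirely by urllib3: the sleep grows
    # roughly as backoff_factor * 2**(retry number), capped by the library.
    retry = Retry(
        total=10,                     # overall retry budget
        backoff_factor=1,             # base of the exponential delay
        status_forcelist=[429, 503],  # rate limiting / service unavailable
        respect_retry_after_header=True,
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))

Mounting the adapter on the session means every request made through it, including those issued by the library's client objects, gets the retry behaviour without an explicit loop in rest_client.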

flichtenheld avatar Feb 23 '24 16:02 flichtenheld

Since #1339 was merged, I think this PR should be closed.

flichtenheld avatar Feb 27 '24 15:02 flichtenheld

@flichtenheld thanks for taking this one up and your great work on https://github.com/atlassian-api/atlassian-python-api/pull/1339!

jp-harvey avatar Feb 27 '24 16:02 jp-harvey