publish-plugin icon indicating copy to clipboard operation
publish-plugin copied to clipboard

Retry on "502 Bad Gateway" in close/release tasks

Open szpak opened this issue 4 years ago • 5 comments

@alvarosanchez from the Micronaut team suggested that - due to Sonatype Nexus performance issues - it would be good to retry (not immediately fail) on the 502 Bad Gateway errors on the close/release operations.

Sample error message from gradle-nexus-staging-plugin (as a reference):

io.codearte.gradle.nexus.functional.e2e.BasicPublishSmokeE2ESpec > should publish artifacts to explicitly created stating repository and close and release that particular repository reusing set ID FAILED
    org.gradle.api.GradleException: Build aborted because of an internal error.
        at nebula.test.functional.internal.DefaultExecutionResult.rethrowFailure(DefaultExecutionResult.groovy:97)
        at nebula.test.IntegrationSpec.runTasksSuccessfully(IntegrationSpec.groovy:149)
        at io.codearte.gradle.nexus.functional.e2e.BasicPublishSmokeE2ESpec.should publish artifacts to explicitly created stating repository and close and release that particular repository reusing set ID(BasicPublishSmokeE2ESpec.groovy:27)
        Caused by:
        org.gradle.internal.exceptions.LocationAwareException: Execution failed for task ':releaseRepository'.
            Caused by:
            org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':releaseRepository'.
                Caused by:
                io.codearte.gradle.nexus.infra.NexusHttpResponseException: 502: Bad Gateway, body: <empty>
                    Caused by:
                    groovyx.net.http.ResponseParseException: Bad Gateway
                        Caused by:
                        groovy.json.JsonException: Unable to determine the current character, it is not a string, number, array, or object
                        The current character read is '<' with an int value of 60
                        Unable to determine the current character, it is not a string, number, array, or object
                        line number 1
                        index number 0
                        <html>
                        ^

szpak avatar May 17 '20 13:05 szpak

Note that Maven Central fails quite often. We experience that every 5-10 releases or so.

I wish this retry was implemented.

alvarosanchez avatar May 27 '20 10:05 alvarosanchez

This would be fantastic to have for sure. I see that the AbstractTransitionNexusStagingRepositoryTask uses the internal BasicActionRetrier that leverages the failsafe library to do handle resiliency issues. It seems like we could make another ActionRetrier specific to handling 502 error codes and decorate the calls to the abstract transitionStagingRepo with retry logic in AbstractTransitionNexusStagingRepositoryTask.

ryandens avatar Mar 13 '21 18:03 ryandens

The approach with a decoration could work, but as BasicActionRetrier is only used there I wonder, if wouldn't be better to just try to use handle() method (of handleIf() to make it easier to detect the particular case, if needed) to just handle particular exception (as initially predicted by my comment?

I don't have extensive experience with FailSafe, so I might confuse the things, but if you @ryandens would like to give it a try, I propose to play with the handlers using unit testing first to confirm that and later on try to catch some real exception from Nexus to check how OkHttp handles that (the stacktrace above is from GNSP which was using different network library).

szpak avatar Mar 15 '21 00:03 szpak

:wave: Hey @szpak, I started to hack away at this and learn more about failsafe. Mind sharing some early feedback on my approach when you get the chance? The diff is here

This diff explores the handle approach you described above. Users of the BasicActionRetrier can restrict the retrying to a specific type of exception that works in conjunction with our stop function allowing for a flexible API for retrying. If you like this approach, I can add a more specific compat test to cover this retrying and would probably do a couple of renames/comments to differentiate the ActionRetrier used to retry the transition repository request from the ActionRetrier used to check the state of the transitioning repository

ryandens avatar Mar 27 '21 18:03 ryandens

Thanks for looking into that @ryandens! I'm glad you were able to follow the handle() approach which in my opinion nicely leverage the underlying library. Unfortunately, just going through the code, I realized that I completely forgot that 502 can be returned not just on getStagingRepositoryStateById(), but rather on close/release POST. That made your implementation quite complex. I rethought the idea, I came to the conclusion that it would be more readable to keep those two elements separated. Business retrying - waiting for transition to be completed and technical (network) retrying - on a network errors. My new idea is to:

  1. Keep business retrying as is.
  2. Add an another FailsafeExecutor with a FailsafePolicy reacting just on hardcoded network exceptions (502 at that moment - using your BadGatewayException looks good here). It should wrap all the network calls in NexusClient and repeat harcoded number of times (3?).

Aspects around the public methods in NexusClient would be probably the least intrusive, however, AspectJ in a non-managed environment (or even worse in the Gradle environment) might be somehow problematic. As a result, maybe it would be enough to have an interface NexusClient with some regular implementation (the code we have) + another one as a decorator to apply the aforementioned FailsafePolicy (with FailsafeExecutor) for all methods from the interface. WDYT?

In a case, you have some more elegant approach to solve that, please let us know.

szpak avatar Mar 31 '21 19:03 szpak