wrapper-validation-action

Action fails regularly due to ETIMEDOUT and ECONNRESET

Open ZacSweers opened this issue 3 years ago • 30 comments

Example runs:

https://github.com/square/anvil/pull/266/checks?check_run_id=2589215352

https://github.com/square/anvil/pull/266/checks?check_run_id=2589215611

I've seen this flaky behavior fairly often in the past few weeks. I'm not sure what else is going on, so I'm filing this as an FYI.

ZacSweers avatar May 15 '21 06:05 ZacSweers

This may be the same as #33

Hopefully, #39 having been merged will resolve this. @eskatos can you perform a release to see if that helps resolve this issue for our users?

JLLeitschuh avatar May 18 '21 15:05 JLLeitschuh

Is there anything else needed for a release that I could help with? This makes most of our workflows unusable.

ZacSweers avatar May 25 '21 03:05 ZacSweers

@ZacSweers I believe that you can try out this action from a commit hash. You may want to give that a shot as a stopgap?

JLLeitschuh avatar May 25 '21 14:05 JLLeitschuh

Using ef08c6885017f258a11d59e0da103ed39424aa6b appears to resolve things for us. I'd recommend a new 1.x release tag to de-flake things for folks; we were definitely considering dropping this otherwise, and I'm not sure how willing folks are to point at a direct SHA.

ZacSweers avatar May 27 '21 18:05 ZacSweers

Should be published now as v1

JLLeitschuh avatar May 28 '21 14:05 JLLeitschuh

Thanks!

ZacSweers avatar May 28 '21 14:05 ZacSweers

We're still seeing this unfortunately, albeit less often, and now it just shows up as this:

Run gradle/wrapper-validation-action@v1
Error: read ECONNRESET

ZacSweers avatar Jun 24 '21 05:06 ZacSweers

Here's an example run https://github.com/ZacSweers/MoshiX/pull/128/checks?check_run_id=2921425588

ZacSweers avatar Jun 26 '21 15:06 ZacSweers

This happens pretty consistently across the projects I work on. Unfortunately, I think we're going to have to remove this action as a result, since it's a reliability issue.

ZacSweers avatar Jul 16 '21 02:07 ZacSweers

Unfortunately, we don't have enough information at this time to understand what's causing this issue.

Are you using self-hosted runners, or runners hosted by GH?

JLLeitschuh avatar Aug 30 '21 14:08 JLLeitschuh

I see this often on GH hosted runners, often in the square/anvil repo

ZacSweers avatar Aug 30 '21 15:08 ZacSweers

@eskatos is there any way to add additional log output on failure so that we can work on understanding the root cause?

JLLeitschuh avatar Aug 30 '21 17:08 JLLeitschuh

This is the error that I'm getting: Error: connect ETIMEDOUT 104.18.164.99:443

nkvaratskhelia avatar Sep 21 '21 11:09 nkvaratskhelia

GitHub-hosted runners here. When the action fails with this error, it fails across all active runs around the same time. About 30 minutes ago, 3 runs failed simultaneously. I retried each about 20 minutes ago and they all passed.

jameswald avatar Sep 21 '21 13:09 jameswald

GitHub-hosted runners here. When the action fails with this error, it fails across all active runs around the same time. About 30 minutes ago, 3 runs failed simultaneously. I retried each about 20 minutes ago and they all passed.

That seems like something that absolutely indicates a Cloudflare issue.

JLLeitschuh avatar Sep 22 '21 14:09 JLLeitschuh

Okay, all this finally sent me down the right path; I think I may have finally figured out what's going on here. It looks like our Cloudflare WAF is being triggered randomly every once in a while, and when it is, it causes a bunch of users' connections to fail. I need to talk to @eskatos about how we want to mitigate this issue. Thanks, everyone, for helping us figure out what was going wrong here.

(Screenshot of Cloudflare WAF events, 2021-09-22)

JLLeitschuh avatar Sep 22 '21 14:09 JLLeitschuh

The fix has been implemented.

Please let us know if any of you continue to experience these problems. I hope this will fix the issue, but we have some additional things we can fiddle with if this continues to be a problem.


FOR INTERNAL TRACKING (not public): https://github.com/gradle/gradle-private/issues/3435

JLLeitschuh avatar Sep 23 '21 17:09 JLLeitschuh

Facing a similar issue. A two-line change to a class causes failures with these actions in the following runs:

  1. https://github.com/AY2122S1-CS2103-T14-2/tp/actions/runs/1386259288/attempts/2 Here, it shows ETIMEDOUT
  2. https://github.com/AY2122S1-CS2103-T14-2/tp/actions/runs/1386259288/attempts/1 Here, it shows Client network socket disconnected before secure TLS connection was established

jivesh avatar Oct 26 '21 16:10 jivesh

Seeing the same issue here.

https://github.com/MinimallyCorrect/Mixin/runs/4041503110?check_suite_focus=true

Can the team publish a single file with all the hashes, instead of having the action fetch hundreds of files, one per hash? These failures are only going to become more frequent, since the number of requests needed goes up with every release.

https://github.com/gradle/wrapper-validation-action/blob/84d7e182ae7c7a37f200c184f64038fb0e62dd7d/src/checksums.ts#L28

LunNova avatar Oct 29 '21 00:10 LunNova

It's not possible for us to know what version you have locally, so we have to fetch all of them.

I'll take a look at our Cloudflare logs and see if this is being caused by our infrastructure/firewall. Thanks for the ping 🙂

JLLeitschuh avatar Oct 29 '21 12:10 JLLeitschuh
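To make the request pattern concrete, here is a simplified TypeScript sketch of this approach. It is not the action's actual implementation; it assumes Node 18+ for the global fetch, and the fetchAllWrapperChecksums/validateWrapper names are illustrative.

import * as crypto from 'crypto'
import * as fs from 'fs'

interface GradleVersion {
  version: string
  wrapperChecksumUrl?: string
}

// Download the checksum file published for every known Gradle version.
// This is the "hundreds of small requests" cost discussed above.
async function fetchAllWrapperChecksums(): Promise<Set<string>> {
  const versions: GradleVersion[] = await (
    await fetch('https://services.gradle.org/versions/all')
  ).json()
  const urls = versions
    .map(v => v.wrapperChecksumUrl)
    .filter((u): u is string => Boolean(u))
  const checksums = await Promise.all(
    urls.map(async url => (await (await fetch(url)).text()).trim())
  )
  return new Set(checksums)
}

// Compare the SHA-256 of a local gradle-wrapper.jar against all known checksums.
async function validateWrapper(jarPath: string): Promise<boolean> {
  const known = await fetchAllWrapperChecksums()
  const localSha256 = crypto
    .createHash('sha256')
    .update(fs.readFileSync(jarPath))
    .digest('hex')
  return known.has(localSha256)
}

Because the local wrapper version is unknown, every published checksum has to be fetched before the comparison can happen, which multiplies the chances of hitting a transient ETIMEDOUT or ECONNRESET on any given run.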

Also ran into this right now (and yesterday), re-triggered the job, then it worked:

Run gradle/[email protected]
  with:
    min-wrapper-count: 1
    allow-snapshots: false
Error: Client network socket disconnected before secure TLS connection was established

GitHub hosted action... Let me know if I can provide any more data that helps with this!

codecholeric avatar Oct 29 '21 16:10 codecholeric

@JLLeitschuh I was thinking of adding the checksum inline to https://services.gradle.org/versions/all:

{
  "version" : "7.3-20211027231204+0000",
  "buildTime" : "20211027231204+0000",
  "current" : false,
  "snapshot" : true,
  "nightly" : false,
  "releaseNightly" : true,
  "activeRc" : false,
  "rcFor" : "",
  "milestoneFor" : "",
  "broken" : false,
  "downloadUrl" : "https://services.gradle.org/distributions-snapshots/gradle-7.3-20211027231204+0000-bin.zip",
  "checksumUrl" : "https://services.gradle.org/distributions-snapshots/gradle-7.3-20211027231204+0000-bin.zip.sha256",
  "wrapperChecksumUrl" : "https://services.gradle.org/distributions-snapshots/gradle-7.3-20211027231204+0000-wrapper.jar.sha256",
  "wrapperChecksum": "33ad4583fd7ee156f533778736fa1b4940bd83b433934d1cc4e9f608e99a6a89"
  // (The checksum would actually be shorter than the URL for where to go fetch it. ;))
},

Since the only field that gets used at the moment is the wrapper checksum, it might even be worth making a more specialized endpoint which is just a list of all wrapper checksums.

I have no idea where the code that generates/serves these is.

LunNova avatar Oct 29 '21 16:10 LunNova
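For comparison, if the checksum were published inline as proposed above, the whole lookup could collapse into a single request. The sketch below is hypothetical: the wrapperChecksum field does not exist in the current versions/all response, and the same Node 18+ fetch assumption applies.

// Hypothetical: consumes a wrapperChecksum field that versions/all does not
// currently provide; shown only to illustrate the proposal above.
async function fetchInlineWrapperChecksums(): Promise<Set<string>> {
  const versions: Array<{ wrapperChecksum?: string }> = await (
    await fetch('https://services.gradle.org/versions/all')
  ).json()
  return new Set(
    versions
      .map(v => v.wrapperChecksum)
      .filter((c): c is string => Boolean(c))
  )
}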

I've been running into this issue occasionally ever since I integrated this action, but today it's been happening like 60% of the time on macOS on CI (CI also runs on Windows and Linux, but both seem fine).

I recently upgraded to Gradle 7, in case that's relevant.

gnarea avatar Nov 05 '21 11:11 gnarea

So, I've checked, and it's not our WAF causing these issues. I'm not certain what else could be causing them.

JLLeitschuh avatar Nov 05 '21 17:11 JLLeitschuh

Running into the same issue today. Any updates on this?

nhouser9 avatar Nov 14 '21 20:11 nhouser9

Also seeing this issue a few times every day; retrying tends to work straight away.

2021-11-24T11:22:49,356896107+00:00

https://github.com/vector-im/element-android/actions/workflows/gradle-wrapper-validation.yml?query=is%3Afailure

ouchadam avatar Nov 24 '21 11:11 ouchadam

Seeing this a lot on Paparazzi builds, mainly with Windows workers. Example run: https://github.com/cashapp/paparazzi/runs/4316309670?check_suite_focus=true

jrodbx avatar Nov 24 '21 20:11 jrodbx

Is there any further update on this? It keeps failing sporadically on both windows-2022 and ubuntu-20.04 action runs.

The-Code-Monkey avatar Dec 04 '21 14:12 The-Code-Monkey

I also keep getting CI failures due to this issue: Error: connect ETIMEDOUT 104.18.165.99:443. This might be silly, but since relaunching usually fixes it, I wonder whether simply allowing three retries or so could help. Connection timeouts are bound to happen occasionally; unless the destination is truly unreachable, it may make sense not to fail immediately.

DanySK avatar Dec 04 '21 22:12 DanySK

We do have retry logic enabled. https://github.com/gradle/wrapper-validation-action/blob/84d7e182ae7c7a37f200c184f64038fb0e62dd7d/src/checksums.ts#L6

That being said, I have no evidence that it's actually working. A PR from the community to improve debug logging would be openly welcomed, especially if it were implemented so that the additional logging is only printed when the build is going to fail anyway; I'd prefer not to make the action chattier than it needs to be when it's going to pass. I think the biggest problem we currently have is a severe lack of visibility, which makes it really difficult to figure out a root cause for these issues.

JLLeitschuh avatar Dec 08 '21 18:12 JLLeitschuh
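As an illustration of the kind of change being discussed, and not the action's existing code, a retry helper could back off between attempts and only emit diagnostic detail once every attempt has failed, keeping successful runs quiet. The fetchWithRetries name, the attempt count, and the Node 18+ fetch are assumptions.

// Sketch: retry with exponential backoff, logging details only on final failure.
async function fetchWithRetries(url: string, attempts = 3): Promise<string> {
  const errors: unknown[] = []
  for (let i = 0; i < attempts; i++) {
    try {
      const response = await fetch(url)
      if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`)
      return await response.text()
    } catch (err) {
      errors.push(err)
      if (i < attempts - 1) {
        // Back off 1s, 2s, 4s, ... before the next attempt.
        await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** i))
      }
    }
  }
  // The step is failing anyway, so now is the time to be verbose.
  console.error(`All ${attempts} attempts to fetch ${url} failed:`, errors)
  throw errors[errors.length - 1]
}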