renovate icon indicating copy to clipboard operation
renovate copied to clipboard

External Host Error causing renovate to fail for monorepo

Open dbasilio opened this issue 2 years ago • 23 comments

How are you running Renovate?

Mend Renovate hosted app on github.com

If you're self-hosting Renovate, tell us what version of Renovate you run.

No response

Please select which platform you are using if self-hosting.

github.com

If you're self-hosting Renovate, tell us what version of the platform you run.

No response

Was this something which used to work for you, and then stopped?

It used to work, and then stopped

Describe the bug

We maintain a monorepo with 138 packages in them. We used to get renovate PRs for the repo, but it stopped working a while ago (not exactly sure when). I finally got around to checking the logs and it's failing with the error below. The package itself is not that important as it fails on a different package pretty frequently (usually it fails on the first one listed in devDependencies on the global package.json).

The error is not consistent as we've had three successful runs in the last 3 days, but even on successful runs we haven't been getting new PRs. I'm going to guess we're hitting some sort of limit on requests to npm and being throttled, but if that's the case, are there config options we can make to increase the timeout or paginate more or something?

Relevant debug logs

Logs
WARN: Host error
{
  "hostType": "npm",
  "packageName": "react-phone-number-input",
  "err": {
    "name": "TimeoutError",
    "code": "ETIMEDOUT",
    "timings": {
      "start": 1661790262126,
      "socket": 1661790262127,
      "lookup": 1661790262130,
      "connect": 1661790267339,
      "secureConnect": 1661790267369,
      "upload": 1661790267369,
      "response": 1661790267420,
      "error": 1661790323264,
      "abort": 1661790341747,
      "phases": {
        "wait": 1,
        "dns": 3,
        "tcp": 5209,
        "tls": 30,
        "request": 0,
        "firstByte": 51,
        "total": 79621
      }
    },
    "event": "request",
    "message": "Timeout awaiting 'request' for 60000ms",
    "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n    at ClientRequest.<anonymous> (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n    at Object.onceWrapper (events.js:520:26)\n    at ClientRequest.emit (events.js:412:35)\n    at ClientRequest.emit (domain.js:475:12)\n    at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n    at TLSSocket.socketErrorListener (_http_client.js:475:9)\n    at TLSSocket.emit (events.js:400:28)\n    at TLSSocket.emit (domain.js:475:12)\n    at emitErrorNT (internal/streams/destroy.js:106:8)\n    at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)\n    at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (internal/timers.js:559:11)\n    at processTimers (internal/timers.js:500:7)",
    "options": {
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br"
      },
      "url": "https://registry.npmjs.org/react-phone-number-input",
      "hostType": "npm",
      "username": "",
      "password": "",
      "method": "GET",
      "http2": false
    },
    "response": {
      "statusCode": 200,
      "statusMessage": "OK",
      "headers": {
        "date": "Mon, 29 Aug 2022 16:24:27 GMT",
        "content-type": "application/json",
        "transfer-encoding": "chunked",
        "connection": "close",
        "cf-ray": "7426a313295ec378-SEA",
        "access-control-allow-origin": "*",
        "age": "173",
        "cache-control": "public, max-age=300",
        "etag": "W/\"baa4594aa8274c326bcf05b7e8f487db\"",
        "last-modified": "Sat, 27 Aug 2022 08:55:56 GMT",
        "vary": "accept-encoding, accept",
        "cf-cache-status": "HIT",
        "expect-ct": "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"",
        "x-amz-replication-status": "PENDING",
        "server": "cloudflare",
        "content-encoding": "gzip"
      },
      "httpVersion": "1.1"
    }
  }
}

Have you created a minimal reproduction repository?

No reproduction repository

dbasilio avatar Aug 29 '22 16:08 dbasilio

We should analyse these timestamps above to work out if this timeout should be reduced. I'm not sure if that 60s is a timeout for the full file transfer or a timeout failing to connect, for no data received, etc.

Then we should check if we can retry such failures at least once.

rarkins avatar Aug 30 '22 04:08 rarkins

We've a similar issue on our helm charts repo

WARN: Host error
{
  "hostType": "github-releases",
  "packageName": "norwoodj/helm-docs",
  "err": {
    "name": "TimeoutError",
    "code": "ETIMEDOUT",
    "timings": {
      "start": 1661933000729,
      "socket": 1661933000945,
      "lookup": 1661933108895,
      "error": 1661933108935,
      "phases": {
        "wait": 216,
        "dns": 107950,
        "total": 108206
      }
    },
    "event": "request",
    "message": "Timeout awaiting 'request' for 60000ms",
    "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n    at ClientRequest.<anonymous> (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n    at Object.onceWrapper (events.js:520:26)\n    at ClientRequest.emit (events.js:412:35)\n    at ClientRequest.emit (domain.js:475:12)\n    at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n    at TLSSocket.socketErrorListener (_http_client.js:475:9)\n    at TLSSocket.emit (events.js:400:28)\n    at TLSSocket.emit (domain.js:475:12)\n    at emitErrorNT (internal/streams/destroy.js:106:8)\n    at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)\n    at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (internal/timers.js:559:11)\n    at processTimers (internal/timers.js:500:7)",
    "options": {
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "accept": "application/vnd.github.machine-man-preview+json",
        "authorization": "***********",
        "content-type": "application/json",
        "content-length": "621",
        "accept-encoding": "gzip, deflate, br"
      },
      "url": "https://api.github.com/graphql",
      "hostType": "github-releases",
      "username": "",
      "password": "",
      "method": "POST",
      "http2": false
    }
  }
}

viceice avatar Aug 31 '22 08:08 viceice

created a fork at https://github.com/renovate-reproductions/17483-helm-charts-timeout

viceice avatar Aug 31 '22 08:08 viceice

@dbasilio can you try to set a higher timeout^1 or limit concurrency^2 for npm registry?

viceice avatar Aug 31 '22 09:08 viceice

@viceice We've tried a timeout of 120s, and a concurrency limit of 2 individually with no change. Going to try both together

dbasilio avatar Sep 01 '22 14:09 dbasilio

Both settings together don't seem to help, we're still getting the external-host-error more often than note. Anecdotally, it does seem to fail on "later" packages alphabetically

{
    "hostType": "npm",
    "matchHost": "https://registry.npmjs.org/",
    "timeout": 120000,
    "concurrentRequestLimit": 2
}

dbasilio avatar Sep 01 '22 15:09 dbasilio

I'm going to use this issue as a general performance tuning issue. It seems some of the issues are environmental/app-related, because we can't replicate them when running locally yet they occur consistently in the hosted app. This may be because of:

  • Docker itself
  • The Ubuntu base image
  • AWS infra e.g. DNS
  • Memory or CPU limitations

rarkins avatar Sep 02 '22 08:09 rarkins

Some useful links:

  • https://medium.com/swlh/solving-node-dns-issues-and-other-things-5051d8526cac
  • https://medium.com/trendyol-tech/nodejs-performance-best-practices-more-than-cliche-9baa573cbf03
  • https://oracle.github.io/node-oracledb/doc/api.html#numberofthreads
  • https://dev.to/johnjardincodes/increase-node-js-performance-with-libuv-thread-pool-5h10
  • https://github.com/node-modules/agentkeepalive
  • https://medium.com/@amirilovic/how-to-fix-node-dns-issues-5d4ec2e12e95

rarkins avatar Sep 02 '22 08:09 rarkins

This tried so far:

  • [x] Custom DNS cache
  • [x] Custom DNS settings in env: RES_OPTIONS: 'ndots:3 retrans:1000 retry:3 rotate'
  • [x] Enable http2 for api.github.com
  • [x] Enable http2 for public npm registry

rarkins avatar Sep 02 '22 08:09 rarkins

Since it's been running for a few days now with

{
    "hostType": "npm",
    "matchHost": "https://registry.npmjs.org/",
    "timeout": 120000,
    "concurrentRequestLimit": 2
}

I've noticed that it started hitting a timeout more than the external host error. We are still getting the occasional success but most often it fails. image

dbasilio avatar Sep 06 '22 14:09 dbasilio

Looking at the last "done" run, here are the starts:

  "hostStats": {
    "api.github.com": {
      "requestCount": 36,
      "requestAvgMs": 460,
      "queueAvgMs": 0
    },
    "registry.npmjs.org": {
      "requestCount": 103,
      "requestAvgMs": 19024,
      "queueAvgMs": 0
    },
    "xxx.jfrog.io": {
      "requestCount": 4,
      "requestAvgMs": 23116,
      "queueAvgMs": 0
    }
  },

Even though concurrentRequestLimit was configured, we see no queueAvgMs for registry.npmjs.org plus very high requestAvgMs for both npm and jfrog.

rarkins avatar Sep 09 '22 04:09 rarkins

I found the queue problem, see #17713. Please remove hostType from your host rule

rarkins avatar Sep 09 '22 04:09 rarkins

Updated my hostrule to:

{
    "matchHost": "registry.npmjs.org/",
    "concurrentRequestLimit": 2,
    "timeout": 120000
,

(and then the same for our jfrog registry as well in a separate host rule.

We're still getting mostly external host errors image

dbasilio avatar Sep 12 '22 16:09 dbasilio

Try removing all limits again

rarkins avatar Sep 12 '22 17:09 rarkins

No change image

dbasilio avatar Sep 13 '22 15:09 dbasilio

What's the details for the errors you get now? Eg ETIMEDOUT, ECONNRESET, etc. pasting details here would be helpful

rarkins avatar Sep 13 '22 15:09 rarkins

External Host Error

{
  "hostType": "npm",
  "packageName": "@datadog/browser-logs",
  "err": {
    "name": "TimeoutError",
    "code": "ETIMEDOUT",
    "timings": {
      "start": 1663081515075,
      "socket": 1663081515076,
      "lookup": 1663081519436,
      "connect": 1663081519436,
      "secureConnect": 1663081522929,
      "upload": 1663081522931,
      "response": 1663081523051,
      "error": 1663081595475,
      "abort": 1663081635691,
      "phases": {
        "wait": 1,
        "dns": 4360,
        "tcp": 0,
        "tls": 3493,
        "request": 2,
        "firstByte": 120,
        "total": 120616
      }
    },
    "event": "request",
    "message": "Timeout awaiting 'request' for 60000ms",
    "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n    at ClientRequest.<anonymous> (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n    at Object.onceWrapper (node:events:628:26)\n    at ClientRequest.emit (node:events:525:35)\n    at ClientRequest.emit (node:domain:489:12)\n    at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n    at TLSSocket.socketErrorListener (node:_http_client:481:9)\n    at TLSSocket.emit (node:events:513:28)\n    at TLSSocket.emit (node:domain:489:12)\n    at emitErrorNT (node:internal/streams/destroy:157:8)\n    at emitErrorCloseNT (node:internal/streams/destroy:122:3)\n    at processTicksAndRejections (node:internal/process/task_queues:83:21)\n    at runNextTicks (node:internal/process/task_queues:65:3)\n    at listOnTimeout (node:internal/timers:528:9)\n    at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (node:internal/timers:561:11)\n    at processTimers (node:internal/timers:502:7)",
    "options": {
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br"
      },
      "url": "https://registry.npmjs.org/@datadog%2Fbrowser-logs",
      "hostType": "npm",
      "username": "",
      "password": "",
      "method": "GET",
      "http2": false
    },
    "response": {
      "statusCode": 200,
      "statusMessage": "OK",
      "headers": {
        "date": "Tue, 13 Sep 2022 15:05:22 GMT",
        "content-type": "application/json",
        "transfer-encoding": "chunked",
        "connection": "close",
        "cf-ray": "74a1c7de2db7680a-SEA",
        "access-control-allow-origin": "*",
        "age": "77",
        "cache-control": "public, max-age=300",
        "etag": "W/\"3786e68d0270801401b11f9b4ac6d624\"",
        "last-modified": "Tue, 13 Sep 2022 11:35:42 GMT",
        "vary": "accept-encoding, accept",
        "cf-cache-status": "HIT",
        "x-amz-replication-status": "COMPLETED",
        "server": "cloudflare",
        "content-encoding": "gzip"
      },
      "httpVersion": "1.1"
    }
  }
}

And this is all the INFO level logs for the timeout I posted above

INFO: Repository started
{
  "renovateVersion": "32.194.5"
}
INFO: Dependency extraction complete
{
  "baseBranch": "master",
  "stats": {
    "managers": {
      "dockerfile": {
        "fileCount": 1,
        "depCount": 1
      },
      "npm": {
        "fileCount": 140,
        "depCount": 4456
      },
      "nvm": {
        "fileCount": 1,
        "depCount": 1
      }
    },
    "total": {
      "fileCount": 142,
      "depCount": 4458
    }
  }
}
WARN: Job timeout

dbasilio avatar Sep 13 '22 18:09 dbasilio

Have you configured a 60s timeout?

rarkins avatar Sep 13 '22 21:09 rarkins

We do not have a timeout configured anywhere in our config

dbasilio avatar Sep 13 '22 21:09 dbasilio

We've had a 4 minute timeout configured for a number of days. Strange that your log shows 60s - when was it captured?

Note: the file in question shouldn't take so long either way

rarkins avatar Sep 13 '22 21:09 rarkins

These are the logs for the last day, with no change to our config: image

The errors were respectively:

  • Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
  • Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
  • Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
  • Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
  • Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
  • Fetching @sentry/types - Client network socket disconnected before secure TLS connection was established
  • Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
  • Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
  • Fetching @sentry/types - Timeout awaiting 'request' for 60000ms

All runs used renovate version 32.194.5. I also double confirmed that we are not specifying a timeout anywhere in our shared configs.

dbasilio avatar Sep 14 '22 19:09 dbasilio

Strange, it seems to be inheriting the default 60s hostRules.timeout instead of the 240s rule we've set in the app. Any change if you then try to apply a longer timeout in your own config? e.g.

"hostRules": [{
  "timeout": 240000
}]

rarkins avatar Sep 16 '22 09:09 rarkins

We've had it running all weekend with a 4m timeout and this is what it looks like image

The failures were (in order):

  • @types/svg-sprite-loader - Client network socket disconnected before secure TLS connection was established
  • @yarnpkg/shell - socket hang up
  • @types/event-emitter - Client network socket disconnected before secure TLS connection was established
  • @datadog/browser-logs - socket hang up
  • @juggle/resize-observer - socket hang up
  • monodeploy - Client network socket disconnected before secure TLS connection was established
  • @types/react-tooltip - Client network socket disconnected before secure TLS connection was established
  • @svgr/plugin-svgo - socket hang up
  • the last two errors were before the 4m change

dbasilio avatar Sep 19 '22 14:09 dbasilio

We've been running with the 4m change for a week now, we're still seeing intermittent failures but it's now completing often enough that we're getting PRs through so most of my concern has been alleviated.

Is there a problem with increasing the timeout further? Like 5 or 6 minutes? Is there a max on the timeout? I don't want to further increase the timeout if it's going to negatively impact the running of the app.

image

dbasilio avatar Sep 26 '22 14:09 dbasilio

No problem with you trying a higher number. Something is clearly going wrong somewhere in the chain and causing rate limiting, but it's good if it eventually completes.

rarkins avatar Sep 26 '22 14:09 rarkins

@dbasilio I'm seeing only occasional failures for your repo now, can you confirm the same from your side?

Only ~1 failure in the last day, for the record it was still ECONNRESET, and I'm surprised to see that Renovate did retry 2 times:

{"level":20,"msg":"GET https://registry.npmjs.org/@yarnpkg%2Fcore = (code=ECONNRESET, statusCode=-1 retryCount=2, duration=17249)","time":"2022-10-22T14:06:04.321Z"}
{"level":20,"msg":"GET https://registry.npmjs.org/@yarnpkg%2Ffslib = (code=ECONNRESET, statusCode=-1 retryCount=2, duration=17446)","time":"2022-10-22T14:06:04.515Z"}

rarkins avatar Oct 23 '22 05:10 rarkins

We're still seeing a decent amount of failures, but it's completing frequently enough that we're at least getting PRs put up image

dbasilio avatar Oct 27 '22 15:10 dbasilio

We're seeing success often enough now that I think we can close this issue out image Thanks for all the help!

dbasilio avatar Nov 30 '22 14:11 dbasilio