go icon indicating copy to clipboard operation
go copied to clipboard

services/horizon: prevent extra error log output due to invocations of core status on closed ctx

Open sreuland opened this issue 2 years ago • 3 comments

PR Checklist

PR Structure

  • [x] This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • [x] This PR avoids mixing refactoring changes with feature changes (split into two PRs otherwise).
  • [x] This PR's title starts with name of package that is most changed in the PR, ex. services/friendbot, or all or doc if the changes are broad or impact many packages.

Thoroughness

  • [x] This PR adds tests for the most critical parts of the new functionality or fixes.
  • [ ] I've updated any docs (developer docs, .md files, etc... affected by this change). Take a look in the docs folder for a given service, like this one.

Release planning

  • [ ] I've updated the relevant CHANGELOG (here for Horizon) if needed with deprecations, added features, breaking changes, and DB schema changes.
  • [ ] I've decided if this PR requires a new major/minor version according to semver, or if it's mainly a patch change. The PR is targeted at the next release branch if it's not a patch change.

What

when debugging prior issues on staging, noticed the logs tended to be full of these types of outputs:

time="2022-03-03T17:45:36.920Z" level=warning msg="could not load stellar-core info: http request errored: Get \"http://127.0.0.1:11626/info\": context deadline exceeded" pid=1443500

time="2022-03-03T17:45:36.920Z" level=error msg="failed to load the stellar-core info" err="http request errored: Get \"http://127.0.0.1:11626/info\": context deadline exceeded" pid=1443500 stack="[main.go:43 client.go:67 app.go:223 app.go:435 asm_amd64.s:1581]"

time="2022-03-03T17:45:38.085Z" level=error msg="Cannot connect to Captive Stellar-Core HTTP server" error="http request errored: Get \"http://localhost:11626/info\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" pid=1443500 service=ingest

Why

most of the logged error output is related to prometheus metrics exporting attempts and probably not providing any value when in this state, since the context is closed, don't need to make the http attempts which fast-fail due to the ctx anyways

Closes #4259

Known limitations

sreuland avatar Mar 03 '22 23:03 sreuland

After looking at it again: I'm not sure if the log outputs you share will be gone after this change because it seems that once the methods start to be executed the context is not cancelled/closed yet.

bartekn avatar Mar 08 '22 08:03 bartekn

After looking at it again: I'm not sure if the log outputs you share will be gone after this change because it seems that once the methods start to be executed the context is not cancelled/closed yet.

Ok, here's a log excerpt to see this pattern in effect, the other way these could be hitting deadline errors if the caller's ctx isn't already deadlined is that core's http service isn't responding and really timing out?

In any of these cases, maybe the err log output at this point is not super helpful, since the err is being throw back to caller, ultimately caller would log something higher in stack if it is really meaningful, i.e. another suggestion I'm wondering about - what If we just remove the err log line in these 3 instances instead?

[edit] disregard my statement there about:

since the err is being throw back to caller, ultimately caller would log something higher in stack if it is really meaningful

since that is not correct, these instances are just short-circuiting at point of failure after attempting to get core status info from http request and returning with no error. I still am suggesting if we can consider to remove that error log line output, as not sure on it's overall value for troubleshooting/diagnostic.

sreuland avatar Mar 08 '22 18:03 sreuland

@bartekn

wanted to follow up on your feedback, the log output from these points refers to context deadline exceeded , but it's not clear whether it originates from the caller repeatedly passing an expired ctx or if the core http service was unresponsive and timed out repeatedly. If this change to reduce that log output pattern seems marginal or non-improvement to that case, I will close this pr.

sreuland avatar Mar 21 '22 17:03 sreuland

@sreuland unless you have any concrete plan to keep working on this one, could you close this PR ?

tsachiherman avatar Mar 16 '23 19:03 tsachiherman

yes, this effort was more of a poc, didn't seem to add intended value for the change and has rotted, will close.

sreuland avatar Mar 16 '23 20:03 sreuland