go
go copied to clipboard
services/horizon: prevent extra error log output due to invocations of core status on closed ctx
PR Checklist
PR Structure
- [x] This PR has reasonably narrow scope (if not, break it down into smaller PRs).
- [x] This PR avoids mixing refactoring changes with feature changes (split into two PRs otherwise).
- [x] This PR's title starts with name of package that is most changed in the PR, ex.
services/friendbot
, orall
ordoc
if the changes are broad or impact many packages.
Thoroughness
- [x] This PR adds tests for the most critical parts of the new functionality or fixes.
- [ ] I've updated any docs (developer docs,
.md
files, etc... affected by this change). Take a look in thedocs
folder for a given service, like this one.
Release planning
- [ ] I've updated the relevant CHANGELOG (here for Horizon) if needed with deprecations, added features, breaking changes, and DB schema changes.
- [ ] I've decided if this PR requires a new major/minor version according to semver, or if it's mainly a patch change. The PR is targeted at the next release branch if it's not a patch change.
What
when debugging prior issues on staging, noticed the logs tended to be full of these types of outputs:
time="2022-03-03T17:45:36.920Z" level=warning msg="could not load stellar-core info: http request errored: Get \"http://127.0.0.1:11626/info\": context deadline exceeded" pid=1443500
time="2022-03-03T17:45:36.920Z" level=error msg="failed to load the stellar-core info" err="http request errored: Get \"http://127.0.0.1:11626/info\": context deadline exceeded" pid=1443500 stack="[main.go:43 client.go:67 app.go:223 app.go:435 asm_amd64.s:1581]"
time="2022-03-03T17:45:38.085Z" level=error msg="Cannot connect to Captive Stellar-Core HTTP server" error="http request errored: Get \"http://localhost:11626/info\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" pid=1443500 service=ingest
Why
most of the logged error output is related to prometheus metrics exporting attempts and probably not providing any value when in this state, since the context is closed, don't need to make the http attempts which fast-fail due to the ctx anyways
Closes #4259
Known limitations
After looking at it again: I'm not sure if the log outputs you share will be gone after this change because it seems that once the methods start to be executed the context is not cancelled/closed yet.
After looking at it again: I'm not sure if the log outputs you share will be gone after this change because it seems that once the methods start to be executed the context is not cancelled/closed yet.
Ok, here's a log excerpt to see this pattern in effect, the other way these could be hitting deadline errors if the caller's ctx isn't already deadlined is that core's http service isn't responding and really timing out?
In any of these cases, maybe the err log output at this point is not super helpful, since the err is being throw back to caller, ultimately caller would log something higher in stack if it is really meaningful, i.e. another suggestion I'm wondering about - what If we just remove the err log line in these 3 instances instead?
[edit] disregard my statement there about:
since the err is being throw back to caller, ultimately caller would log something higher in stack if it is really meaningful
since that is not correct, these instances are just short-circuiting at point of failure after attempting to get core status info from http request and returning with no error. I still am suggesting if we can consider to remove that error log line output, as not sure on it's overall value for troubleshooting/diagnostic.
@bartekn
wanted to follow up on your feedback, the log output from these points refers to context deadline exceeded
, but it's not clear whether it originates from the caller repeatedly passing an expired ctx or if the core http service was unresponsive and timed out repeatedly. If this change to reduce that log output pattern seems marginal or non-improvement to that case, I will close this pr.
@sreuland unless you have any concrete plan to keep working on this one, could you close this PR ?
yes, this effort was more of a poc, didn't seem to add intended value for the change and has rotted, will close.