loki icon indicating copy to clipboard operation
loki copied to clipboard

promtail/server: Disable profiling by default

Open BenoitKnecht opened this issue 3 years ago • 18 comments

What this PR does / why we need it:

Add a server.debug option (that defaults to false) to enable the /debug/fgprof endpoint in Promtail.

Which issue(s) this PR fixes: Fixes #5005

Checklist

  • [x] Documentation added
  • [ ] Tests updated
  • [x] Add an entry in the CHANGELOG.md about the changes.

BenoitKnecht avatar Feb 15 '22 15:02 BenoitKnecht

Given that profiling is always on at the moment and we make use of it ourselves I think it might be better to default to enabled and allow disabling via the config option you're adding here.

cstyan avatar Feb 16 '22 02:02 cstyan

So I can rename the option. Personally, I prefer simply profiling, as the _enabled suffix doesn't really carry much information. But I don't feel too strongly about it, so let me know what you prefer.

As for the default value, I'm surprised that there seems to be a majority of people arguing to keep it enabled by default. It seems very counter-intuitive to me that a software would run with debugging endpoints enabled by default. This is almost certainly not what most people want to do in production, and the principle of least surprise should apply here. For use cases where it's desirable to have them enabled, it can be done explicitly.

Finally, I can modify this PR to also disable profiling in Loki, but I don't know about all the other Cortex or Tempo projects that may be affected as well.

BenoitKnecht avatar Feb 22 '22 13:02 BenoitKnecht

As for the default value, I'm surprised that there seems to be a majority of people arguing to keep it enabled by default. It seems very counter-intuitive to me that a software would run with debugging endpoints enabled by default. This is almost certainly not what most people want to do in production, and the principle of least surprise should apply here. For use cases where it's desirable to have them enabled, it can be done explicitly.

IMO, when running a distributed system like Loki there are issues that occur rarely, and having to explicitly enable instrumentation and then wait for that issue to pop up again in order to have the information you'll need in order to debug and diagnose is counter-intuitive.

cstyan avatar Feb 22 '22 16:02 cstyan

I think there is a bit of difference between server applications like Loki and Tempo, and promtail which is styled as an agent. You know you need to protect your servers from unwanted incoming calls; this may not be obvious for an agent.

BTW I am a maintainer of weaveworks/common and can look at a PR to make profiling optional in the “server” package there.

bboreham avatar Feb 22 '22 16:02 bboreham

I think there is a bit of difference between server applications like Loki and Tempo, and promtail which is styled as an agent. You know you need to protect your servers from unwanted incoming calls; this may not be obvious for an agent.

That's very true! In our case, Loki is behind an authenticating proxy, so the /debug/* endpoints are not accessible anyway, but promtail is deployed pretty much everywhere so it's quite impractical to restrict access to those endpoints systematically.

BTW I am a maintainer of weaveworks/common and can look at a PR to make profiling optional in the “server” package there.

Are you saying you'd be open to PR submissions, or did you want to take a stab at it yourself?

BenoitKnecht avatar Feb 23 '22 12:02 BenoitKnecht

Are you saying you'd be open to PR submissions, or did you want to take a stab at it yourself?

The former.

bboreham avatar Feb 23 '22 13:02 bboreham

Somehow during all of this I missed that this PR is for promtail.

I would still say that we should keep the default as profiling on for other components (loki microservices) but feel less strongly about it for promtail itself.

cstyan avatar Feb 25 '22 21:02 cstyan

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Apr 16 '22 03:04 stale[bot]

I think with an additional commit to fix the changelog conflict we could merge this?

cstyan avatar Apr 20 '22 05:04 cstyan

@cstyan I rebased on main, so the conflict should be fixed now.

BenoitKnecht avatar Apr 28 '22 09:04 BenoitKnecht

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Jun 12 '22 16:06 stale[bot]

please keep it open

darxriggs avatar Jun 12 '22 16:06 darxriggs

I've rebased again; could someone merge it, or is there anything else?

BenoitKnecht avatar Jul 07 '22 05:07 BenoitKnecht

@chaudum @cyriltovena Could you take a look? It would be great to get this merged soon.

BenoitKnecht avatar Jul 20 '22 09:07 BenoitKnecht

Seems that we agreed that we want to disable the profiling endpoints by default, so basically good to go.

However, regarding the configuration parameter naming, I would want to change it to profiling_enabled, with the _enabled suffix. I'm not sure if we are 100% consistent, but most of the config options that enable/disable stuff have the suffix as well.

Then I'm happy to merge the PR :)

chaudum avatar Jul 21 '22 06:07 chaudum

@BenoitKnecht any updates? Would you be able to make those last few changes requested by @chaudum? Thanks!

trevorwhitney avatar Sep 09 '22 15:09 trevorwhitney

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

grafanabot avatar Sep 15 '22 05:09 grafanabot

@trevorwhitney Sorry for the huge delay. I finally got around to make the requested changes. Let me know if there's anything else.

BenoitKnecht avatar Sep 15 '22 05:09 BenoitKnecht

I've rebased once again. Can we get this merged before it diverges again?

BenoitKnecht avatar Oct 25 '22 07:10 BenoitKnecht

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

grafanabot avatar Nov 08 '22 22:11 grafanabot

@trevorwhitney There's been two releases of Loki since you merged this into main, but somehow this change isn't part of either of them. Is there something I need to do to get it pulled into the next release?

BenoitKnecht avatar Jan 27 '23 08:01 BenoitKnecht

Two patch releases. It will be in the next minor release.

bboreham avatar Jan 27 '23 14:01 bboreham

@BenoitKnecht FYI https://github.com/grafana/loki/issues/8797, looking like the end of next week, or early the week after for release.

trevorwhitney avatar Mar 16 '23 20:03 trevorwhitney

@BenoitKnecht FYI #8797, looking like the end of next week, or early the week after for release.

Hey @trevorwhitney, that's great news, thanks for letting me know!

BenoitKnecht avatar Mar 20 '23 14:03 BenoitKnecht