envoy RFC admin: hierarchical stats viewer

Commit Message: Adds default hierarchical view for admin stats accessed by clicking 'stats' from the admin home page. Note that the existing /stats API is unchanged.

This is not ready for approval/submit yet as the new endpoints lack unit tests, and the UI (if we keep it) is ugly (unstyled). But I wanted to get some comments on the concept, particularly as it relates to solving #18675 before I clean all that up.

Also needed before this PR can be merged:

proper escaping of special chars embedded in scope names, wherever arbitrary text is inserted via C++ or JS.
When navigating within the outline, record HTML history
Add JS tests of some sort
Review Scope lifetime and make sure we don't risk referencing scopes as they are being deleted by a config update
When "usedonly" mode is on, find some mechanism to filter out scopes with no used stats. Same for regex filtering.
Break up into about 3 PRs.
Add option declarations for other admin pages besides stats.
test coverage for stats_hander.cc and admin_html_generator.cc.

Additional Description: Risk Level: Testing: Docs Changes: Release Notes: Platform Specific Features: Fixes: #18675

Oct 19 '21 00:10 jmarantz

As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/18670 was opened by jmarantz.

see: more, trace.

Oct 19 '21 00:10 repokitteh-read-only[bot]

/retest

Oct 19 '21 01:10 jmarantz

Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/18670#issuecomment-946285523 was created by @jmarantz.

see: more, trace.

Oct 19 '21 01:10 repokitteh-read-only[bot]

I think for only solving browser memory, the existing filtering on the stats endpoint would be sufficient to reduce the results size in a flexible way.

I'm assuming you took this approach because on the server side, using a regex to filter all the stats is expensive. Is that correct?

One alternative idea to allow more flexibility in which stats are grouped together: Let the user define a list of groups in their bootstrap config, and use existing filtering methods to add stats to zero or more groups (selected at stat creation time), and add an api for querying by group name.

My concern with this approach is that we're tying an external admin API to an internal code API used for managing lifetimes.

Oct 20 '21 21:10 ggreenway

/wait-any

Oct 20 '21 21:10 ggreenway

You are right: browser memory is solved with filtering, but there's 2 issues:

you might not know what to filter until you've already hit /stats and gotten yourself in trouble
if you are hitting the endpoint by visiting the admin-port in the browser and just clicking, there's no convenient way to know you can add the ?filter= argument. I'm thinking a bit richer UI on the main admin page might help.

And still doesn't solve server-side memory bloat (with store.counters() aggregating all of them), or the potentially 0.5 seconds you lock up the stats-allocator mutex and the main thread running through all the data, with the potential impact on p99 latency.

Oct 20 '21 21:10 jmarantz

you might not know what to filter until you've already hit /stats and gotten yourself in trouble

Adding another way to get stats won't solve that problem. I think the only way to solve that is to by default have a limit on the number of stats returned, with a query parameter to override.

if you are hitting the endpoint by visiting the admin-port in the browser and just clicking, there's no convenient way to know you can add the ?filter= argument. I'm thinking a bit richer UI on the main admin page might help.

Yeah, I think improving that webpage would be an easy win in that case. Have the link by default get /stats?limit=1000 or something (and add limit to the /stats handler).

And still doesn't solve server-side memory bloat (with store.counters() aggregating all of them), or the potentially 0.5 seconds you lock up the stats-allocator mutex and the main thread running through all the data, with the potential impact on p99 latency.

Agreed. I think it makes a lot of sense to have a way to reduce server resources for fetching huge numbers of stats.

Oct 20 '21 22:10 ggreenway

I think for limit=1000 to really help we need to change the map in Stats::AllocatorImpl from absl::flat_hash_map to absl::btree_map, so the 1000 you get are not just some random assortment based on the hash order.

That would also enable a paging interface.

But I really like the idea of having the console-based default be limited like that. That way the raw "/stats" endpoint doesn't need to change its behavior, which might be a breaking change for some deployments and their scripts.

Oct 21 '21 00:10 jmarantz

Here's another idea: have a way for the client to enumerate which stats to return for each call.

The flow would be:

Client does a call to get the list of all the stat names
Client calls an api that takes an argument for the list of stats to return, so the client can implement it's own pagination/batching/whatever. This would also allow efficiently polling different stats at different intervals.

I'm thinking the api would take a list of exact names, not regular expressions or patterns, so it would be very efficient to fetch them.

This solution wouldn't require any changes to the stats infrastructure, just some new admin APIs.

Nov 17 '21 18:11 ggreenway

@ggreenway sorry I missed your suggestion above. How would the client know which names to ask for?

I agree though that having an efficient admin query for a specific list of stat-names would be helpful as it would be O(NumNames) rather than having to iterate over the store.

But I don't see how you can build a paginated interface over that.

Dec 05 '21 18:12 jmarantz

How would the client know which names to ask for?

We'd need a simple API to get the list of all stat names, which would only need to be called once after a config change.

But I don't see how you can build a paginated interface over that.

It's similar to asking for specific scopes, except the client would have to ask for specific stat names. This way, we're not exposing scopes (an internal detail) to the API.

That being said, I don't know if it would solve your problem or not.

/wait-any

Dec 14 '21 19:12 ggreenway

I'm still not sure I like the idea of using an internal concept of Stat Scope and exposing it externally. This limits our ability to make internal changes in the future. What if the notion of Scope as it currently exists goes away entirely? At a minimum, we'd need to clearly document that there is no guarantee that the names of the scopes, or which stats are in which scopes, or how many stats are in each scope, are stable between releases.

@mattklein123 any opinions on this?

Jan 06 '22 16:01 ggreenway

FWIW I have forked this PR into https://github.com/envoyproxy/envoy/pull/19413 which adds paging, but drops the exposure of scopes. I think that PR still has a ways to go in terms of code simplification, test coverage, and breaking it up into a bunch of incremental PRs that are reviewable.

But I'll drop this scoping PR for now; I think it's not needed.

Jan 06 '22 16:01 jmarantz

Per slack discussion I am re-opening this idea of navigating based on scope, but we can avoid making the scopes explicit to users. 2 options:

pagination based on scopes without really describing what they are
hierarchical view based on each token in the stat name, so that "cluster1" could be expanded one segment at a time. This would, behind the scenes, use scopes to bound the amount of stats that need to come int memory during each operation.

The pagination based on scopes would be easier. We could do the hierarchical view as a follow-up if that's useful.

Jan 07 '22 00:01 jmarantz

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Feb 21 '22 16:02 github-actions[bot]

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Mar 02 '22 00:03 github-actions[bot]

Not stale -- just needs some other things to merge first :)

Mar 02 '22 00:03 jmarantz

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Apr 01 '22 04:04 github-actions[bot]

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Apr 08 '22 08:04 github-actions[bot]

I think this will actually see the light of day; probably it makes sense to merge-main on this and open it up after https://github.com/envoyproxy/envoy/pull/19546 lands, which will probably be a bit longer as it's got some sprawling implementation changes and UI implications that need alignment.

Apr 08 '22 13:04 jmarantz

actually I'm going to close this again. Having worked hard on making filtering relatively fast, I think the flat view is probably OK. What's really needed more (based on recent experience) is to be able to sort by recency or frequency of changes, which can be done in the browser by polling json.

I'll start a new PR for that.

Fortunately the support for this will all be buried in the admin-html conditional compilation and will not affect envoy-mobile.

Feb 15 '23 15:02 jmarantz

envoy envoy copied to clipboard

RFC admin: hierarchical stats viewer

envoy
envoy copied to clipboard