vitess icon indicating copy to clipboard operation
vitess copied to clipboard

Add sql text counts stats to `vtcombo`,`vtgate`+`vttablet`

Open timvaillancourt opened this issue 1 year ago • 8 comments

Description

This PR adds stats for the size of query text processed by vtgate and vttablet

This is useful to understand large queries that potentially cause performance, grpc message size, bandwidth problems

Related Issue(s)

Resolves https://github.com/vitessio/vitess/issues/15929

Checklist

  • [x] "Backport to:" labels have been added if this change should be back-ported to release branches
  • [x] If this change is to be back-ported to previous releases, a justification is included in the PR description
  • [x] Tests were added or are not required
  • [ ] Did the new or modified tests pass consistently locally and on CI?
  • [ ] Documentation was added or is not required

Deployment Notes

timvaillancourt avatar May 09 '24 01:05 timvaillancourt

Review Checklist

Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.

General

  • [ ] Ensure that the Pull Request has a descriptive title.
  • [ ] Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • [ ] Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • [ ] Apply the release notes (needs details) label if users need to know about this change.
  • [ ] New features should be documented.
  • [ ] There should be some code comments as to why things are implemented the way they are.
  • [ ] There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • [ ] Is this flag really necessary?
  • [ ] Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • [ ] Each item in Jobs should be named in order to mark it as required.
  • [ ] If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • [ ] Protobuf changes should be wire-compatible.
  • [ ] Changes to _vt tables and RPCs need to be backward compatible.
  • [ ] RPC changes should be compatible with vitess-operator
  • [ ] If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • [ ] vtctl command output order should be stable and awk-able.

vitess-bot[bot] avatar May 09 '24 01:05 vitess-bot[bot]

Opened this one with a lot of failing CI to see if others have ideas how to debug

The failures are confusing and feel unrelated to this change 🤔

timvaillancourt avatar May 13 '24 14:05 timvaillancourt

Opened this one with a lot of failing CI to see if others have ideas how to debug

The failures are confusing and feel unrelated to this change 🤔

There failures are related to the changes made. I have pushed a commit that fixes some of those. You should run the test locally and fix the remaining if any.

harshit-gangal avatar May 14 '24 08:05 harshit-gangal

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.22%. Comparing base (0353ad4) to head (0531ce1). Report is 59 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15897      +/-   ##
==========================================
- Coverage   68.45%   68.22%   -0.24%     
==========================================
  Files        1559     1541      -18     
  Lines      196825   197330     +505     
==========================================
- Hits       134736   134627     -109     
- Misses      62089    62703     +614     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 14 '24 08:05 codecov[bot]

Opened this one with a lot of failing CI to see if others have ideas how to debug The failures are confusing and feel unrelated to this change 🤔

There failures are related to the changes made. I have pushed a commit that fixes some of those.

@harshit-gangal indeed it is, thanks! 🤦 🙇

You should run the test locally and fix the remaining if any.

Good point, I've made a bad habit of relying on GitHub Actions CI because many go test CIs fail on my MacBook when they rely on external things (perhaps not this test). I'll give running CI locally another shot 👀

Somehow this was not obvious from the CI outputs I looked at, including this Unit test run that ends in just:

+ go-junit-report -set-exit-code
make: *** [Makefile:209: unit_test] Error 1
Error: Process completed with exit code 1.

Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative

timvaillancourt avatar May 14 '24 11:05 timvaillancourt

Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative

The first step records the output, if you click on the step below in the CI you will see the output.

harshit-gangal avatar May 17 '24 15:05 harshit-gangal

Screenshot 2024-05-17 at 8 54 54 PM

harshit-gangal avatar May 17 '24 15:05 harshit-gangal

Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative

The first step records the output, if you click on the step below in the CI you will see the output.

@harshit-gangal ahh, that's where it goes! Thanks a lot ~~more on this one soon~~

timvaillancourt avatar May 17 '24 21:05 timvaillancourt

I removed the website docs label. We don't actually have a page that lists all available metrics, so there's no good place to add docs for new metrics.

deepthi avatar May 31 '24 16:05 deepthi

Since the PR has two approvals I am adding the Do Not Merge label so we don't merge this by accident before @deepthi's concern is answered.

frouioui avatar May 31 '24 21:05 frouioui

@deepthi / @frouioui, here are the metric outputs:

tvaillancourt@REDACTED:~$ curl -s localhost:15000/debug/vars | jq .QuerySQLTextCounts
{
  "Join.Select.": 3234,
  "dual.Select.": 3615
}
tvaillancourt@REDACTED:~$ curl -s localhost:15000/metrics 2>&1 | grep vttablet_query_sql_text_counts
# HELP vttablet_query_sql_text_counts query sql text counts
# TYPE vttablet_query_sql_text_counts counter
vttablet_query_sql_text_counts{plan="Select",table="Join",workload=""} 3234
vttablet_query_sql_text_counts{plan="Select",table="dual",workload=""} 3650

So looks good 👍. While doing that I thought of two potential alternatives in case this didn't work: QueryTextProcessed or QueryTextRead, any thoughts?

timvaillancourt avatar Jun 01 '24 15:06 timvaillancourt

Somewhat repeating @frouioui , will merge once @deepthi 's concern is addressed (likely you need o change ...SQL... to ...Sql... and will then merge.

@shlomi-noach the potential issue of s_q_l that Deepthi pointed out didn't seem to be a problem in practice, here is the metric output:

tvaillancourt@REDACTED:~$ curl -s localhost:15000/metrics 2>&1 | grep vttablet_query_sql_text_counts
# HELP vttablet_query_sql_text_counts query sql text counts
# TYPE vttablet_query_sql_text_counts counter
vttablet_query_sql_text_counts{plan="Select",table="Join",workload=""} 3234
vttablet_query_sql_text_counts{plan="Select",table="dual",workload=""} 3650

Let me know if any change is needed. I'll add a changelog/ note once we agree on the metric name and I think we're done! 🙇

timvaillancourt avatar Jun 03 '24 14:06 timvaillancourt

@shlomi-noach please feel free to dismiss my review if @timvaillancourt is able complete the requested changes before code freeze.

deepthi avatar Jun 03 '24 15:06 deepthi

OK, now I understand the intent of this metric, it won't work as implemented. Counters should only be used for monotonically increasing numbers, prometheus will not let you reduce the value of a "counter".

@deepthi the intent of this metric is a counter that increments forever so we can infer the rate of query text processed at any given time. I don't believe there is a need to ever reduce this metric, it only increases from the time the daemon starts like other counters

EDIT: here we can see the counter is only incremented by 0 or more: vtg.sqlTextCounts.Add(statsKey, int64(len(sql)))

timvaillancourt avatar Jun 03 '24 15:06 timvaillancourt

OK, now I understand the intent of this metric, it won't work as implemented. Counters should only be used for monotonically increasing numbers, prometheus will not let you reduce the value of a "counter".

@deepthi the intent of this metric is a counter that increments forever so we can infer the rate of query text processed at any given time. I don't believe there is a need to ever reduce this metric, it only increases from the time the daemon starts like other counters

EDIT: here we can see the counter is only incremented by 0 or more: vtg.sqlTextCounts.Add(statsKey, int64(len(sql)))

I think the term count is misleading in this context. Count is typically how many of an entity we encountered, so for example total number of queries. Here we are adding the length of the query each time. A counter will work for this, but I do request that the name be changed to avoid this kind of confusion.

deepthi avatar Jun 03 '24 15:06 deepthi

@timvaillancourt to summarize the current state: the changes make sense, and the metric looks correct. But if you could please rename it to something such as QueryProcessedTextLengths? And with that we can merge the PR.

shlomi-noach avatar Jun 03 '24 16:06 shlomi-noach

But if you could please rename it to something such as QueryProcessedTextLengths

@shlomi-noach / @deepthi I felt Lengths is a bit awkward so I've gone with QueryTextCharactersProcessed which I think is long but very clear. I've also added this to the 20.0.0 changelog

Rerequested your review 🙇

timvaillancourt avatar Jun 03 '24 22:06 timvaillancourt

Some minor wording changes to release notes are required to avoid user confusion. Rest LGTM. @shlomi-noach you may dismiss my review once these changes are done.

deepthi avatar Jun 03 '24 23:06 deepthi