PT-2080 Add JSON logging to pt-kill
Problem
Currently, `pt-kill` provides two logging options:
- Plain text output to stdout/file (source):
# 2009-07-15T15:04:01 KILL 8 (Query 42 sec) SELECT * FROM huge_table
- To a table in a central database (source):
CREATE TABLE kill_log (
   kill_id     int(10) unsigned NOT NULL AUTO_INCREMENT,
   server_id   bigint(4) NOT NULL DEFAULT '0',
   timestamp   DATETIME,
   reason      TEXT,
   kill_error  TEXT,
   Id          bigint(4) NOT NULL DEFAULT '0',
   User        varchar(16) NOT NULL DEFAULT '',
   Host        varchar(64) NOT NULL DEFAULT '',
   db          varchar(64) DEFAULT NULL,
   Command     varchar(16) NOT NULL DEFAULT '',
   Time        int(7) NOT NULL DEFAULT '0',
   State       varchar(64) DEFAULT NULL,
   Info        longtext,
   Time_ms     bigint(21) DEFAULT '0', # NOTE, TODO: currently not used
   PRIMARY KEY (kill_id)
) DEFAULT CHARSET=utf8
Both of these options have drawbacks:
- The plain text output includes very limited information
- The `kill_log` table can easily grow to many TB in size when a large fleet of MySQL databases feeds into it, so an otherwise fairly simple log table ends up requiring increasing amounts of effort to keep it scaling.
At a certain point, the scaling issues with the `kill_log` option make it preferable to switch to a logging solution that can ingest large volumes of data from log files on many hosts instead; however, the limited data in the plain text output is not always enough to debug why queries were killed.
The JIRA issue for this is:
- https://jira.percona.com/browse/PT-2080
Proposed Solution
I am proposing to add the ability to log killed queries in JSON format to a file on disk. This file can then be read by log collection agents to send the data to a central logging service.
This PR adds two new parameters to `pt-kill`:
- `--json` - output killed queries in JSON format; must be used in conjunction with `--print` (replaces the usual plain-text output of `--print` with JSON output)
- `--json-fields key1:value1,key2:value2` - append the specified key:value pairs to the JSON output for each killed query; can only be used when `--json` is also specified
There are no new dependencies introduced by this PR; support for generating JSON already existed in the dependencies bundled with `pt-kill`.
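For illustration, a typical invocation might look like the following. This is a sketch only: the DSN, match options, and field values are placeholders rather than part of this PR, and `--json`/`--json-fields` behave as described above.

```sh
# Hypothetical example: kill queries that have been busy for more than 60 seconds
# and print each killed query as a JSON object, tagged with this host's name.
pt-kill h=127.0.0.1,u=my_user --busy-time 60 \
  --kill --print --json \
  --json-fields hostname:$(hostname),tag:my_tag
```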
`--json`
The JSON output when using `--json` is in the following format (the actual output omits the newlines and leading whitespace shown here):
{
"Command": "Query",
"Host": "10.0.0.1:9999",
"Id": 1234,
"Info": "SELECT SLEEP(5)",
"State": "User sleep",
"Time": 10,
"User": "my_user",
"Db": "my_db",
"Kill_Error": "",
"Digest": "58A43A7DA83F58C1",
"Reason": "Exceeds busy time\nQuery matches Command spec",
"Timestamp": "2022-01-01T00:00:00"
}
This matches the fields in the `kill_log` table, but excludes the following fields:
- `kill_id` - because it is an `AUTO_INCREMENT` column and not strictly necessary
- `server_id` - because it is not easily accessible when writing JSON without querying the database to retrieve it
- `Time_ms` - because it is not currently used
It also contains an additional field:
- `Digest` - a checksum of the query with `?` in place of values (the same as the `--query-id` output)
`--json-fields`
The value specified for the `--json-fields` parameter should be one or more comma-separated key:value pairs (e.g. `--json-fields hostname:$(hostname),tag:my_tag`). These will be appended to the JSON output for each killed query:
{
"Command": "Query",
"Host": "10.0.0.1:9999",
"Id": 1234,
"Info": "SELECT SLEEP(5)",
"State": "User sleep",
"Time": 10,
"User": "my_user",
"Db": "my_db",
"Kill_Error": "",
"Digest": "58A43A7DA83F58C1",
"Reason": "Exceeds busy time\nQuery matches Command spec",
"Timestamp": "2022-01-01T00:00:00",
"hostname": "my_host",
"tag": "my_tag"
}
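Because each killed query is emitted as a self-contained JSON object, the appended fields can be used downstream to filter or route log lines. The pipeline below is a hypothetical example, not part of this PR; it assumes `jq` is installed and that the compact single-line output is being consumed directly.

```sh
# Keep only entries tagged "my_tag" and project a few fields of interest.
pt-kill h=127.0.0.1 --busy-time 60 --kill --print --json \
  --json-fields hostname:$(hostname),tag:my_tag \
  | jq -c 'select(.tag == "my_tag") | {Id, User, Db, Reason, hostname}'
```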
Limitations
- Newlines in killed queries are shown in the JSON output as the literal string `\n` instead of actual newline characters.
Hi. We really appreciate the contribution and I personally think this is very useful. Could you please check the tests? Some of them are failing. If you need assistance to start the sandbox and run the tests, just drop me a message. Regards
Tests in the 3.x branch
ok
All tests successful.
Files=10, Tests=80, 73 wallclock secs ( 0.03 usr 0.02 sys + 3.26 cusr 1.03 csys = 4.34 CPU)
Result: PASS
Tests in this branch
prove -vw t/pt-kill/*
t/pt-kill/basics.t ............
1..6
ok 1 - There were 2 to 5 captures
ok 2 - Approximately 9 or 10 captures with --iterations 0
ok 3 - --verbose
ok 4 - kill-mysql-process says it reconnected
ok 5 - pt-kill still checked the processlist 4 times
ok 6 - Sandbox servers
ok
t/pt-kill/execute_command.t ...
1..9
ok 1 - No output without --print
ok 2 - --execute-command
ok 3 - --print with --execute-command
ok 4 - --execute-command (online)
not ok 5 - Executed zombie command
# Failed test 'Executed zombie command'
# at t/pt-kill/execute_command.t line 86.
# ''
# doesn't match '(?^:Executed echo zombie)'
ok 6 - No zombies
ok 7 - pt-kill stopped
ok 8 - No zombies
ok 9 - Sandbox servers
# Looks like you failed 1 test of 9.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/9 subtests
t/pt-kill/filter_plugin.t .....
1..1
ok 1 - basic --filter function works
ok
t/pt-kill/group_by.t ..........
1..9
ok 1 - Kill all but oldest
ok 2 - Kill all but oldest, matching specific user
ok 3 - Any busy time doesn't match
ok 4 - Any busy time matches
ok 5 - Each busy time doesn't match
ok 6 - Each busy time matches
ok 7 - Query count doesn't match
ok 8 - Queries don't match unless comments are stripped
ok 9 - --filter and custom --group-by
ok
t/pt-kill/group_queries.t .....
1..3
ok 1 - Group by Info
ok 2 - Group by Info with similar fingerprints
ok 3 - Group by fingerprint
ok
t/pt-kill/kill.t ..............
ok 1 - Got proc id of sleeping query
ok 2 - --kill
ok 3 - Killed connection
ok 4 - Got proc id of sleeping query
ok 5 - --kill-query
ok 6 - Killed query, not connection
ok 7 - Connection is still alive
ok 8 - Sandbox servers
1..8
ok
t/pt-kill/log_dsn.t ...........
ok 1 - --log-dsn croaks if missing D and t
ok 2 - --log-dsn croaks if missing D
ok 3 - --log-dsn croaks if missing t
not ok 4 - --log-dsn with existing log table, no error
# Failed test '--log-dsn with existing log table, no error'
# at t/pt-kill/log_dsn.t line 106.
# got: 'Use of uninitialized value $query in pattern match (m//) at /home/karl/go/src/github.com/percona/percona-toolkit/bin/pt-kill line 4897.
# '
# expected: ''
not ok 5 - ... which contains one row
# Failed test '... which contains one row'
# at t/pt-kill/log_dsn.t line 134.
# got: '0'
# expected: '1'
# $VAR1 = [];
not ok 6 - ... got the target db
# Failed test '... got the target db'
# at t/pt-kill/log_dsn.t line 142.
# got: undef
# expected: 'x1721659'
# $VAR1 = [
# {}
# ];
not ok 7 - ... on the correct server
# Failed test '... on the correct server'
# at t/pt-kill/log_dsn.t line 148.
# got: undef
# expected: '12345'
# $VAR1 = [
# {}
# ];
not ok 8 - ... correct kill reason
# Failed test '... correct kill reason'
# at t/pt-kill/log_dsn.t line 154.
# got: undef
# expected: 'Query matches db spec'
# $VAR1 = [
# {}
# ];
not ok 9 - ... timestamp is correct (bug 1086259)
# Failed test '... timestamp is correct (bug 1086259)'
# at t/pt-kill/log_dsn.t line 167.
# undef
# doesn't match '(?^:\A2022\-08\-27\ 19\:.{5}\Z)'
Use of uninitialized value $trimmed_result{"host"} in substitution (s///) at t/pt-kill/log_dsn.t line 181.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 2 just after 9.
Dubious, test returned 2 (wstat 512, 0x200)
Failed 6/9 subtests
t/pt-kill/match.t .............
1..16
ok 1 - --match-info
ok 2 - Ignore State=Locked by default
ok 3 - Can override default ignore State=Locked with --ignore-state ''
ok 4 - --match-state
ok 5 - --victims all
ok 6 - --ignore-user
ok 7 - --busy-time
ok 8 - --busy-time but no query is busy enough
Use of uninitialized value $query in pattern match (m//) at /home/karl/go/src/github.com/percona/percona-toolkit/bin/pt-kill line 4897, <$fh> line 1.
not ok 9 - --idle-time
# Failed test '--idle-time'
# at t/pt-kill/match.t line 102.
# ''
# doesn't match '(?^:KILL 29392005 \(Sleep 17 sec\) NULL)'
ok 10 - --match-state Locked --ignore-state '' --busy-time 5
ok 11 - --match-all except ignored
ok 12 - --match-all
ok 13 - --query-id
ok 14 - Doesn't match replication threads by default
Use of uninitialized value $query in pattern match (m//) at /home/karl/go/src/github.com/percona/percona-toolkit/bin/pt-kill line 4897.
not ok 15 - --replication-threads allows matching replication thread
# Failed test '--replication-threads allows matching replication thread'
# at t/pt-kill/match.t line 174.
# ''
# doesn't match '(?^:KILL (?:9|10))'
ERROR: Databases are left on master: kill_test, x1721659
ERROR: Databases are left on slave1: kill_test, x1721659
ERROR: Databases are left on slave2: kill_test, x1721659
not ok 16 - Sandbox servers
# Failed test 'Sandbox servers'
# at t/pt-kill/match.t line 186.
Bailout called. Further testing stopped: t/pt-kill/match.t broke the sandbox
Bail out! t/pt-kill/match.t broke the sandbox
FAILED--Further testing stopped: t/pt-kill/match.t broke the sandbox
:wave: Hi @percona-csalguero, thanks for taking a look 🙇 Apologies for the broken tests, I had completely missed the sandbox testing docs when I created the PR!
I've fixed the tests, and made a few other changes to the PR:
1. `--json` parameter must now be used in conjunction with `--print`; JSON output replaces the normal `--print` output when `--json` is also specified
2. `--json` output is now sent to STDOUT instead of a specified file, as its new behaviour is to change the output format of `--print` instead of logging to a file
3. Renamed the `--log-json` and `--log-json-fields` parameters to `--json` and `--json-fields` to better represent their new behaviour
4. Refactored the checksum code to prevent test failures when `$query->{'Info'}` is empty
5. Added a test for the `--json` and `--json-fields` parameters
6. Fixed a bug where specifying `--query-id` would cause errors when `$query->{'Info'}` is empty
The reason for the changes covered by 1, 2, and 3 above is that our logging system is set up to pull log data from systemd services, and we rely on systemd reading STDOUT for each service to gather the log data. The upside of this change is that people using `--json` now have a choice: they can either capture the log data from STDOUT, or pipe all of the `pt-kill` output to a log file to get the JSON output that way.
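As a sketch of the second option (the DSN, options, and path here are placeholders, not something this PR ships): outside of systemd, the same STDOUT stream can simply be redirected to a file for a log shipper to tail.

```sh
# Hypothetical standalone usage: append the JSON log lines to a file that a
# log collection agent is already configured to watch.
pt-kill h=127.0.0.1,u=my_user --busy-time 60 --kill --print --json \
  --json-fields hostname:$(hostname) >> /var/log/pt-kill.json
```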
:wave: @percona-csalguero it looks like your review comments are for a previous commit (I think from before the changes I made in ceb8d97?) as the sandbox tests are all passing now 😄
Hi, Thanks for fixing the tests. I already approved the latest changes.
Regards
Thank you! 🙇 Do you know when it is likely to be merged, and when there will be a new release? 🙏