percona-toolkit icon indicating copy to clipboard operation
percona-toolkit copied to clipboard

PT-2080 Add JSON logging to pt-kill

Open dm-2 opened this issue 2 years ago • 6 comments

Problem

Currently, pt-kill provides two logging options:

  1. Plain text output to stdout/file (source):
    # 2009-07-15T15:04:01 KILL 8 (Query 42 sec) SELECT * FROM huge_table
    
  2. To a table in a central database (source):
    CREATE TABLE kill_log (
        kill_id     int(10) unsigned NOT NULL AUTO_INCREMENT,
        server_id   bigint(4) NOT NULL DEFAULT '0',
        timestamp   DATETIME,
        reason      TEXT,
        kill_error  TEXT,
        Id          bigint(4) NOT NULL DEFAULT '0',
        User        varchar(16) NOT NULL DEFAULT '',
        Host        varchar(64) NOT NULL DEFAULT '',
        db          varchar(64) DEFAULT NULL,
        Command     varchar(16) NOT NULL DEFAULT '',
        Time        int(7) NOT NULL DEFAULT '0',
        State       varchar(64) DEFAULT NULL,
        Info        longtext,
        Time_ms     bigint(21) DEFAULT '0', # NOTE, TODO: currently not used
        PRIMARY KEY (kill_id)
    ) DEFAULT CHARSET=utf8
    

Both of these options have drawbacks:

  1. The plain text output includes very limited information
  2. The kill_log table can easily grow to many TBs in size when operating a fleet of MySQL databases at scale that feed into it, resulting in an otherwise fairly simple log table requiring increasing amounts of effort to keep it scaling.

At a certain point the scaling issues with the kill_log option make it preferable to switch to a logging solution that can handle ingesting large volumes of data from log files on many hosts instead - however the limited data in the plain text output is not always enough to debug why queries were killed.

The JIRA issue for this is:

  • https://jira.percona.com/browse/PT-2080

Proposed Solution

I am proposing to add the ability to log killed queries in JSON format to a file on disk. This file can then be read by log collection agents to send the data to a central logging service.

This PR adds two new parameters to pt-kill:

  • --json - out killed queries in JSON format, must be used in conjunction with --print (replaces the usual plain-text output of --print with JSON output)
  • --json-fields key1:value1,key2:value2 - append specified key:value pairs to the JSON output for each killed query, can only be used when --json is also specified

There are no new dependencies introduced by this PR; support for generating JSON already existed in the dependencies bundled with pt-kill.

--json

The JSON output when using --json is in the following format (excluding newlines/leading whitespaces):

   {
      "Command": "Query",
      "Host": "10.0.0.1:9999",
      "Id": 1234,
      "Info": "SELECT SLEEP(5)",
      "State": "User sleep",
      "Time": 10,
      "User": "my_user",
      "Db": "my_db",
      "Kill_Error": "",
      "Digest": "58A43A7DA83F58C1",
      "Reason": "Exceeds busy time\nQuery matches Command spec",
      "Timestamp": "2022-01-01T00:00:00"
   }

This matches the fields in the kill_log table, but excludes the following fields:

  • kill_id - because it's an auto_increment column and not strictly necessary
  • server_id - as not easily accessible when writing JSON without querying database to retrieve it
  • Time_ms - as it is not currently used

It also contains an additional field:

  • Digest - checksum of the query with ? in place of values (same as --query-id output)

--json-fields

The value specified for the --json-fields parameter should be one or more comma-separated key:value pairs (e.g. --json-fields hostname:$(hostname),tag:my_tag). These will be appended to the JSON output for each killed query:

   {
      "Command": "Query",
      "Host": "10.0.0.1:9999",
      "Id": 1234,
      "Info": "SELECT SLEEP(5)",
      "State": "User sleep",
      "Time": 10,
      "User": "my_user",
      "Db": "my_db",
      "Kill_Error": "",
      "Digest": "58A43A7DA83F58C1",
      "Reason": "Exceeds busy time\nQuery matches Command spec",
      "Timestamp": "2022-01-01T00:00:00",
      "hostname": "my_host",
      "tag": "my_tag"
   }

Limitations

  • New lines in killed queries are shown in the JSON output as string \n instead of actual new line characters.

dm-2 avatar Jul 14 '22 13:07 dm-2

CLA assistant check
All committers have signed the CLA.

it-percona-cla avatar Jul 14 '22 13:07 it-percona-cla

Hi. We really appreciate the contribution and I personally think this is very useful. Could you please check the tests? Some of them are failing. If you need assistance to start the sandbox and run the tests, just drop me a message. Regards

Tests in the 3.x branch

ok
All tests successful.
Files=10, Tests=80, 73 wallclock secs ( 0.03 usr  0.02 sys +  3.26 cusr  1.03 csys =  4.34 CPU)
Result: PASS

Tests in this branch

prove -vw t/pt-kill/*
t/pt-kill/basics.t ............ 
1..6
ok 1 - There were 2 to 5 captures
ok 2 - Approximately 9 or 10 captures with --iterations 0
ok 3 - --verbose
ok 4 - kill-mysql-process says it reconnected
ok 5 - pt-kill still checked the processlist 4 times
ok 6 - Sandbox servers
ok
t/pt-kill/execute_command.t ... 
1..9
ok 1 - No output without --print
ok 2 - --execute-command
ok 3 - --print with --execute-command
ok 4 - --execute-command (online)
not ok 5 - Executed zombie command

#   Failed test 'Executed zombie command'
#   at t/pt-kill/execute_command.t line 86.
#                   ''
#     doesn't match '(?^:Executed echo zombie)'
ok 6 - No zombies
ok 7 - pt-kill stopped
ok 8 - No zombies
ok 9 - Sandbox servers
# Looks like you failed 1 test of 9.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/9 subtests 
t/pt-kill/filter_plugin.t ..... 
1..1
ok 1 - basic --filter function works
ok
t/pt-kill/group_by.t .......... 
1..9
ok 1 - Kill all but oldest
ok 2 - Kill all but oldest, matching specific user
ok 3 - Any busy time doesn't match
ok 4 - Any busy time matches
ok 5 - Each busy time doesn't match
ok 6 - Each busy time matches
ok 7 - Query count doesn't match
ok 8 - Queries don't match unless comments are stripped
ok 9 - --filter and custom --group-by
ok
t/pt-kill/group_queries.t ..... 
1..3
ok 1 - Group by Info
ok 2 - Group by Info with similar fingerprints
ok 3 - Group by fingerprint
ok
t/pt-kill/kill.t .............. 
ok 1 - Got proc id of sleeping query
ok 2 - --kill
ok 3 - Killed connection
ok 4 - Got proc id of sleeping query
ok 5 - --kill-query
ok 6 - Killed query, not connection
ok 7 - Connection is still alive
ok 8 - Sandbox servers
1..8
ok
t/pt-kill/log_dsn.t ........... 
ok 1 - --log-dsn croaks if missing D and t
ok 2 - --log-dsn croaks if missing D
ok 3 - --log-dsn croaks if missing t
not ok 4 - --log-dsn with existing log table, no error

#   Failed test '--log-dsn with existing log table, no error'
#   at t/pt-kill/log_dsn.t line 106.
#          got: 'Use of uninitialized value $query in pattern match (m//) at /home/karl/go/src/github.com/percona/percona-toolkit/bin/pt-kill line 4897.
# '
#     expected: ''
not ok 5 - ... which contains one row

#   Failed test '... which contains one row'
#   at t/pt-kill/log_dsn.t line 134.
#          got: '0'
#     expected: '1'
# $VAR1 = [];
not ok 6 - ... got the target db

#   Failed test '... got the target db'
#   at t/pt-kill/log_dsn.t line 142.
#          got: undef
#     expected: 'x1721659'
# $VAR1 = [
#   {}
# ];
not ok 7 - ... on the correct server

#   Failed test '... on the correct server'
#   at t/pt-kill/log_dsn.t line 148.
#          got: undef
#     expected: '12345'
# $VAR1 = [
#   {}
# ];
not ok 8 - ... correct kill reason

#   Failed test '... correct kill reason'
#   at t/pt-kill/log_dsn.t line 154.
#          got: undef
#     expected: 'Query matches db spec'
# $VAR1 = [
#   {}
# ];
not ok 9 - ... timestamp is correct (bug 1086259)

#   Failed test '... timestamp is correct (bug 1086259)'
#   at t/pt-kill/log_dsn.t line 167.
#                   undef
#     doesn't match '(?^:\A2022\-08\-27\ 19\:.{5}\Z)'
Use of uninitialized value $trimmed_result{"host"} in substitution (s///) at t/pt-kill/log_dsn.t line 181.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 2 just after 9.
Dubious, test returned 2 (wstat 512, 0x200)
Failed 6/9 subtests
t/pt-kill/match.t .............
1..16
ok 1 - --match-info
ok 2 - Ignore State=Locked by default
ok 3 - Can override default ignore State=Locked with --ignore-state ''
ok 4 - --match-state
ok 5 - --victims all
ok 6 - --ignore-user
ok 7 - --busy-time
ok 8 - --busy-time but no query is busy enough
Use of uninitialized value $query in pattern match (m//) at /home/karl/go/src/github.com/percona/percona-toolkit/bin/pt-kill line 4897, <$fh> line 1.
not ok 9 - --idle-time

#   Failed test '--idle-time'
#   at t/pt-kill/match.t line 102.
#                   ''
#     doesn't match '(?^:KILL 29392005 \(Sleep 17 sec\) NULL)'
ok 10 - --match-state Locked --ignore-state '' --busy-time 5
ok 11 - --match-all except ignored
ok 12 - --match-all
ok 13 - --query-id
ok 14 - Doesn't match replication threads by default
Use of uninitialized value $query in pattern match (m//) at /home/karl/go/src/github.com/percona/percona-toolkit/bin/pt-kill line 4897.
not ok 15 - --replication-threads allows matching replication thread

#   Failed test '--replication-threads allows matching replication thread'
#   at t/pt-kill/match.t line 174.
#                   ''
#     doesn't match '(?^:KILL (?:9|10))'
ERROR: Databases are left on master: kill_test, x1721659
ERROR: Databases are left on slave1: kill_test, x1721659
ERROR: Databases are left on slave2: kill_test, x1721659
not ok 16 - Sandbox servers

#   Failed test 'Sandbox servers'
#   at t/pt-kill/match.t line 186.
Bailout called.  Further testing stopped:  t/pt-kill/match.t broke the sandbox
Bail out!  t/pt-kill/match.t broke the sandbox
FAILED--Further testing stopped: t/pt-kill/match.t broke the sandbox

percona-csalguero avatar Aug 27 '22 23:08 percona-csalguero

:wave: Hi @percona-csalguero, thanks for taking a look 🙇 Apologies for the broken tests, I had completely missed the sandbox testing docs when I created the PR!

I've fixed the tests, and made a few other changes to the PR:

  1. --json parameter must now be used in conjunction with --print; JSON output replaces normal --print output when --json is also specified
  2. --json output is now sent to STDOUT instead of a specified file, as its new behaviour is to change the output format of --print instead of logging to a file
  3. Renamed --log-json and --log-json-fields parameters to --json and --json-fields to better represent their new behaviour
  4. Refactored checksum code to prevent test failures when $query->{'Info'} is empty
  5. Added test for --json and --json-fields parameters
  6. Fixed bug where specifying --query-id would cause errors when $query->{'Info'} is empty

The reason for the changes covered by 1, 2, and 3 above is that our logging system is set up to pull log data from systemd services, and we rely systemd reading STDOUT for each services to gather the log data. The upside of this change means that people using --json will now have a choice of either capturing the log data from STDOUT, or they can pipe all of the pt-kill output to a log file to get the JSON output that way.

dm-2 avatar Sep 02 '22 17:09 dm-2

:wave: @percona-csalguero it looks like your review comments are for a previous commit (I think from before the changes I made in ceb8d97?) as the sandbox tests are all passing now 😄

dm-2 avatar Sep 20 '22 15:09 dm-2

wave @percona-csalguero it looks like your review comments are for a previous commit (I think from before the changes I made in ceb8d97?) as the sandbox tests are all passing now smile

Hi, Thanks for fixing the tests. I already approved the latest changes.

Regards

percona-csalguero avatar Sep 22 '22 14:09 percona-csalguero

Thank you! 🙇 Do you know when it is likely to be merged, and when there will be a new release? 🙏

dm-2 avatar Sep 22 '22 15:09 dm-2