
Refactor hb report

Open liangxin1300 opened this issue 6 years ago • 1 comments

Overview

This PR addresses customer complaints about performance and fixes existing bugs. In general, the changes include:

  • Faster
  • More robust (100% unit test coverage; newly added functional tests)
  • -b option
  • Leveled debug messages
  • Fixes for existing bugs
  • Updated help and man page

RPM link

If you are interested, here are the RPM links: for 12sp4-15sp1, for Tumbleweed. Any suggestions and feedback are welcome!

Changes

Faster

  • Original
time hb_report 
real	0m11.038s
user	0m4.669s
sys	0m1.095s
  • New
time hb_report 
real	0m4.551s
user	0m2.662s
sys	0m0.533s

More robust

More than 4200 lines in this PR are for testing.

Unit tests: 100% test coverage (the original coverage was very low)

/app/hb_report/__init__.py                                   0      0   100%
/app/hb_report/collect.py                                  343      0   100%
/app/hb_report/const.py                                     34      0   100%
/app/hb_report/core.py                                     588      0   100%
/app/hb_report/utils.py                                    248      0   100%

Functional tests (newly added):

hb_report_normal, which includes these scenarios:

  • Run hb_report on an empty environment
  • Verify log file filtering by time span
  • Verify hb_report options

hb_report_bugs, for bug regression, which covers bsc#1148874, 1148873, 1067456, 1077553, 1106052, 1135696, 1130715, 1163581

-b option

How far back from now to start collecting, in the format [1-9][0-9]*[YmdHM]. Examples:

  # collect from 1 month ago
  hb_report -b 1m
  # collect from 12 days ago
  hb_report -b 12d
  # collect from 75 hours ago
  hb_report -b 75H
  # collect from 10 minutes ago
  hb_report -b 10M
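
As a rough illustration of how a -b value could be turned into a start time, here is a minimal Python sketch. It is not the code from this PR: the function name `parse_before` is hypothetical, and treating a month as 30 days and a year as 365 days is an assumption made only for this example.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit. A 30-day month and a 365-day year are assumptions of
# this sketch, not necessarily what hb_report does.
UNIT_SECONDS = {
    "Y": 365 * 24 * 3600,
    "m": 30 * 24 * 3600,
    "d": 24 * 3600,
    "H": 3600,
    "M": 60,
}

# Same format as documented for -b: [1-9][0-9]*[YmdHM]
BEFORE_RE = re.compile(r"([1-9][0-9]*)([YmdHM])")


def parse_before(value, now=None):
    """Return the start time for a -b style value such as '12d' or '75H'."""
    match = BEFORE_RE.fullmatch(value)
    if match is None:
        raise ValueError("invalid -b value: %r" % value)
    count, unit = int(match.group(1)), match.group(2)
    now = now or datetime.now()
    return now - timedelta(seconds=count * UNIT_SECONDS[unit])
```

Note that the unit letters are case sensitive: "m" is months while "M" is minutes.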

Leveled debug messages

  • -v: show basic debug messages that indicate which files have been collected and where they are stored
  • -vv: in addition to the above, show internal debug messages

# show basic debug messages
hb_report -v
DEBUG: node1#Collector: Dump ha-log.txt into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump journal.log into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump /etc/drbd.conf into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump /etc/drbd.d into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump /etc/booth/booth.conf into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump permissions info into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump time info into hb_report-Fri-28-Feb-2020/node1/time.txt
DEBUG: node1#Collector: Dump logset ['/var/log/ha-cluster-bootstrap.log'] into hb_report-Fri-28-Feb-2020/node1/ha-cluster-bootstrap.log
DEBUG: node1#Collector: Dump events file into hb_report-Fri-28-Feb-2020/node1/events.txt
DEBUG: node1#Collector: Dump DLM info into hb_report-Fri-28-Feb-2020/node1/dlm_dump.txt
DEBUG: node1#Collector: Dump SBD config into hb_report-Fri-28-Feb-2020/node1/sbd
DEBUG: node1#Collector: Dump logset ['/var/log/pacemaker/pacemaker.log'] into hb_report-Fri-28-Feb-2020/node1/pacemaker.log
DEBUG: node1#Collector: Dump OCFS2 info into hb_report-Fri-28-Feb-2020/node1/ocfs2.txt
DEBUG: node1#Collector: Dump system stats into hb_report-Fri-28-Feb-2020/node1/sysstats.txt
DEBUG: node1#Collector: Dump corosync config into hb_report-Fri-28-Feb-2020/node1/corosync.conf
DEBUG: node1#Collector: Dump crm_mon output into hb_report-Fri-28-Feb-2020/node1/crm_mon.txt
DEBUG: node1#Collector: Dump cib xml into hb_report-Fri-28-Feb-2020/node1/cib.xml
DEBUG: node1#Collector: Dump members of this partition into hb_report-Fri-28-Feb-2020/node1/members.txt
DEBUG: node1#Collector: Cluster service is running, touch "RUNNING" file at hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump cib config into hb_report-Fri-28-Feb-2020/node1/cib.txt
DEBUG: node1#Collector: Dump packages version and system info into hb_report-Fri-28-Feb-2020/node1/sysinfo.txt
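
The -v / -vv levels can be sketched with argparse's counting action. This is only an illustration under assumptions: `build_parser` and `format_debug` are hypothetical names, and the actual PR may implement levels differently.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="hb_report")
    # Each -v increments the counter: no flag -> 0, -v -> 1, -vv -> 2.
    parser.add_argument("-v", dest="verbosity", action="count", default=0,
                        help="Increase verbosity")
    return parser


def format_debug(verbosity, level, message):
    """Return the debug line if `level` is enabled, otherwise None."""
    if verbosity >= level:
        return "DEBUG: %s" % message
    return None
```

With this layout, basic messages would be emitted at level 1 and internal messages at level 2, so -vv shows both.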

Bug fixes

Replace sensitive info in collected report (bsc#1163581)

By default, hb_report does not replace sensitive words and only prints a warning, unless the user passes the "-s" option. By default, content in the CIB, PE and pacemaker.log files that matches "passw.*" is treated as sensitive data. Use option "-p" to add more patterns to match sensitive info.
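
To make the idea concrete, here is a minimal Python sketch of pattern-based masking. It is not the PR's implementation. It assumes that sensitive values appear as `name="..." value="..."` attribute pairs (as in CIB nvpair elements), that a pattern may match anywhere in the name, and that the replacement text is `******`; the function name `mask_sensitive` is hypothetical.

```python
import re

# Matches name="..." value="..." attribute pairs (assumed CIB-like layout).
NVPAIR_RE = re.compile(r'name="([^"]*)"(\s+)value="([^"]*)"')


def mask_sensitive(text, patterns=("passw.*",)):
    """Mask the value of every pair whose name matches one of the patterns."""
    compiled = [re.compile(p) for p in patterns]

    def repl(match):
        name = match.group(1)
        # Assumption: a pattern may match anywhere in the name.
        if any(p.search(name) for p in compiled):
            return 'name="%s"%svalue="******"' % (name, match.group(2))
        return match.group(0)

    return NVPAIR_RE.sub(repl, text)
```

Extra patterns, as with "-p", would simply be appended to the `patterns` argument, for example `mask_sensitive(text, ["passw.*", "token"])`.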

Collect corosync's log correctly (bsc#1148874)

Try to get the value of 'logfile' from corosync.conf; if that file exists and falls within the hb_report time span, collect it.
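
Looking up the 'logfile' value can be sketched as below. This is an illustrative Python snippet rather than the PR's code: `get_corosync_logfile` is a hypothetical name, it works on the text of corosync.conf instead of reading the file, and the time span check is left out.

```python
import re

# Matches a line such as "    logfile: /var/log/cluster/corosync.log".
# "to_logfile: yes" is not matched because the key must start the line.
LOGFILE_RE = re.compile(r"^\s*logfile:\s*(\S+)\s*$", re.MULTILINE)


def get_corosync_logfile(conf_text):
    """Return the first 'logfile' value in corosync.conf text, or None."""
    match = LOGFILE_RE.search(conf_text)
    return match.group(1) if match else None
```

Returning None when no 'logfile' entry is present lets the caller skip collection without raising an exception.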

Collect /var/log/messages correctly (bsc#1148873)

If this file exists and falls within the time span, collect it.

Updated help info and man page

hb_report -h
usage: hb_report [options] [dest]

positional arguments:
  dest            Report name (may include path where to store the report)

optional arguments:
  -h, --help      Show this help message and exit
  -f time         Time to start from (default: 12 hours before)
  -t time         Time to finish at (default: now)
  -b time         How long time in the past, before now ([1-9][0-9]*[YmdHM])
  -d              Don't compress, but leave result in a directory
  -n node         Node names for this cluster; this option is additive (use -n a
                  -n b or -n "a b"); if you run report on the loghost or use
                  autojoin, it is highly recommended to set this option
  -u user         SSH user to access other nodes
  -X ssh-options  Extra ssh(1) options (default: StrictHostKeyChecking=no
                  EscapeChar=none ConnectTimeout=15); this option is additive
                  (use -X opt1 -X opt2 or -X "opt1 opt2")
  -E file         Extra logs to collect (default: /var/log/messages,
                  /var/log/ha-cluster-bootstrap.log); this option is additive
                  (use -E file1 -E file2 or -E "file1 file2")
  -s              Replace sensitive info in PE or CIB or pacemaker log files
  -p patt         Regular expression to match variables containing sensitive
                  data (default: passw.*); this option is additive (use -p patt1
                  -p patt2 or -p "patt1 patt2")
  -L patt         Regular expression to match in log files for analysis
                  (default: CRIT:, ERROR:, error:, warning:, crit:); this option
                  is additive (use -L patt1 -L patt2 or -L "patt1 patt2")
  -Q              The quick mode, which skips producing dot files from PE
                  inputs, verifying installed cluster stack rpms and sanitizing
                  files for sensitive information
  -M              Don't collect extra logs, opposite option of -E
  -Z              If destination directories exist, remove them instead of
                  exiting
  -S              Single node operation; don't try to start report collectors on
                  other nodes
  -v              Increase verbosity

Other changes

Exclusive options

The following pairs of options are mutually exclusive and cannot be used at the same time: -f and -b, -t and -b, -n and -S, -E and -M, -s and -Q.
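
A check for these pairs could look like the following Python sketch. It is only an illustration: `find_conflicts` is a hypothetical helper, and the PR may enforce the rule differently.

```python
# The exclusive pairs listed above.
EXCLUSIVE_PAIRS = [
    ("-f", "-b"),
    ("-t", "-b"),
    ("-n", "-S"),
    ("-E", "-M"),
    ("-s", "-Q"),
]


def find_conflicts(given):
    """Return the exclusive pairs whose two options are both in `given`."""
    given = set(given)
    return [pair for pair in EXCLUSIVE_PAIRS if given.issuperset(pair)]
```

For example, `find_conflicts(["-f", "-b", "-v"])` reports the `-f`/`-b` pair, and the caller can then print an error and exit.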

Dropped options

  • -l option: originally used to specify the HA log. The new hb_report module collects pacemaker, corosync and sbd logs directly from the journal, finds the pacemaker log via /etc/sysconfig/pacemaker and the corosync log via /etc/corosync/corosync.conf, so an option to specify an HA log is no longer needed. Users can include any file or log they want with option -E.
  • -e option: originally chose the editor used to edit the report description. I don't think this option is useful or worth maintaining; customers send the report tarball, and that is enough.
  • -D option: originally meant "don't invoke an editor to write the description". It is the opposite of -e, so it is dropped as well.
  • -A option: originally for OpenAIS clusters. Obviously outdated.

For reviewer

First of all, many thanks for helping with the review! This PR is huge, so let me introduce the related commits and the purpose of each source file.

Commits

  • Low: hb_report: Refactor hb_report module
    Source code of the hb_report module
  • Dev: behave: Functional test for hb_report module
    Functional test code using python-behave
  • Dev: unittest: unit test for hb_report module
    Unit test code
  • Dev: doc: Update doc for hb_report module
    Includes help info and man page

Source files

  • hb_report/collect.py: functions for collecting many kinds of logs and information
  • hb_report/const.py: constant definitions
  • hb_report/core.py: functions used for the main workflow
  • hb_report/hb_report: entry point
  • hb_report/utils.py: utilities and tools used by the whole module

Related changes on crmsh's part

  • Low: utils: don't convert bytes to ASCII if needed
    In some scenarios, such as transferring data internally from an hb_report collector to the master process, there is no need to convert bytes immediately.
  • Low: msg: make info/warning/debug message to stdout
    Originally, all crmsh messages went to stderr. I think it's better to split them: info/warning/debug messages go to stdout, while error/fatal messages go to stderr. Because this commit changes the output, the test cases need to be adjusted, hence commit c5949110.
  • Low: tmpfiles: add time option to change file time attributes
    The new hb_report uses crmsh.tmpfiles to create and manage temp files. A time option is added to change the time attributes (access/modify time) of temp files, which makes this function more useful and efficient.
  • Low: utils: add quiet option to disable error output
    Add a quiet option to crmsh.utils.parse_time and crmsh.utils.parse_to_timestamp that disables exception output and just returns None.
  • Low: corosync: get value return None when corosync.conf not exist
    Many places call functions like corosync.get_*; it is safer to check whether corosync.conf exists first, otherwise an exception is raised.
  • Low: config: remove pacemaker.log from collect_extra_logs
    Pacemaker's log is important and should not be regarded as an extra log; otherwise, if the user passes -M, pacemaker's log is not collected. The new hb_report module collects pacemaker's log directly from the journal and finds the pacemaker log via /etc/sysconfig/pacemaker.
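
For the tmpfiles change in the list above, setting the access/modify time of a temp file can be sketched with the standard library alone. This is not the crmsh.tmpfiles code: `create_tempfile_with_time` is a hypothetical helper, and applying the same timestamp to both attributes is an assumption of this example.

```python
import os
import tempfile


def create_tempfile_with_time(timestamp):
    """Create a temp file whose access and modify times are `timestamp`."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    # os.utime takes (atime, mtime) in seconds since the epoch.
    os.utime(path, (timestamp, timestamp))
    return path
```

The caller is responsible for removing the file when it is no longer needed.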

liangxin1300 avatar Mar 28 '20 06:03 liangxin1300

You may want to add bsc#1176441 into the bug list for tracking too.

zzhou1 avatar Sep 29 '20 03:09 zzhou1