Refactor hb report
Overview
Customers have complained about hb_report's performance, and there are existing bugs to fix, so I have created this PR. In general, the changes include:
- Faster
- More robust (100% unit test coverage; newly added functional tests)
- -b option
- Leveled debug messages
- Fixes for existing bugs
- Updated help and man page
RPM link
If you are interested, RPMs are available for 12sp4-15sp1 and for Tumbleweed. Any suggestions and feedback are welcome!
Changes
Faster
- Original
time hb_report
real 0m11.038s
user 0m4.669s
sys 0m1.095s
- New
time hb_report
real 0m4.551s
user 0m2.662s
sys 0m0.533s
More robust
More than 4200 lines in this PR are for testing.
Unit tests: 100% test coverage (the original coverage was very low)
/app/hb_report/__init__.py 0 0 100%
/app/hb_report/collect.py 343 0 100%
/app/hb_report/const.py 34 0 100%
/app/hb_report/core.py 588 0 100%
/app/hb_report/utils.py 248 0 100%
Functional tests (newly added):
- hb_report_normal, which includes these scenarios:
  - Run hb_report on an empty environment
  - Verify log file filtering by time span
  - Verify hb_report options
- hb_report_bugs, bug regression tests covering bsc#1148874, 1148873, 1067456, 1077553, 1106052, 1135696, 1130715, 1163581
-b option
How far back from now to start collecting ([1-9][0-9]*[YmdHM]). Examples:
# collect from 1 month ago
hb_report -b 1m
# collect from 12 days ago
hb_report -b 12d
# collect from 75 hours ago
hb_report -b 75H
# collect from 10 minutes ago
hb_report -b 10M
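As an illustration only, a value of this form could be turned into a start time roughly as follows. The helper name `parse_before` is hypothetical, and the unit lengths are assumptions: the description above does not say how months and years are counted, so this sketch uses fixed 30-day months and 365-day years.

```python
import re
from datetime import datetime, timedelta

# Assumed unit lengths in seconds (30-day month, 365-day year).
UNIT_SECONDS = {
    "Y": 365 * 24 * 3600,
    "m": 30 * 24 * 3600,
    "d": 24 * 3600,
    "H": 3600,
    "M": 60,
}

# Same pattern as shown for the -b option.
BEFORE_RE = re.compile(r"([1-9][0-9]*)([YmdHM])")


def parse_before(value, now=None):
    """Return the start time for a -b style value such as '12d'."""
    match = BEFORE_RE.fullmatch(value)
    if match is None:
        raise ValueError("invalid -b value: %r" % value)
    count, unit = int(match.group(1)), match.group(2)
    if now is None:
        now = datetime.now()
    return now - timedelta(seconds=count * UNIT_SECONDS[unit])
```

Passing `now` explicitly keeps the function easy to test; a value such as `0d` or `5x` is rejected with `ValueError`.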
Leveled debug messages
-v: show basic debug messages indicating which files have been collected and where they are stored
-vv: in addition to the above, show internal debug messages
# show basic debug messages
hb_report -v
DEBUG: node1#Collector: Dump ha-log.txt into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump journal.log into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump /etc/drbd.conf into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump /etc/drbd.d into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump /etc/booth/booth.conf into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump permissions info into hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump time info into hb_report-Fri-28-Feb-2020/node1/time.txt
DEBUG: node1#Collector: Dump logset ['/var/log/ha-cluster-bootstrap.log'] into hb_report-Fri-28-Feb-2020/node1/ha-cluster-bootstrap.log
DEBUG: node1#Collector: Dump events file into hb_report-Fri-28-Feb-2020/node1/events.txt
DEBUG: node1#Collector: Dump DLM info into hb_report-Fri-28-Feb-2020/node1/dlm_dump.txt
DEBUG: node1#Collector: Dump SBD config into hb_report-Fri-28-Feb-2020/node1/sbd
DEBUG: node1#Collector: Dump logset ['/var/log/pacemaker/pacemaker.log'] into hb_report-Fri-28-Feb-2020/node1/pacemaker.log
DEBUG: node1#Collector: Dump OCFS2 info into hb_report-Fri-28-Feb-2020/node1/ocfs2.txt
DEBUG: node1#Collector: Dump system stats into hb_report-Fri-28-Feb-2020/node1/sysstats.txt
DEBUG: node1#Collector: Dump corosync config into hb_report-Fri-28-Feb-2020/node1/corosync.conf
DEBUG: node1#Collector: Dump crm_mon output into hb_report-Fri-28-Feb-2020/node1/crm_mon.txt
DEBUG: node1#Collector: Dump cib xml into hb_report-Fri-28-Feb-2020/node1/cib.xml
DEBUG: node1#Collector: Dump members of this partition into hb_report-Fri-28-Feb-2020/node1/members.txt
DEBUG: node1#Collector: Cluster service is running, touch "RUNNING" file at hb_report-Fri-28-Feb-2020/node1
DEBUG: node1#Collector: Dump cib config into hb_report-Fri-28-Feb-2020/node1/cib.txt
DEBUG: node1#Collector: Dump packages version and system info into hb_report-Fri-28-Feb-2020/node1/sysinfo.txt
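A minimal sketch of how leveled messages like these can be driven by a repeated -v flag, using argparse's `count` action. This is not the actual hb_report code; the program name `hb_report_sketch` and the helper `format_debug` are made up for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="hb_report_sketch")
    # Each -v raises the level by one: no flag -> 0, -v -> 1, -vv -> 2.
    parser.add_argument("-v", dest="verbosity", action="count", default=0)
    return parser


def format_debug(message, level, verbosity):
    """Return the formatted line if it should be shown, else None.

    level 1 stands for basic messages, level 2 for internal ones.
    """
    if verbosity >= level:
        return "DEBUG: %s" % message
    return None
```

With this layout, a basic message is emitted at level 1 and an internal one at level 2, so -v shows only the former and -vv shows both.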
Bug fixes
Replace sensitive info in collected report (bsc#1163581)
By default, hb_report does not replace sensitive words; it only gives warnings unless the user passes the "-s" option. By default, content in the CIB, PE and pacemaker.log files that matches "passw.*" is considered sensitive data. Use the "-p" option to add more patterns to match sensitive info.
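To make the idea concrete, here is a rough sketch of pattern-based masking on a single CIB-style XML line. It is an assumption-laden illustration, not the module's real logic: the function name `sanitize_line`, the `name="..."`/`value="..."` attribute layout it looks for, and the `******` mask are all chosen for the example.

```python
import re


def sanitize_line(line, patterns=("passw.*",)):
    """Mask the value attribute when the name attribute matches a sensitive pattern."""
    name = re.search(r'\sname="([^"]*)"', line)
    if name is None:
        return line
    # re.match anchors at the start of the name, so "passw.*" matches "password".
    if not any(re.match(pattern, name.group(1)) for pattern in patterns):
        return line
    return re.sub(r'value="[^"]*"', 'value="******"', line)
```

Extra patterns, in the spirit of the "-p" option, would simply be appended to `patterns`.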
Collect corosync's log correctly (bsc#1148874)
Try to get the value of 'logfile' from corosync.conf; if that file exists and falls within the hb_report time span, collect it.
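The lookup of the 'logfile' value could be sketched as below. This works on the text of corosync.conf with a plain regular expression; the real module presumably goes through crmsh's corosync helpers, so treat `find_corosync_logfile` as a hypothetical stand-in.

```python
import re

# Matches a line such as "    logfile: /path/to/file"; commented-out lines
# and keys like "to_logfile" do not match.
LOGFILE_RE = re.compile(r"^\s*logfile\s*:\s*(\S+)\s*$", re.MULTILINE)


def find_corosync_logfile(conf_text):
    """Return the 'logfile' value from corosync.conf text, or None if it is not set."""
    match = LOGFILE_RE.search(conf_text)
    return match.group(1) if match else None
```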
Collect /var/log/messages correctly (bsc#1148873)
If this file exists and falls within the time span, collect it.
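One simple way to express an "exists and is in the time span" check is sketched here. It is only an approximation based on the file's modification time; how the module really decides (for example by reading timestamps inside the log) is not described above, and `should_collect` is a hypothetical name.

```python
import os


def should_collect(path, from_ts):
    """Return True if path is a file that was modified at or after from_ts.

    from_ts is a Unix timestamp marking the start of the report time span.
    A file untouched since before the span cannot contain lines from it.
    """
    if not os.path.isfile(path):
        return False
    return os.path.getmtime(path) >= from_ts
```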
Updated help info and man page
hb_report -h
usage: hb_report [options] [dest]
positional arguments:
dest Report name (may include path where to store the report)
optional arguments:
-h, --help Show this help message and exit
-f time Time to start from (default: 12 hours before)
-t time Time to finish at (default: now)
-b time How long time in the past, before now ([1-9][0-9]*[YmdHM])
-d Don't compress, but leave result in a directory
-n node Node names for this cluster; this option is additive (use -n a
-n b or -n "a b"); if you run report on the loghost or use
autojoin, it is highly recommended to set this option
-u user SSH user to access other nodes
-X ssh-options Extra ssh(1) options (default: StrictHostKeyChecking=no
EscapeChar=none ConnectTimeout=15); this option is additive
(use -X opt1 -X opt2 or -X "opt1 opt2")
-E file Extra logs to collect (default: /var/log/messages,
/var/log/ha-cluster-bootstrap.log); this option is additive
(use -E file1 -E file2 or -E "file1 file2")
-s Replace sensitive info in PE or CIB or pacemaker log files
-p patt Regular expression to match variables containing sensitive
data (default: passw.*); this option is additive (use -p patt1
-p patt2 or -p "patt1 patt2")
-L patt Regular expression to match in log files for analysis
(default: CRIT:, ERROR:, error:, warning:, crit:); this option
is additive (use -L patt1 -L patt2 or -L "patt1 patt2")
-Q The quick mode, which skips producing dot files from PE
inputs, verifying installed cluster stack rpms and sanitizing
files for sensitive information
-M Don't collect extra logs, opposite option of -E
-Z If destination directories exist, remove them instead of
exiting
-S Single node operation; don't try to start report collectors on
other nodes
-v Increase verbosity
Other changes
Exclusive options
The option pairs -f and -b, -t and -b, -n and -S, -E and -M, and -s and -Q are mutually exclusive and cannot be used at the same time.
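A sketch of how such pairwise exclusion can be checked after parsing. The parser below only declares the nine options involved, and `find_conflicts` is a hypothetical helper, not the function used in this PR.

```python
import argparse

# The exclusive pairs listed above.
EXCLUSIVE_PAIRS = [("f", "b"), ("t", "b"), ("n", "S"), ("E", "M"), ("s", "Q")]


def build_parser():
    parser = argparse.ArgumentParser(prog="hb_report_sketch")
    for name in ("f", "t", "b"):
        parser.add_argument("-" + name, dest=name, metavar="time")
    for name in ("n", "E"):
        parser.add_argument("-" + name, dest=name, action="append")
    for name in ("S", "M", "s", "Q"):
        parser.add_argument("-" + name, dest=name, action="store_true")
    return parser


def find_conflicts(args):
    """Return the exclusive pairs for which both options were given."""
    given = {name for name, value in vars(args).items() if value}
    return [pair for pair in EXCLUSIVE_PAIRS if set(pair) <= given]
```

A manual check is used here rather than `add_mutually_exclusive_group` because -b takes part in two pairs while -f and -t remain usable together, which a single argparse group cannot express. A caller would typically pass any conflict to `parser.error(...)`.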
Dropped options
- -l option: originally used to specify the HA log. The new hb_report module collects pacemaker, corosync and sbd logs from the journal log directly, finds the pacemaker log from /etc/sysconfig/pacemaker, and finds the corosync log from /etc/corosync/corosync.conf, so there is no need to keep an option for specifying an HA log. Users can include any file or log they want by using option -E.
- -e option: originally chose the editor used to edit the report description. I don't think this option is useful or worth maintaining; customers send the tarball of the report, and that's enough.
- -D option: originally meant "don't invoke an editor to write the description". This is the opposite of -e, so it is dropped as well.
- -A option: originally for OpenAIS clusters. Obviously outdated.
For reviewer
First, many thanks for helping with the review! This PR is huge, so let me introduce the related commits and the purpose of each source file.
Commits
- Low: hb_report: Refactor hb_report module (source code of the hb_report module)
- Dev: behave: Functional test for hb_report module (functional test code using python-behave)
- Dev: unittest: unit test for hb_report module (unit test code)
- Dev: doc: Update doc for hb_report module (includes help info and man page)
Source files
- hb_report/collect.py: functions for collecting many kinds of logs and information
- hb_report/const.py: constant definitions
- hb_report/core.py: functions used in the main workflow
- hb_report/hb_report: entry point
- hb_report/utils.py: utilities and tools used by the whole module
Related changes on crmsh's part
- Low: utils: don't convert bytes to ASCII if needed
  In some scenarios, such as transferring data internally from an hb_report collector to the master process, there is no need to convert bytes immediately.
- Low: msg: make info/warning/debug message to stdout
  Originally, all crmsh messages were dumped to stderr. I think it's better to split them: info/warning/debug messages go to stdout, while error/fatal messages go to stderr. Because this commit changes the content of the output, the test cases need adjusting, hence commit c5949110.
- Low: tmpfiles: add time option to change file time attributes
  The new hb_report uses crmsh.tmpfiles to create and manage temp files. Add a time option to change the time attributes (access/modify time) of temp files, which makes this function more useful and efficient.
- Low: utils: add quiet option to disable error output
  Add a quiet option to crmsh.utils.parse_time and crmsh.utils.parse_to_timestamp that disables exception output and just returns None.
- Low: corosync: get value return None when corosync.conf not exist
  Many places call functions like corosync.get_*, so it's safer to check whether corosync.conf exists; otherwise these calls raise an exception.
- Low: config: remove pacemaker.log from collect_extra_logs
  Pacemaker's log is an important log and should not be regarded as an extra log; otherwise, if the user passes the -M option, pacemaker's log will not be collected. The new hb_report module collects pacemaker's log from the journal log directly, and finds the pacemaker log from /etc/sysconfig/pacemaker.
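The stdout/stderr split described for the msg commit can be illustrated with the standard logging module. This is a generic sketch of the routing idea, not crmsh's msg code; `MaxLevelFilter`, `build_logger` and the logger name are invented for the example.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records at or below the given level."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def build_logger(name="msg_sketch", out=None, err=None):
    """Route debug/info/warning to out (stdout) and error/critical to err (stderr)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    formatter = logging.Formatter("%(levelname)s: %(message)s")

    low = logging.StreamHandler(out if out is not None else sys.stdout)
    low.addFilter(MaxLevelFilter(logging.WARNING))
    low.setFormatter(formatter)

    high = logging.StreamHandler(err if err is not None else sys.stderr)
    high.setLevel(logging.ERROR)
    high.setFormatter(formatter)

    logger.addHandler(low)
    logger.addHandler(high)
    return logger
```

The `out` and `err` parameters exist only so the streams can be swapped for in-memory buffers when testing.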
You may want to add bsc#1176441 into the bug list for tracking too.