EPIC: Clean up Nagios so that we can use #infrastructure-bot
Currently working on the following tasks related to this issue:
Issue #2609 - PR #2638 Awaiting Review ( 28/09/2022 )
1) Create Ansible playbook to install Nagios Server Core on Ubuntu 22.04 ( a minimal sketch follows this list )
Issue #2650 - Completed ( 29/09/2022 )
2) Tidy Up Existing Nagios Processes, To Provide Base For Future Automation
Issue #2619 - Completed ( 27/06/2022 )
3) Upgrade O/S on Nagios Production Server To Ubuntu 22.04
Issue #2758 - Completed ( 29/09/2022 )
4) Align Nagios & Ansible Inventory Files, Fix Hosts In Nagios But Not Ansible
Issue #2759 - In Progress ( 29/09/2022 )
5) Align Nagios & Ansible Inventory Files, Fix Hosts In Ansible But Not Nagios
Issue #2629
6) Automate parsing of ansible inventory into Nagios inventory ( see the sketch after the lists below )
Issue #2802
7) Investigate Adding Of Windows Hosts To Nagios
Issue ##### TBC #####
8) Automate Installation & Configuration Of Nagios Nodes ( Requires 2 & 4 )
8.1) Update server related global config files ( resource.cfg )
8.2) Update server related global config files ( nagios.cfg )
8.3) Create commands.cfg file from source.
Issue ##### TBC #####
9) Automate Nagios Server Management
9.1) Manage User Configuration / Backup / Restore / Automate ? ( cgi.cfg / htpasswd.users etc)
Additional Tasks Carried Out:
10.1) Document process to update host group : https://github.com/adoptium/infrastructure/issues/2765
10.2) Document process to add additional host check : https://github.com/adoptium/infrastructure/issues/2767
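As a hedged illustration of item 1 (the actual playbook is the one in PR #2638 and may differ), a minimal Ansible sketch that installs Nagios Core from the Ubuntu 22.04 archive packages might look like the following; the `nagios_server` group name and the choice of distro packages over a source build are assumptions:

```yaml
# Minimal sketch only - the actual playbook is in PR #2638 and may differ.
# Assumes the Ubuntu 22.04 archive packages rather than a source build,
# and a hypothetical inventory group called nagios_server.
- name: Install Nagios Core on Ubuntu 22.04
  hosts: nagios_server
  become: true
  tasks:
    - name: Install Nagios Core, plugins and Apache
      ansible.builtin.apt:
        name:
          - nagios4
          - monitoring-plugins
          - apache2
        state: present
        update_cache: true

    - name: Ensure Nagios and Apache are running and enabled
      ansible.builtin.service:
        name: "{{ item }}"
        state: started
        enabled: true
      loop:
        - nagios4
        - apache2
```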
Other tasks to be considered :
- Document standard tasks provided on base install
- Identify additional monitoring tasks, and automate adding them via ansible.
- Extend support to other O/Ss besides Ubuntu 22.04
- Restoring Current Nagios Config / Cleaning Up - TBD
- Investigate Nagios User Security Model - TBD
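To make items 4-6 concrete, one possible approach (a hypothetical sketch, not the agreed design) is to render a Nagios host definition for every host in the Ansible inventory, so the two inventories cannot drift apart; the paths and the `linux-server` host template are assumptions:

```yaml
# Hypothetical sketch for items 4-6: generate Nagios host definitions
# directly from the Ansible inventory. Paths and template names are
# assumptions, not the current production layout.
- name: Generate Nagios host definitions from the Ansible inventory
  hosts: nagios_server
  become: true
  tasks:
    - name: Render one Nagios host block per inventory host
      ansible.builtin.copy:
        dest: /etc/nagios4/conf.d/ansible_hosts.cfg
        content: |
          {% for host in groups['all'] %}
          define host {
              use        linux-server
              host_name  {{ host }}
              address    {{ hostvars[host]['ansible_host'] | default(host) }}
          }
          {% endfor %}
      notify: Reload nagios

  handlers:
    - name: Reload nagios
      ansible.builtin.service:
        name: nagios4
        state: reloaded
```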
We have a Slack channel for Nagios notifications (#infrastructure-bot) but it's a bit noisy, mostly due to things like NTP not being in a suitable state. At the moment the channel is receiving hundreds of notifications a day, which prevents it from being useful. We need to clean up this output (fix the problems, or choose not to publish the notifications) so that we can take remedial action in a timely manner and reduce the impact on the build and test processes. This epic will have multiple individual issues underneath it to cover things that can be tackled individually.
Follow on to #1229
I'd like to propose we change this EPIC from cleaning up Nagios specifically so we can use #infrastructure-bot, to cleaning up Nagios in its entirety - i.e. upgrade the machine that Nagios is running on, audit the checks we do on the hosts, and ansibilise the method of adding hosts to Nagios.
Currently, Nagios 4.4.6 (the latest Nagios release) is running on an older version of Ubuntu on Hetzner. With the recent look at upgrading our Jenkins instance, it may be a good idea to do the same with the Nagios machine.
In addition, Nagios was initially set up by Brad using a shell script that was called by Ansible, but no documentation was written about why certain checks were put in. I attempted to clean it up last year and made a lot of issues for future enhancements, but it may be best to completely re-set it up with: documentation for why we do each check; a proper Ansibilised method of adding each host to Nagios (which, with AWX, would make it much easier to keep the hosts in Nagios up to date); and thorough documentation of any manual setup required for specific hosts (e.g. the check_SSL_cert check for the Jenkins server would likely need to be put in manually; sketched below).
Over the next little while, I'll do some local testing on my machine to ansibilise the Nagios host setup process (i.e. replacing the script Brad made) to make a start.
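As an example of the kind of host-specific manual check mentioned above, here is a hedged sketch of dropping a certificate-expiry service for the Jenkins server into the master's config; the host name, config path, and the assumption that a check_ssl_cert command object already exists in commands.cfg are all hypothetical:

```yaml
# Hedged sketch of a one-off, host-specific check; the host name, paths
# and check command are assumptions, not the production setup. Assumes
# a check_ssl_cert command is already defined in commands.cfg.
- name: Add an SSL certificate check for the Jenkins server
  hosts: nagios_server
  become: true
  tasks:
    - name: Define a certificate-expiry service for the Jenkins host
      ansible.builtin.blockinfile:
        path: /etc/nagios4/conf.d/jenkins.cfg
        create: true
        block: |
          define service {
              use                  generic-service
              host_name            jenkins-server
              service_description  SSL certificate expiry
              check_command        check_ssl_cert!ci.adoptium.net
          }
      notify: Reload nagios

  handlers:
    - name: Reload nagios
      ansible.builtin.service:
        name: nagios4
        state: reloaded
```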
Off the top of my head, here's an abstracted list of problems for the Epic. I've divided them by what can be done independently of one another:
- [ ] Provision an upgraded machine for Nagios.
- [ ] Teardown old Nagios machine.
- [ ] Review current checks in place for hosts.
- [ ] Document current and new checks for hosts. (i.e. why we're checking them, what plugins/modules we're using to check them, anything special about them, which checks are relevant to which platforms, whether the checks alert the #infrastructure-bot channel, etc.)
- [ ] Create Ansible roles to configure (i.e. install plugins) and add hosts to the Nagios Master. (a role sketch follows this list)
Optional:
- [x] Create a playbook that will install Nagios onto the Nagios master node, for openness and repeatability.
ref: #1716 #1718 #2039 #1876 #1670
(there may be more, to be honest)
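For the "Create Ansible roles" item above, a hedged sketch of what such a role's tasks might look like; the role layout, file names, and the `nagios-master` delegation are assumptions rather than an agreed design:

```yaml
# Hypothetical role layout for the checklist item above:
#
#   roles/nagios_host/
#     tasks/main.yml        <- sketched below
#     templates/host.cfg.j2 <- one "define host { ... }" block
#     handlers/main.yml     <- a "Reload nagios" handler
#
# roles/nagios_host/tasks/main.yml
- name: Install the plugins the agent-side checks rely on
  ansible.builtin.apt:
    name:
      - monitoring-plugins
      - nagios-nrpe-server
    state: present

- name: Register this host with the Nagios master
  ansible.builtin.template:
    src: host.cfg.j2
    dest: "/etc/nagios4/conf.d/{{ inventory_hostname }}.cfg"
  delegate_to: nagios-master   # hypothetical inventory name for the master
  notify: Reload nagios
```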
Ref: https://github.com/Willsparker/AnsibleBoilerPlates/tree/main/Nagios
I've created a playbook that can install Nagios-Core on at least Ubuntu 20.04 (I imagine it'd work with 21.04 too). It isn't perfect, but it'll help if we do a big new install of Nagios (and also makes it easy to create a test bed to help with #1670).
Currently working on the following tasks related to this issue:
Issue #2609 - Draft PR awaiting review ( 28/06/2022 )
1) Create Ansible playbook to install Nagios Server Core on Ubuntu 22.04
Issue #2650
2) Tidy Up Existing Nagios Processes, To Provide Base For Future Automation
Issue #2619 - Completed ( 27/06/2022 )
3) Upgrade O/S on Nagios Production Server To Ubuntu 22.04
Issue #2629
4) Automate parsing of ansible inventory into Nagios inventory
Issue ##### TBC #####
5) Automate Installation & Configuration Of Nagios Nodes ( Requires 2 & 4 )
5.1) Update server related global config files ( resource.cfg )
5.2) Update server related global config files ( nagios.cfg )
5.3) Create commands.cfg file from source.
Issue ##### TBC #####
6) Automate Nagios Server Management
6.1) Manage User Configuration / Backup / Restore / Automate ? ( cgi.cfg / htpasswd.users etc. ) ( a backup sketch follows this list )
Other tasks to be considered :
- Document standard tasks provided on base install
- Identify additional monitoring tasks, and automate adding them via ansible.
- Extend support to other O/Ss besides Ubuntu 22.04
- Restoring Current Nagios Config / Cleaning Up - TBD
- Investigate Nagios User Security Model - TBD
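For task 6.1, one hedged option is a small backup playbook that pulls the user-facing config files back to the control node; the file list and backup destination are assumptions, not the agreed process:

```yaml
# Hedged sketch for task 6.1: the file list and backup destination are
# assumptions, not the agreed process.
- name: Back up Nagios user configuration
  hosts: nagios_server
  become: true
  tasks:
    - name: Fetch user-related config files to the control node
      ansible.builtin.fetch:
        src: "{{ item }}"
        dest: backups/          # files land under backups/<hostname>/...
      loop:
        - /etc/nagios4/cgi.cfg
        - /etc/nagios4/htpasswd.users
```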
Add checks for Vagrant host, specifically around number of network connections and zombie boxes.
From discussion today with @sej-jackson and @aixtools - feel free to add anything I've missed in subsequent comments as my notes didn't get saved so I'm now writing this from memory:
- Add monitoring of `/tmp` and the file system used for the jenkins workspace (it was noted that if the workspace gets full, Jenkins will mark the machine offline, which will be picked up as a warning regardless)
- Potential to replace some of the items from the Process Check job - in particular, raising a warning alert if there is an `Xvfb` process running as the jenkins user which is over 24 hours old. As a second stage, potentially anything other than the jenkins agent process should be treated as suspicious and generate a warning if there are no jobs currently being executed (a sketch of these checks follows this list)
- Possibly use Nagios to check system logs or processes on the machine for any intrusion attempts
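A hedged sketch of how the first two bullets could be wired up as NRPE commands on a Jenkins host; the thresholds, plugin paths, workspace path and the `jenkins_agents` group name are all assumptions:

```yaml
# Hedged sketch of the disk and stale-Xvfb checks above; thresholds,
# paths and the jenkins_agents group are assumptions.
- name: Add Jenkins-host checks to NRPE
  hosts: jenkins_agents
  become: true
  tasks:
    - name: Define disk and stale-Xvfb NRPE commands
      ansible.builtin.blockinfile:
        path: /etc/nagios/nrpe.d/jenkins_checks.cfg
        create: true
        block: |
          # Disk space on /tmp and the (assumed) Jenkins workspace filesystem
          command[check_tmp]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /tmp
          command[check_workspace]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /home/jenkins
          # Warn when any Xvfb process owned by jenkins has run for > 24h (86400s)
          command[check_stale_xvfb]=/usr/lib/nagios/plugins/check_procs --metric=ELAPSED -w 86400 -C Xvfb -u jenkins
      notify: Restart nrpe

  handlers:
    - name: Restart nrpe
      ansible.builtin.service:
        name: nagios-nrpe-server
        state: restarted
```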
Other systems that could be monitored as part of the infrastructure servers or elsewhere (likely each needs custom checks):
- TRSS
- AWX
- Jenkins itself
- Vagrant server (e.g. when it runs low on available network interface numbers to use!)
We also discussed whether it could be used to alert on the things currently monitored by https://status.adoptium.net/
And we also suggested that it might be worth having a special group for systems which are being used as hosts for our 'dockerstatic' test containers, to monitor their health, whether each container is up, etc., as these are likely to form an increasing proportion of the machines we use.
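To make the dockerstatic grouping concrete, a hedged sketch of a dedicated hostgroup on the master, assuming a (hypothetical) `dockerstatic` group already exists in the Ansible inventory:

```yaml
# Hedged sketch of a dockerstatic hostgroup; the inventory group name
# and file path are assumptions.
- name: Define a hostgroup for dockerstatic container hosts
  hosts: nagios_server
  become: true
  tasks:
    - name: Write the hostgroup definition
      ansible.builtin.copy:
        dest: /etc/nagios4/conf.d/dockerstatic.cfg
        content: |
          define hostgroup {
              hostgroup_name  dockerstatic-hosts
              alias           Hosts running dockerstatic test containers
              members         {{ groups['dockerstatic'] | join(',') }}
          }
      notify: Reload nagios

  handlers:
    - name: Reload nagios
      ansible.builtin.service:
        name: nagios4
        state: reloaded
```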
I think this can probably be closed now, but I'll await @steelhead31's return and leave it in the October milestone.
This initial epic is now complete; the Nagios server has been revitalised and is much more useful ( and significantly less spammy ).