Memory usage spikes on appliance with 100+ providers
Issue Description
Memory usage periodically spikes until out of memory. Eventually something crashes and a bunch of workers are killed. Things start back up and the process repeats.
Memory used (%) since last reboot on 2024-01-31, through 2024-02-15:
Environment
- Appliance running in IBM Cloud VPC (16 vCPU | 160 GiB | 32 Gbps)
- VERSION=4.2.0 (IM)
- BUILD=20230913103031_b15dc60
- Providers:
vmdb_production=# SELECT COUNT(*),type
vmdb_production-# FROM ext_management_systems
vmdb_production-# WHERE type LIKE '%::CloudManager' OR type LIKE '%::InfraManager'
vmdb_production-# GROUP BY type;
count | type
-------+------------------------------------------------------------------
57 | ManageIQ::Providers::IbmCloud::PowerVirtualServers::CloudManager
48 | ManageIQ::Providers::IbmCloud::VPC::CloudManager
1 | ManageIQ::Providers::IbmPowerHmc::InfraManager
3 | ManageIQ::Providers::IbmPowerVc::CloudManager
1 | ManageIQ::Providers::Openstack::CloudManager
1 | ManageIQ::Providers::Openstack::InfraManager
(6 rows)
Logs
I saved off journalctl entries for the following events. The log files are too large to upload to GitHub (also, they may contain sensitive information). I can provide access to them as-needed.
Memory usage spikes ~10% on 2024-02-08 11:10
journalctl save: journalctl_20240208105500-20240208130500.txt
Memory usage drop on 2024-02-14 00:10
journalctl save: journalctl_20240214000000-20240214003000.txt
Memory usage spikes ~10% on 2024-02-15 12:55
journalctl save: journalctl_20240214125000-20240214130500.txt
top_output save: top_output_20240215.log
From the top output it looks like used stays flat while buff/cache increases:
$ grep -n "^MiB Mem" top_output_20240215.log
~~snip~~
488291:MiB Mem : 160910.6 total, 37859.1 free, 104039.2 used, 19012.3 buff/cache
489176:MiB Mem : 160910.6 total, 37666.9 free, 104048.2 used, 19195.5 buff/cache
490071:MiB Mem : 160910.6 total, 34829.9 free, 104047.5 used, 22033.1 buff/cache
490963:MiB Mem : 160910.6 total, 31841.1 free, 104259.1 used, 24810.4 buff/cache
491862:MiB Mem : 160910.6 total, 29232.8 free, 104038.6 used, 27639.2 buff/cache
492757:MiB Mem : 160910.6 total, 26424.3 free, 104029.0 used, 30457.3 buff/cache
493650:MiB Mem : 160910.6 total, 23600.8 free, 104035.0 used, 33274.8 buff/cache
494550:MiB Mem : 160910.6 total, 21883.7 free, 103876.1 used, 35150.8 buff/cache
495446:MiB Mem : 160910.6 total, 21620.7 free, 103817.4 used, 35472.6 buff/cache
~~snip~~
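To make the trend easier to see when replaying these logs, here is a minimal Ruby sketch (the file name and the "MiB Mem" column layout are assumed to match the grep output above) that pulls out the used and buff/cache columns:

```ruby
# Prints the used and buff/cache columns from each "MiB Mem" line in the top log.
# Assumes lines of the form:
#   MiB Mem : 160910.6 total, 37859.1 free, 104039.2 used, 19012.3 buff/cache
File.foreach("top_output_20240215.log") do |line|
  match = line.match(/MiB Mem\s*:\s*([\d.]+) total,\s*([\d.]+) free,\s*([\d.]+) used,\s*([\d.]+) buff\/cache/)
  next unless match
  free, used, cache = match[2].to_f, match[3].to_f, match[4].to_f
  puts format("used: %9.1f MiB  buff/cache: %9.1f MiB  free: %9.1f MiB", used, cache, free)
end
```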
I wonder if this is due to supports_feature=true in our API queries. Those force loading all providers into memory.
Typically there are not that many providers, so this slides by unnoticed, but once there are enough providers, moving that filtering out of the database and into Ruby memory starts to show up.
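As an illustration of the difference (a minimal sketch; the scope and feature name here are made up for illustration, not the actual API code path): a plain column condition is filtered by the database, while a supports-style check is a Ruby method, so every provider row has to be instantiated before it can be filtered out:

```ruby
# Filtered in SQL: only the matching rows are ever instantiated.
ExtManagementSystem.where(:enabled => true)

# Filtered in Ruby: every provider is loaded into memory first, because
# supports?(:feature) can only be answered on an instantiated object.
ExtManagementSystem.all.select { |ems| ems.supports?(:create) }
```

With a handful of providers the second form is harmless; with 100+ of them the per-request memory cost starts to add up.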
Question:
Do we know what the server was doing when it hit the out-of-memory error?
Do we know what the server was doing when it hit the out-of-memory error?
@kbrock I just shared an internal Box folder that contains a journalctl save that includes the "crash" - journalctl_20240214000000-20240214003000.txt, at around Feb 14 00:10:00.
The first OOM messages I see are related to the DB connection:
$ grep -n OutOfMemory journalctl_20240214000000-20240214003000.txt
1100:Feb 14 00:02:30 rdr-cloud-usage.com evm[2580]: WARN -- evm: MIQ(VmdbDatabaseConnection.log_statistics) Unable to log stats, 'PG::OutOfMemory: ERROR: out of memory
1184:Feb 14 00:04:30 rdr-cloud-usage.com production[1853442]: ERROR -- production: There was an exception - ActiveRecord::StatementInvalid(PG::OutOfMemory: ERROR: out of memory
1246:Feb 14 00:04:30 rdr-cloud-usage.com api[3789]: ERROR -- api: MIQ(Api::AuthController.api_error) ActiveRecord::StatementInvalid: PG::OutOfMemory: ERROR: out of memory
1248:Feb 14 00:04:30 rdr-cloud-usage.com production[3789]: FATAL -- production: Error caught: [ActiveRecord::StatementInvalid] PG::OutOfMemory: ERROR: out of memory
1979:Feb 14 00:07:41 rdr-cloud-usage.com api[807264]: ERROR -- api: MIQ(Api::AuthController.api_error) ActiveRecord::StatementInvalid: PG::OutOfMemory: ERROR: out of memory
1986:Feb 14 00:07:41 rdr-cloud-usage.com production[807264]: FATAL -- production: Error caught: [ActiveRecord::StatementInvalid] PG::OutOfMemory: ERROR: out of memory
From the top output it looks like used stays flat while buff/cache increases:
This is extremely odd; memory used by the buffer cache shouldn't cause an OOM, since it's supposed to be freeable at any time. If there was an OOM while there was still available memory, that is really weird.
One thing that comes to mind: what does the filesystem usage look like here? If you're using mmap'd memory pages backed by files and you run out of space on the filesystem, it'll actually trigger a memory fault rather than an ENOSPC error.
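If it happens again, it might be worth capturing /proc/meminfo alongside top: MemAvailable already accounts for reclaimable page cache, so if it is still large at the moment of the OOM, that would point at something like the mmap/filesystem scenario above rather than genuine memory exhaustion. A quick Ruby sketch (the field names are the standard /proc/meminfo keys):

```ruby
# Prints the /proc/meminfo fields relevant to this discussion.
# MemFree ignores reclaimable cache; MemAvailable is the kernel's estimate of
# memory that can actually be handed out without swapping.
info = File.readlines("/proc/meminfo").to_h do |line|
  key, value = line.split(":", 2)
  [key, value.strip]
end
%w[MemTotal MemFree MemAvailable Buffers Cached].each do |field|
  puts format("%-13s %s", field, info[field])
end
```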
From the top output it looks like used stays flat while buff/cache increases:
This is extremely odd; memory used by the buffer cache shouldn't cause an OOM, since it's supposed to be freeable at any time. If there was an OOM while there was still available memory, that is really weird.
I might have confused things a bit here - the top_output.log doesn't include data from when the OOM occurred, just from a subsequent memory usage jump.
what is the filesystem usage looking like here?
[root@rdr-cloud-usage ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 79G 0 79G 0% /dev
tmpfs 79G 41M 79G 1% /dev/shm
tmpfs 79G 1.9M 79G 1% /run
tmpfs 79G 0 79G 0% /sys/fs/cgroup
/dev/mapper/vg_system-lv_os 11G 6.5G 4.1G 62% /
/dev/vda1 1014M 216M 799M 22% /boot
/dev/mapper/vg_system-lv_home 1014M 47M 968M 5% /home
/dev/mapper/vg_system-lv_var 12G 2.4G 9.7G 20% /var
/dev/mapper/vg_pg-lv_pg 500G 26G 475G 6% /var/lib/pgsql
/dev/mapper/vg_system-lv_var_log 21G 14G 7.1G 67% /var/log
/dev/mapper/vg_system-lv_var_log_audit 507M 66M 441M 13% /var/log/audit
/dev/mapper/vg_system-lv_tmp 1014M 40M 975M 4% /tmp
tmpfs 16G 0 16G 0% /run/user/0
One other thing that comes to mind: I've heard of issues on Linux (specifically with ZFS on Linux, since ZFS makes heavy use of the in-memory ARC cache) where cached memory can't be freed "fast enough" to prevent an OOM.
This is assuming there was still available memory when the OOM hit, which we don't know.
This issue has been automatically marked as stale because it has not been updated for at least 3 months.
If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.