yugabyte-db icon indicating copy to clipboard operation
yugabyte-db copied to clipboard

[DocDB] Postgress Backends are terminated even when total memory usage was half of total available memory on the node.

Open shantanugupta-yb opened this issue 2 years ago • 1 comments

Jira Link: DB-2731

Description

As a part of testing changes [DB-2687 Make OOM killer prioritize PG backends with more memory], I observed that all the postgres backends got terminated even when the total memory usage was half of total available memory.

Test and instance details:

  • Cluster: 3 node/C5.Large instance
  • Sysbench oltp_read_only workload with 300 tables/25k rows per table with 30 threads running in parallel.

The total available memory on the node is 3.5G and the actual memory usage was only 1.6G(RSS). The issue here is when we have enough available free memory then why the postgres backend was terminated by signal 9 .

Postgres logs:

2022-06-22 18:20:59.883 UTC [18282] LOG: server process (PID 7916) was terminated by signal 9: Killed 2022-06-22 18:20:59.885 UTC [18282] LOG: terminating any other active server processes 2022-06-22 18:20:59.926 UTC [7913] WARNING: terminating connection because of crash of another server process 2022-06-22 18:20:59.926 UTC [7913] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2022-06-22 18:21:00.687 UTC [18282] LOG: all server processes terminated; reinitializing 2022-06-22 18:21:01.208 UTC [8982] LOG: database system was interrupted; last known up at 2022-06-22 17:18:34 UTC 2022-06-22 18:21:01.325 UTC [8982] LOG: database system was not properly shut down; automatic recovery in progress 2022-06-22 18:21:01.337 UTC [8982] LOG: redo starts at 0/1000108 2022-06-22 18:21:01.337 UTC [8982] LOG: invalid record length at 0/10001B0: wanted 24, got 0 2022-06-22 18:21:01.337 UTC [8982] LOG: redo done at 0/1000140

Dmesg log: [Wed Jun 22 18:21:00 2022] Killed process 7916 (postgres), UID 997, total-vm:650712kB, anon-rss:299340kB, file-rss:0kB, shmem-rss:52kB

image

Below is the excerpt from the dmesg log about the OOM killer message dump image

OOM score of postgres/tserver/yb-master processes image

shantanugupta-yb avatar Jun 23 '22 07:06 shantanugupta-yb

Based on the OOM killer prints:

  • RSS is much higher for Postgres backends.

Does top command output captured before the OOM happened ? @shantanugupta-yb

sushantrmishra avatar Jun 23 '22 16:06 sushantrmishra