[RESEARCH] High memory consumption after frang configuration
Motivation
After frang was configured in PR 598, tests started to fail with:
[ 6570.228871] ksoftirqd/0: page allocation failure: order:9, mode:0x40a20(GFP_ATOMIC|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[ 6570.229960] CPU: 0 PID: 12 Comm: ksoftirqd/0 Tainted: G OE 5.10.35.tfw-04d37a1 #1
[ 6570.230476] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 6570.231459] Call Trace:
[ 6570.231964] dump_stack+0x74/0x92
[ 6570.232458] warn_alloc.cold+0x7b/0xdf
Research needed
Tests to reproduce: tls/test_tls_integrity.ManyClients and tls/test_tls_integrity.ManyClientsH2 with the -T 1 option and a body > 16KB (8GB memory).
It looks like there is not enough memory.
I get a lot of Warning: cannot alloc memory for TLS encryption. messages and the following traceback:
[26347.797820] CPU: 6 PID: 50 Comm: ksoftirqd/6 Kdump: loaded Tainted: P W OE 5.10.35.tfw-04d37a1 #1
[26347.797821] Hardware name: Micro-Star International Co., Ltd. GF63 Thin 11UC/MS-16R6, BIOS E16R6IMS.10D 06/23/2022
[26347.797822] Call Trace:
[26347.797829] dump_stack+0x74/0x92
[26347.797831] warn_alloc.cold+0x7b/0xdf
[26347.797834] __alloc_pages_slowpath.constprop.0+0xd2e/0xd60
[26347.797835] ? prep_new_page+0xcd/0x120
[26347.797837] __alloc_pages_nodemask+0x2cf/0x330
[26347.797839] alloc_pages_current+0x87/0xe0
[26347.797841] kmalloc_order+0x2c/0x100
[26347.797842] kmalloc_order_trace+0x1d/0x80
[26347.797843] __kmalloc+0x3e9/0x470
[26347.797857] tfw_tls_encrypt+0x7a2/0x820 [tempesta_fw]
[26347.797860] ? memcpy_fast+0xe/0x10 [tempesta_lib]
[26347.797867] ? tfw_strcpy+0x1ae/0x2b0 [tempesta_fw]
[26347.797870] ? irq_exit_rcu+0x42/0xb0
[26347.797872] ? sysvec_apic_timer_interrupt+0x48/0x90
[26347.797873] ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[26347.797880] ? tfw_h2_make_frames+0x1da/0x370 [tempesta_fw]
[26347.797886] ? tfw_h2_make_data_frames+0x19/0x20 [tempesta_fw]
[26347.797892] ? tfw_sk_prepare_xmit+0x69c/0x7b0 [tempesta_fw]
[26347.797898] tfw_sk_write_xmit+0x6a/0xc0 [tempesta_fw]
[26347.797900] tcp_tfw_sk_write_xmit+0x36/0x80
[26347.797902] tcp_write_xmit+0x2a9/0x1210
[26347.797903] __tcp_push_pending_frames+0x37/0x100
[26347.797904] tcp_push+0xfc/0x100
[26347.797910] ss_tx_action+0x492/0x670 [tempesta_fw]
[26347.797912] net_tx_action+0x9c/0x250
[26347.797914] __do_softirq+0xd9/0x291
[26347.797915] run_ksoftirqd+0x2b/0x40
[26347.797916] smpboot_thread_fn+0xd0/0x170
[26347.797918] kthread+0x114/0x150
[26347.797918] ? sort_range+0x30/0x30
[26347.797919] ? kthread_park+0x90/0x90
[26347.797921] ret_from_fork+0x1f/0x30
[26347.797923] Mem-Info:
[26347.797925] active_anon:132045 inactive_anon:1833119 isolated_anon:0
active_file:492217 inactive_file:119308 isolated_file:0
unevictable:199 dirty:23 writeback:0
slab_reclaimable:45118 slab_unreclaimable:41418
mapped:244887 shmem:205996 pagetables:15978 bounce:0
free:758589 free_pcp:3043 free_cma:0
I get a memory leak for these tests with Tempesta commit 10b38e071bad93e87d5fab6d0213246d3b1b5c84. I used a remote setup (Tempesta on a separate VM) and the command ./run_tests.py -T 1 tls/test_tls_integrity.ManyClientsH2 with MTU 80. I ran this test with 16KB, 64KB and 200KB bodies and saw all available memory being used (6GB on my Tempesta VM) and a memory leak of ~1GB after the test.
It looks like this is fixed in #2105. I cannot reproduce the memory leak with that PR, but I still see all available memory being used. I think Tempesta uses an unexpectedly large amount of memory in these tests: with 10 clients and a 64KB request/response body, Python uses ~400MB, but Tempesta ~5GB. Why?
Here is the situation for the 64KB test.
In this test, Tempesta FW receives a 65536-byte request body from each of 10 clients, routes the requests to a server, gets the responses from the server and sends them to the clients. With the -T 1 option, each request and response is split byte by byte. The key point is that even if Tempesta FW receives only one byte, it consumes a full skb (about 900 bytes).
Tempesta FW receives at least 655360 skbs from the clients, which is 655360 * 900 = 589 824 000 bytes. Tempesta FW copies all of these skbs in ss_skb_unroll() because they are marked as cloned. Since the original skbs are marked as SKB_FCLONE_CLONE, they are not freed by consume_skb() right at this point.
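For illustration only, here is a minimal sketch of what this copy-on-clone step boils down to. This is not the actual ss_skb_unroll() code; the helper below is hypothetical and uses only standard kernel primitives:

```c
/*
 * Hypothetical illustration only, not the actual ss_skb_unroll() code: the
 * copy-on-clone decision described above, expressed with standard kernel
 * helpers.  A cloned ingress skb is copied so Tempesta FW owns writable
 * data; the clone is then released with consume_skb(), which, for an
 * SKB_FCLONE_CLONE skb, only drops a reference and does not free the
 * underlying memory yet.
 */
#include <linux/skbuff.h>

static struct sk_buff *unroll_ingress_skb(struct sk_buff *skb)
{
	struct sk_buff *nskb;

	if (!skb_cloned(skb))
		return skb;			/* fast path: data is already ours */

	nskb = skb_copy(skb, GFP_ATOMIC);	/* full private copy */
	if (!nskb)
		return NULL;			/* allocation failure under pressure */

	consume_skb(skb);			/* drop the reference to the clone */
	return nskb;
}
```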
Next, before routing these skbs to the server, Tempesta FW clones them in ss_send() so that they can be resent if something goes wrong.
After the server has responded, Tempesta FW receives the same number of skbs as it got from the clients, and since all of them are also marked as cloned, it copies them too.
At this point at least 589 824 000 * 5 = 2 949 120 000 bytes have been allocated, and only when Tempesta FW starts sending responses to the clients does it start freeing skbs.
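A back-of-the-envelope sketch of this accounting (the ~900 bytes per one-byte skb and the x5 multiplier are the estimates from above, nothing here is measured):

```c
/*
 * Rough estimate of the skb memory pinned by the 64KB test before responses
 * start going out.  All constants come from the analysis above.
 */
#include <stdio.h>

int main(void)
{
	const long long clients = 10;
	const long long body = 65536;		/* request body, split byte by byte */
	const long long skb_cost = 900;		/* approx. truesize of a 1-byte skb */

	long long ingress_skbs = clients * body;		/* 655360 skbs */
	long long ingress_bytes = ingress_skbs * skb_cost;	/* 589 824 000 bytes */

	/*
	 * x1 client skbs + x1 copies in ss_skb_unroll() + x1 clones in ss_send()
	 * + x1 server response skbs + x1 copies of the responses  =>  ~x5.
	 */
	long long peak_bytes = ingress_bytes * 5;

	printf("ingress: %lld skbs, %lld bytes\n", ingress_skbs, ingress_bytes);
	printf("peak before freeing starts: ~%lld bytes (~%.2f GiB)\n",
	       peak_bytes, peak_bytes / (1024.0 * 1024 * 1024));
	return 0;
}
```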
@biathlon3 thank you for the detailed analysis! I still have a couple of questions and would appreciate your elaboration on them:
- skb_cloned() in ss_skb_unroll() comes under unlikely(), and IIRC this is because modern HW NICs form skbs with data in pages only (unfortunately, I don't remember why clones appear otherwise). So please research why clones appear in the network stack. Does moving to different virtual adapters (e.g. virtio-net or SR-IOV) help to avoid the clones? Please see https://tempesta-tech.com/knowledge-base/Hardware-virtualization-performance/ . Since virtual environments aren't rare, we probably need to remove unlikely(), add comments to the code explaining why clones appear, and rework our wiki recommendations for virtual environments.
- What does sk_buff spend 900 bytes on? Could you please write down how much memory each part of the SKB takes and which Linux kernel compilation options may reduce the memory footprint? This probably should be documented in our wiki.
What does sk_buff spend 900 bytes on? Could you please write down how much memory each part of the SKB takes
An empty skb immediately after ss_skb_alloc(0), or a received skb: sizeof(struct sk_buff) = 232, hdr_len = 320, sizeof(struct skb_shared_info) = 320.
232 + 320 + 320 = 872. Actually a little bit more: the smallest truesize = 896.
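These sizes can be re-checked on a particular build with a throwaway diagnostic module like the sketch below (not Tempesta FW code; the exact values depend on the kernel configuration):

```c
/*
 * Throwaway diagnostic module (not Tempesta FW code) to re-check the per-skb
 * overhead on a particular build; the exact values depend on the kernel
 * configuration.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/skbuff.h>

static int __init skb_cost_init(void)
{
	struct sk_buff *skb = alloc_skb(0, GFP_KERNEL);

	pr_info("sizeof(struct sk_buff) = %zu\n", sizeof(struct sk_buff));
	pr_info("sizeof(struct skb_shared_info) = %zu\n",
		sizeof(struct skb_shared_info));
	if (skb) {
		pr_info("truesize of alloc_skb(0, GFP_KERNEL) = %u\n",
			skb->truesize);
		kfree_skb(skb);
	}
	return 0;
}

static void __exit skb_cost_exit(void)
{
}

module_init(skb_cost_init);
module_exit(skb_cost_exit);
MODULE_LICENSE("GPL");
```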
So please research why clones appear in the network stack. Does moving to different virtual adapters (e.g. virtio-net or SR-IOV) help to avoid the clones?
Skbs are marked as cloned when the test is started on the same virtual machine as Tempesta FW; it is not related to the type of virtual adapter.
If the test runs on a separate VM, Tempesta FW receives uncloned skbs with the data collected in pages, and during parsing Tempesta FW calls ss_skb_split() for each portion of data. This variant is not as memory-demanding as the first one. But in tls.test_tls_integrity.ManyClientsH2 Tempesta FW additionally has to translate requests to HTTP/1 and responses back to HTTP/2, and this also costs extra memory.
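One way to continue the research on where the clones come from is to count how many ingress skbs arrive cloned on both setups. A hypothetical throwaway diagnostic (not part of Tempesta FW) might look like this, assuming a netfilter pre-routing hook is an acceptable sampling point:

```c
/*
 * Hypothetical diagnostic (not part of Tempesta FW): count how many ingress
 * skbs arrive cloned, to compare the same-VM and separate-VM setups.
 */
#include <linux/atomic.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>
#include <net/net_namespace.h>

static atomic64_t total_skbs = ATOMIC64_INIT(0);
static atomic64_t cloned_skbs = ATOMIC64_INIT(0);

static unsigned int
count_clones(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
	atomic64_inc(&total_skbs);
	if (skb_cloned(skb))
		atomic64_inc(&cloned_skbs);
	return NF_ACCEPT;
}

static struct nf_hook_ops clone_counter_ops = {
	.hook		= count_clones,
	.pf		= NFPROTO_IPV4,
	.hooknum	= NF_INET_PRE_ROUTING,
	.priority	= NF_IP_PRI_FIRST,
};

static int __init clone_counter_init(void)
{
	return nf_register_net_hook(&init_net, &clone_counter_ops);
}

static void __exit clone_counter_exit(void)
{
	nf_unregister_net_hook(&init_net, &clone_counter_ops);
	pr_info("ingress skbs: %lld total, %lld cloned\n",
		(long long)atomic64_read(&total_skbs),
		(long long)atomic64_read(&cloned_skbs));
}

module_init(clone_counter_init);
module_exit(clone_counter_exit);
MODULE_LICENSE("GPL");
```

Note that what this hook sees may differ from what Tempesta FW's socket hooks observe, since cloning can also happen later in the local delivery path; it is only meant as a starting point for comparing the two setups.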