[VPP-1947] Simpler processing is less efficient with more workers

Open vvalderrv opened this issue 10 months ago • 7 comments

Description

This is not a new issue, but I probably have not opened a Jira ticket for this yet.

It seems that tests with simpler processing (those with higher throughput when 1 core is used) start showing performance degradation when multiple cores are used, eventually reaching lower throughput than other tests with similar but less simple processing.

Example graphs from the 2009 report: [0].

As far as I can tell, this affects all NICs (as long as the performance is not capped by hardware limits), all drivers (RDMA, DPDK, AVF) and all architectures (they only differ in the number of cores at which the regression starts).

I have prepared a small test run [1] on Haswell, as other testbeds are busy. Haswell is a 3-node testbed (no hyperthreading), so outputs are from 2 VPP boxes, and traffic is not entirely symmetric on them (one direction has fewer packets, as some were already lost on the other box). I will collect more tests later.

L2patch is faster than l2bdbasemaclrn with 2 cores, but slower with 4 cores. It affects both MRR and NDR/PDR results. Both tests use the same traffic profile, only VPP configuration is different.
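
The comparison above can be expressed as a small calculation. The following is only a sketch: the function names are hypothetical and the throughput values are placeholders chosen to show the shape of the problem, not numbers taken from the linked run.

```python
def scaling_efficiency(throughput_by_cores):
    """Per-core efficiency relative to the smallest measured core count.

    1.0 means linear scaling; lower values mean added cores help less.
    """
    base_cores = min(throughput_by_cores)
    base_rate = throughput_by_cores[base_cores]
    return {
        cores: (rate / base_rate) / (cores / base_cores)
        for cores, rate in sorted(throughput_by_cores.items())
    }


def crossover_cores(simple, less_simple):
    """Smallest common core count where the simpler test is slower, else None."""
    for cores in sorted(set(simple) & set(less_simple)):
        if simple[cores] < less_simple[cores]:
            return cores
    return None


# PLACEHOLDER values in Mpps, keyed by core count. Not measured results.
l2patch = {1: 10.0, 2: 18.0, 4: 20.0}
l2bdbasemaclrn = {1: 8.0, 2: 15.0, 4: 28.0}

print(scaling_efficiency(l2patch))
print(crossover_cores(l2patch, l2bdbasemaclrn))
```

With real MRR or NDR/PDR values substituted in, a crossover at a small core count combined with a sharp efficiency drop for the simpler test would match the behavior described here.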

Looking at "show run" [2], there is some mismatch between *-output and *-tx nodes, but not big enough to explain the performance. Number of processed packets per cycle is low, so ideally no loss should happen.

Looking at statistics [3] after measurements, I see rx_dropped_packets large enough to explain the performance. But I see no good reason why RX buffers should get full with this small number of cycles per packet.
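
The reasoning in the two paragraphs above amounts to two checks, sketched below. Everything here is an assumption made for illustration: the function names are hypothetical, the inputs are placeholders rather than values from [2] or [3], and treating the worker as limited only by its summed cycles per packet is a simplification.

```python
def drops_explain_loss(offered, forwarded, rx_dropped):
    """Return (lost packets, fraction of the loss covered by RX drops)."""
    lost = offered - forwarded
    if lost <= 0:
        return 0, 1.0
    return lost, rx_dropped / lost


def worker_utilization(clock_hz, cycles_per_packet, offered_pps):
    """Fraction of one worker's cycle budget needed for the offered rate.

    Assumes cycles_per_packet is the total over all graph nodes on the path.
    """
    max_pps = clock_hz / cycles_per_packet
    return offered_pps / max_pps


# PLACEHOLDER inputs, not taken from the linked logs.
lost, covered = drops_explain_loss(
    offered=1_000_000, forwarded=900_000, rx_dropped=95_000
)
busy = worker_utilization(clock_hz=2.0e9, cycles_per_packet=100, offered_pps=5.0e6)

print(lost, covered)  # loss count and the share attributed to RX drops
print(busy)           # well below 1.0 means the worker should keep up
```

A covered fraction near 1.0 together with a utilization well below 1.0 is the puzzling combination reported here: the drops happen at RX even though the worker appears to have spare cycles.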

The only explanation I see is that more frequent polling somehow causes the packet loss (as if reading from RX queue applies a lock, preventing NIC from adding packets there). Is that possible? If yes, are there any recommended workarounds? If no, is there a bug in VPP to fix?

[0] https://docs.fd.io/csit/rls2009/report/vpp_performance_tests/throughput_speedup_multi_core/l2-2n-clx-xxv710.html

[1] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-hsw/1161/archives/log.html.gz#s1-s1-s1-s1

[2] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-hsw/1161/archives/log.html.gz#s1-s1-s1-s1-s3-t2-k2-k9-k6-k7-k1-k1-k1-k1-k12-k1-k1-k1-k1

[3] https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-hsw/1161/archives/log.html.gz#s1-s1-s1-s1-s3-t2-k2-k9-k6-k10-k1-k1-k1-k1

Assignee

Unassigned

Reporter

Vratko Polak

Comments

vrpolak (Thu, 22 Feb 2024 15:26:43 +0000):
Some comparisons ([5] for 2n-spr AVF and [6] for 3n-tsh) from the last release, confirming this is still an issue.

[5] https://csit.fd.io/comparisons/#eNqNkE0OgjAQhU-DG1MDBWTFQmBrYogXaGBQIi3NtBD19Lb8WNmZNGnfvK-d11HQQaWhTr0k82iC0ACCqMCcvfC0n6sK9KKvOBgr309Fe7PtxQathxU12yilFTNvnBHQmdgpGvmU0PAQ-ATNa0yBw1vRIHN0KIhWdxL4N6AyeMbUJ7WsH46vevyGjm0SmlsZF4vfIFfte4sco2zDaP2SW-JSlK7FuSwXPJnW2phLhn8MzFCMg_6dwRp6BkbWDeDMaPrA1GoneuSpjRQXH-SZcFs

[6] https://csit.fd.io/comparisons/#eNqNkM8OgjAMxp8GL2RmDBAvHASuJob4AgsUQwJjdoNEn96NP05uJku29vu1_VYFHVQa6tRLMo8lCA0giArM2wsv_pJVoNf4jqORcn9O2sp2EDu0HjfUXJOUNlh4o0yATsROsYgywsJjQAmablyBw1vRIHc0E0RJJAGlD2AygHNAqyfhU-MqqgG_tmPrheU2jItVb7BX7XuPnKJsx2j9knviVpRuxLUsVzyZzza4lxz_WJmheA_6dwub6QWYeDeCE6P5A_OogxiwT62luPgA2AVw7Q

vrpolak (Mon, 24 Apr 2023 10:54:57 +0000):
Recent example: a merge [9] into VPP causes multiple progressions, but also a few regressions, in trending [10].

[9] https://gerrit.fd.io/r/c/vpp/+/38452?forceReload=true

[10] https://csit.fd.io/trending/#eNrdl99OgzAUxp8Gb0wTWgfsxgsn72G6ciZE2GpbCfPpbdF4usyhCX9SdwGknMP5Pn45nAZtDgqeNNT3UbKJsk3Esqqwp-ju4dZeWikJ25NKdITG8TMwSWFNY_FKeLsjNSPpakuoIGBKu5LciJI0SrkC7NEVKN7MSTWMyPKIkYsamM8VcHygZhgxoD0Rzwqm7BRvQFfvgHnWOcaFpYAhKk6Lm6P0ol_vl-V9xkhkndhyDWEw-QSGLRCFi-zN5oTuaZO-xHa5K02hloYvcaS_n26rs1o3H-flVx51NzKHmO5nclcoma1BrB5bkIEl5JsjeBSd2dBcOkguG83g-Cs-XBTT7b_gztX0y234GNn2uTEQtxqs2zgZ5rXFGXzbR9TsdsmT5L8pv9QTX9n0GSfwBvyz0l

vrpolak (Thu, 24 Feb 2022 12:26:34 +0000):
Adding links to runtime statistics after a trial at PDR rate. Still for 2110 VPP, dpdk_plugin, l2-patch, but on 2n-skx.

This time there is no mismatch on tx and there are no rx misses, but 4c [8] still forwards fewer packets than 2c [7] (which forwards slightly more than 1c [6]).

[6] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2110-2n-skx/97/log.html.gz#s1-s1-s1-s5-s20-t1-k2-k9-k20-k14-k1-k1-k1-k1

[7] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2110-2n-skx/97/log.html.gz#s1-s1-s1-s5-s20-t2-k2-k9-k20-k14-k1-k1-k1-k1

[8] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2110-2n-skx/97/log.html.gz#s1-s1-s1-s5-s20-t3-k2-k9-k20-k14-k1-k1-k1-k1

vrpolak (Tue, 16 Nov 2021 11:14:33 +0000):
This also affects testpmd on most architectures [5]:

[5] https://s3-docs.fd.io/csit/master/report/dpdk_performance_tests/throughput_speedup_multi_core/2n-skx-xxv710.html

vrpolak (Tue, 9 Nov 2021 14:44:40 +0000):
Still present; adding a newer link to the report [4] (not officially released yet):

[4] https://s3-docs.fd.io/csit/master/report/vpp_performance_tests/throughput_speedup_multi_core/l2-2n-clx-xxv710.html

Original issue: https://jira.fd.io/browse/VPP-1947

Feb 02 '25 10:02 vvalderrv