
Intel XL710 40Gbps saturation

Open marcofaltelli opened this issue 4 years ago • 4 comments

Hi, I'm trying to saturate an XL710 Intel NIC with 64B packets. On a single core I manage to obtain 21 Mpps (which is about 11 Gbps). From your paper I understood that these NICs can get up to 22 Gbps with 64B packets, so I tried to create multiple sender slaves on different cores. The results are odd: I get around 13 Mpps received in total, but that's also the number that every Tx queue's statistics report, even though in my code I've created three different Tx counters, one for each Tx queue (see below).


[Device: id=1] RX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6637 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=1] RX: 12.99 Mpps, 6652 Mbit/s (8730 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=1] RX: 12.95 Mpps, 6632 Mbit/s (8704 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)
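
(For reference on these numbers: a 64 B frame occupies 64 + 20 = 84 B on the wire once preamble, SFD and the inter-frame gap are added, so 12.96 Mpps × 64 B × 8 ≈ 6.6 Gbit/s and 12.96 Mpps × 84 B × 8 ≈ 8.7 Gbit/s, matching the two columns above. The 64 B line rate on 40 GbE is 40 Gbit/s / (84 B × 8) ≈ 59.5 Mpps, and the single-core 21 Mpps works out to the ~11 Gbps mentioned above.)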

My master and slave functions are as follows. They are taken from this test of the software-switches suite.

function master(args)
	txDev = device.config{port = args.txDev, rxQueues = 4, txQueues = 4}
	rxDev = device.config{port = args.rxDev, rxQueues = 4, txQueues = 4}
	device.waitForLinks()
	-- leave headroom for at most 1 kpps of timestamping traffic
	-- rate will be somewhat off for high-latency links at low rates
	if args.rate > 0 then
		txDev:getTxQueue(0):setRate(args.rate - (args.size + 4) * 8 / 1000)
		txDev:getTxQueue(1):setRate(args.rate - (args.size + 4) * 8 / 1000)
		txDev:getTxQueue(3):setRate(args.rate - (args.size + 4) * 8 / 1000)
	end
	rxDev:getTxQueue(0).dev:UdpGenericFilter(rxDev:getRxQueue(3))

	mg.startTask("loadSlave", txDev:getTxQueue(0), rxDev, args.size)
	mg.startTask("loadSlave", txDev:getTxQueue(1), rxDev, args.size)
	mg.startTask("loadSlave", txDev:getTxQueue(3), rxDev, args.size)
	mg.startTask("receiveSlave", rxDev:getRxQueue(3), rxDev, args.size)
	mg.waitForTasks()
end

function loadSlave(queue, rxDev, size)

	log:info(green("Starting up: LoadSlave"))


	-- retrieve the number of xstats on the receiving NIC
	-- xstats-related C definitions are in device.lua
	local numxstats = 0
	local xstats = ffi.new("struct rte_eth_xstat[?]", numxstats)

	-- because there is no easy function that returns the number of xstats, we try to retrieve
	-- them with a zero-sized array
	-- if result > numxstats (0 in our case), then result equals the real number of xstats
	local result = C.rte_eth_xstats_get(rxDev.id, xstats, numxstats)
	numxstats = tonumber(result)

	local mempool = memory.createMemPool(function(buf)
		fillUdpPacket(buf, size)
	end)
	local bufs = mempool:bufArray()
	local txCtr = stats:newDevTxCounter(queue, "plain")
	local baseIP = parseIPAddress(SRC_IP_BASE)
	local dstIP = parseIPAddress(DST_IP)

	-- send out UDP packets until the user stops the script
	while mg.running() do
		bufs:alloc(size)
		for i, buf in ipairs(bufs) do
			local pkt = buf:getUdpPacket()
			pkt.ip4.src:set(baseIP)
			pkt.ip4.dst:set(dstIP)
		end
		-- UDP checksums are optional, so using just IPv4 checksums would be sufficient here
		--bufs:offloadUdpChecksums()
		queue:send(bufs)
		txCtr:update()
	end
	txCtr:finalize()
end
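
One detail worth noting in the listing above: the TX counter is created with stats:newDevTxCounter, and all three TX statistics lines in the output are labeled [Device: id=0], i.e. they appear to report device-level counters rather than per-queue ones, which would explain why every counter prints the same aggregate rate. Below is a minimal sketch of a per-queue software counter instead (function name hypothetical, reusing the modules and the fillUdpPacket helper already required by the script above), assuming stats:newManualTxCounter, updateWithSize and queue.qid behave as in MoonGen's bundled examples:

function loadSlavePerQueueCtr(queue, size)
	-- per-queue software counter; the string is just a label
	local txCtr = stats:newManualTxCounter("TxQueue " .. queue.qid, "plain")
	local mempool = memory.createMemPool(function(buf)
		fillUdpPacket(buf, size)
	end)
	local bufs = mempool:bufArray()
	while mg.running() do
		bufs:alloc(size)
		-- queue:send() returns the number of packets sent,
		-- so only this queue's traffic is counted here
		txCtr:updateWithSize(queue:send(bufs), size)
	end
	txCtr:finalize()
end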

Do you have any best practices for scaling to multiple queues and cores on the same NIC? I also tried the tx-multi-core.lua test you used for your paper, but those scripts are not compatible anymore. Cheers

marcofaltelli avatar Dec 03 '20 12:12 marcofaltelli

Can you post the code you use for receiveSlave?

emmericp avatar Dec 03 '20 14:12 emmericp

Oops, sorry, I forgot to paste it. Here it is:

function receiveSlave(rxQueue, rxDev, size)
	log:info(green("Starting up: ReceiveSlave"))

	local mempool = memory.createMemPool()
	local rxBufs = mempool:bufArray()
	local rxCtr = stats:newDevRxCounter(rxDev, "plain")

	-- this will catch a few packets but also cause out_of_buffer errors, which show up in the stats
	while mg.running() do
		local rx = rxQueue:tryRecvIdle(rxBufs, 10)
		rxBufs:freeAll()
		rxCtr:update()
	end
	rxCtr:finalize()
end
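
(A side note on the counter in this listing: newDevRxCounter is also created on the device, so it should report packets received on all RX queues of rxDev, not just the one this task polls.)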

marcofaltelli avatar Dec 03 '20 15:12 marcofaltelli

That should work; I'm not sure what is going on here. I'll need to test this on real hardware and will get back to this.

emmericp avatar Dec 03 '20 20:12 emmericp

Hi @emmericp, I think I'm having a similar problem. I use this simple, stripped-down example to test multi-core performance:

local mg     = require "moongen"
local memory = require "memory"
local device = require "device"
local stats  = require "stats"

local PKT_SIZE	= 60

function configure(parser)
	parser:description("Generates traffic.")
	parser:argument("dev", "Device to transmit from."):convert(tonumber)
	parser:option("-c --core", "Number of cores."):default(1):convert(tonumber)
end

function master(args)
	dev = device.config({port = args.dev, txQueues = args.core})
	device.waitForLinks()

	for i=0,args.core-1 do
		mg.startTask("loadSlave", dev:getTxQueue(i))
	end

	local ctr = stats:newDevTxCounter(dev)
	
	while mg.running() do
		ctr:update()
		mg.sleepMillisIdle(10)
	end

	ctr:finalize()
end

function loadSlave(queue)
	local mem = memory.createMemPool(function(buf)
		buf:getUdpPacket():fill({
			pktLength=PKT_SIZE
		})
	end)
	local bufs = mem:bufArray()

	while mg.running() do
		bufs:alloc(PKT_SIZE)
		queue:send(bufs)
	end
end
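
For reference, a script like this would typically be launched through the MoonGen binary, e.g. ./build/MoonGen multi-core-test.lua 0 -c 4 (run as root; script name hypothetical), i.e. transmitting from port 0 with four loadSlave tasks, each of which mg.startTask should place on its own core.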

On an Intel Xeon Gold 5120 with 14 physical cores (HyperThreading disabled) I get the following numbers:

Cores  Mpps
1      21.42
2      15.44
3      13.75
4      13.87
5      13.69
6      13.81

On another machine with an Intel Xeon E3-1245 (4 cores + HyperThreading, i.e. 8 logical cores) I get the following:

Cores  Mpps
1      21.40
2      34.64
3      33.96
4      34.65
5      42.62
6      42.65

In this last case I'm able to saturate the link, but I'm wasting a lot of cores. On both machines I can saturate the link with just two cores using pktgen-dpdk (v20.11.3 on DPDK 20.08).

FedeParola avatar Mar 29 '21 14:03 FedeParola