MoonGen
Intel XL710 40Gbps saturation
Hi, I'm trying to saturate an Intel XL710 NIC with 64 B packets. On a single core I manage to get 21 Mpps (about 11 Gbit/s). From your paper I understood that these NICs can reach up to 22 Gbit/s with 64 B packets, so I tried to create multiple sender slaves on different cores. The results are kind of strange: I receive around 13 Mpps in total, but that is also the number that every Tx queue's statistics reports, even though in my code I've created three different Tx counters, one for every Tx queue (see below).
[Device: id=1] RX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6637 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=1] RX: 12.99 Mpps, 6652 Mbit/s (8730 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=1] RX: 12.95 Mpps, 6632 Mbit/s (8704 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)
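For context, the back-of-the-envelope arithmetic behind these numbers, as a quick sketch in plain Lua (nothing MoonGen-specific; the 20 B of preamble, SFD and inter-frame gap are what the "with framing" rate adds on top of each 64 B frame):

```lua
-- rough throughput arithmetic for 64 B frames on a 40 GbE link
local frame    = 64    -- frame size in bytes, incl. CRC
local overhead = 20    -- preamble + SFD + inter-frame gap on the wire
local linkRate = 40e9  -- bit/s

-- packets per second needed for line rate: 40e9 / (84 * 8) ≈ 59.52 Mpps
print(("line rate: %.2f Mpps"):format(linkRate / ((frame + overhead) * 8) / 1e6))

-- what 21 Mpps corresponds to: ~10.75 Gbit/s of frames, ~14.1 Gbit/s on the wire
print(("21 Mpps: %.2f Gbit/s (%.2f Gbit/s with framing)"):format(
    21e6 * frame * 8 / 1e9,
    21e6 * (frame + overhead) * 8 / 1e9))
```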
My master and slave functions are as follows; they are taken from this test in the software-switches suite.
function master(args)
    txDev = device.config{port = args.txDev, rxQueues = 4, txQueues = 4}
    rxDev = device.config{port = args.rxDev, rxQueues = 4, txQueues = 4}
    device.waitForLinks()
    -- subtract the bandwidth of the (at most 1 kpps) timestamping traffic
    -- the rate will be somewhat off for high-latency links at low rates
    if args.rate > 0 then
        txDev:getTxQueue(0):setRate(args.rate - (args.size + 4) * 8 / 1000)
        txDev:getTxQueue(1):setRate(args.rate - (args.size + 4) * 8 / 1000)
        txDev:getTxQueue(3):setRate(args.rate - (args.size + 4) * 8 / 1000)
    end
    rxDev:getTxQueue(0).dev:UdpGenericFilter(rxDev:getRxQueue(3))
    mg.startTask("loadSlave", txDev:getTxQueue(0), rxDev, args.size)
    mg.startTask("loadSlave", txDev:getTxQueue(1), rxDev, args.size)
    mg.startTask("loadSlave", txDev:getTxQueue(3), rxDev, args.size)
    mg.startTask("receiveSlave", rxDev:getRxQueue(3), rxDev, args.size)
    mg.waitForTasks()
end
function loadSlave(queue, rxDev, size)
    log:info(green("Starting up: LoadSlave"))
    -- retrieve the number of xstats on the receiving NIC
    -- xstats-related C definitions are in device.lua
    local numxstats = 0
    local xstats = ffi.new("struct rte_eth_xstat[?]", numxstats)
    -- there is no easy function that returns the number of xstats, so we try to retrieve
    -- them with a zero-sized array
    -- if result > numxstats (0 in our case), then result equals the real number of xstats
    local result = C.rte_eth_xstats_get(rxDev.id, xstats, numxstats)
    numxstats = tonumber(result)
    local mempool = memory.createMemPool(function(buf)
        fillUdpPacket(buf, size)
    end)
    local bufs = mempool:bufArray()
    local txCtr = stats:newDevTxCounter(queue, "plain")
    local baseIP = parseIPAddress(SRC_IP_BASE)
    local dstIP = parseIPAddress(DST_IP)
    -- send out UDP packets until the user stops the script
    while mg.running() do
        bufs:alloc(size)
        for i, buf in ipairs(bufs) do
            local pkt = buf:getUdpPacket()
            pkt.ip4.src:set(baseIP)
            pkt.ip4.dst:set(dstIP)
        end
        -- UDP checksums are optional, so using just IPv4 checksums would be sufficient here
        --bufs:offloadUdpChecksums()
        queue:send(bufs)
        txCtr:update()
    end
    txCtr:finalize()
end
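A side note on the counters: as far as I can tell, stats:newDevTxCounter reads device-level statistics, which would explain why all three Tx counters print the same total in the output above. If per-queue numbers are wanted, a manually updated counter per task is one option. This is only a minimal sketch, assuming the newManualTxCounter/updateWithSize API used in the stock MoonGen examples and the same helpers as the loadSlave above; the per-packet IP setup is omitted for brevity, and queue.qid as a label is an assumption about the queue object's fields:

```lua
function loadSlave(queue, rxDev, size)
    local mempool = memory.createMemPool(function(buf)
        fillUdpPacket(buf, size)
    end)
    local bufs = mempool:bufArray()
    -- one counter per task, fed manually with what this queue actually sent
    local txCtr = stats:newManualTxCounter("TxQueue " .. queue.qid, "plain")
    while mg.running() do
        bufs:alloc(size)
        -- send() returns the number of packets sent from this buf array
        txCtr:updateWithSize(queue:send(bufs), size)
    end
    txCtr:finalize()
end
```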
Do you have any best practices for scaling to multiple queues and cores on the same NIC? I also tried the tx-multi-core.lua test you used for your paper, but those scripts are no longer compatible. Cheers
Can you post the code that you use for receiveSlave?
Oops, sorry, I forgot to paste it. Here it is:
function receiveSlave(rxQueue, rxDev, size)
    log:info(green("Starting up: ReceiveSlave"))
    local mempool = memory.createMemPool()
    local rxBufs = mempool:bufArray()
    local rxCtr = stats:newDevRxCounter(rxDev, "plain")
    -- this will catch a few packets, but also causes out_of_buffer errors to show up in the stats
    while mg.running() do
        local rx = rxQueue:tryRecvIdle(rxBufs, 10)
        rxBufs:freeAll()
        rxCtr:update()
    end
    rxCtr:finalize()
end
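An aside on the out_of_buffer comment above: the xstats snippet in loadSlave only retrieves the number of counters; to see whether the receiving port is actually dropping packets, the counters can be read together with their names. A rough, untested sketch, assuming the same ffi/C locals as above and that device.lua also declares struct rte_eth_xstat_name and rte_eth_xstats_get_names() next to the xstats definitions it already has (otherwise those cdefs would have to be added):

```lua
local function dumpXstats(dev)
    -- query with a NULL array first to learn how many xstats the port exposes
    local num = tonumber(C.rte_eth_xstats_get(dev.id, nil, 0))
    local values = ffi.new("struct rte_eth_xstat[?]", num)
    local names = ffi.new("struct rte_eth_xstat_name[?]", num)
    C.rte_eth_xstats_get(dev.id, values, num)
    C.rte_eth_xstats_get_names(dev.id, names, num)
    for i = 0, num - 1 do
        -- only print non-zero counters, e.g. rx_out_of_buffer or rx_missed_errors
        if values[i].value ~= 0 then
            print(ffi.string(names[i].name), tonumber(values[i].value))
        end
    end
end
```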
The receiveSlave should work; I'm not sure what is going on here. I'll need to test this on real hardware and get back to you.
Hi @emmericp, I think I'm having a similar problem. I'm using this simple, stripped-down example to test multi-core performance:
local mg = require "moongen"
local memory = require "memory"
local device = require "device"
local stats = require "stats"

local PKT_SIZE = 60

function configure(parser)
    parser:description("Generates traffic.")
    parser:argument("dev", "Device to transmit from."):convert(tonumber)
    parser:option("-c --core", "Number of cores."):default(1):convert(tonumber)
end

function master(args)
    dev = device.config({port = args.dev, txQueues = args.core})
    device.waitForLinks()
    for i = 0, args.core - 1 do
        mg.startTask("loadSlave", dev:getTxQueue(i))
    end
    local ctr = stats:newDevTxCounter(dev)
    while mg.running() do
        ctr:update()
        mg.sleepMillisIdle(10)
    end
    ctr:finalize()
end

function loadSlave(queue)
    local mem = memory.createMemPool(function(buf)
        buf:getUdpPacket():fill({
            pktLength = PKT_SIZE
        })
    end)
    local bufs = mem:bufArray()
    while mg.running() do
        bufs:alloc(PKT_SIZE)
        queue:send(bufs)
    end
end
On an Intel Xeon Gold 5120 with 14 physical cores (HyperThreading disabled) I get the following numbers:
| Cores | Mpps  |
|-------|-------|
| 1     | 21.42 |
| 2     | 15.44 |
| 3     | 13.75 |
| 4     | 13.87 |
| 5     | 13.69 |
| 6     | 13.81 |
On another machine with an Intel Xeon E3-1245 (4 cores + HyperThreading, 8 logical cores) I get the following:
| Cores | Mpps  |
|-------|-------|
| 1     | 21.40 |
| 2     | 34.64 |
| 3     | 33.96 |
| 4     | 34.65 |
| 5     | 42.62 |
| 6     | 42.65 |
In this last case I'm able to saturate the link, but I'm wasting a lot of cores. On both machines I can saturate the link with just two cores using pktgen-dpdk (v20.11.3 on DPDK 20.08).