
100G Packet Rates: Per-CPU vs Per-Port

Open lukego opened this issue 8 years ago • 102 comments

I am pondering how to think about packet rates in the 100G era. How should we be designing and optimizing our software?

Consider these potential performance targets:

  • A: 1x100G @ 64 Mpps.
  • B: 1x100G @ 96 Mpps.
  • C: 2x100G @ 128 Mpps (max 64 Mpps per port).

I have a whole mix of questions about these in my mind:

  • Which target is most meaningful? For which applications?
  • How do you optimize for each target?
  • Which one is harder to achieve? Is the main challenge software optimization or hardware selection?

Raw brain dump...

  • Performance "A" seems suitable for many applications. This is providing 100G of bandwidth for average packet size of around 192 bytes or higher.
  • Performance "B" may be more suitable for some specific applications. Load generators (like packetblaster) may want to maximize the packet rate on a single port. Packet capture applications (like firehose) may need to never miss a packet.
  • Performance "C" seems suitable for applications that need to scale out. For example IPv4-IPv6 translation (lwAFTR) may want to maximize hardware density by handling as many 100G ports per server as possible.

So how would you optimize for each? In every case you would surely use multiple cores with RSS or equivalent traffic dispatching. Beyond that...

  • Performance "A" would let you choose your own trade-off between software optimization or throwing hardware at the problem. If you are ambitious you may pick a low-end CPU like a Xeon E3-1650v3 (6 cores @ 3.5 GHz) for 328 cycles per packet processing budget. If you prefer to throw hardware at the problem you may pick a high-end CPU like a Xeon E5-2699v4 (22 cores @ 2.2 GHz) for 756 cycles per packet. There are plenty of points in between, too.
  • Performance "B" is putting more strain on both the CPU and the NIC. You may need to use special driver routines to get the NIC performance you want & this may involve trading off CPU resources to reduce load on the NIC. For example, ConnectX-4 NICs may provide better packet rates with "inline descriptor" mode but this involves a complete packet copy in the driver routine i.e. spending CPU cycles and cache footprint to assist the DMA engine on the NIC. Similarly the NICs have hardware features for e.g. tuple-matching on packets but those surely have performance limits (e.g. rules checked per second) that you are more likely to discover the harder you press them.
  • Performance "C" is putting twice the strain on the CPU compared with "A". This would require software optimization since it halves the cycles-per-packet budget. It should not cause NIC problems since the per-port load is the same. There seems to be a risk of uncovering performance limits in the "uncore" parts of the processor e.g. the L3 cache, the DMA engine, the RAM controller, the IOMMU (if it were used), and so on.

So which would be hardest to achieve, and why?

The one I have a bad feeling about is "B". Historically we are used to NICs that can do line rate with 64B packets. However, those days may be behind us. If you read Intel datasheets then the smallest packet size at which line rate is guaranteed is 64B for 10G (82599), 128B for 40G (XL710), and 256B for 100G (FM10K). (The 100G figure falls short of even performance "A".) If our performance targets for the NICs are above what they are designed for then we are probably headed for trouble. I think if we want to support really high per-port packet rates then it will take a lot of work and we will be very constrained in which hardware we can choose (both vendor and enabled features).

So, stumbling back towards the development du jour, I am tempted to initially accept the 64 Mpps per port limit observed in #1007 and focus on supporting "A" and "C". In practical terms this means spending my efforts on writing simple and CPU-efficient transmit/receive routines rather than looking for complex and CPU-expensive ways to squeeze more packets down a single card. We can always revisit the problem of squeezing the maximum packet rate out of a card in the context of specific applications (e.g. packetblaster and firehose) and there we may be able to "cheat" in some useful application-specific ways.

Early days anyway... next step is to see how the ConnectX-4 performs with simultaneous transmit+receive using generic routines. Can't take the 64 Mpps figure to the bank quite yet.

Thoughts?

lukego avatar Sep 07 '16 10:09 lukego

Nice description of the problem, and I agree with your conclusions. For most real applications you should be able to reduce "B" to "C" at the cost of additional ports. Exceptions? Artificial constraints such as "dragster-race" competitions (Internet2 land-speed records) or unrealistic customer expectations ("we only ever buy kit that does line rate even with 64-byte christmas-tree-packet workloads").

Cost of additional ports may be a problem, but that needs to be weighed against development costs as well. (You can formulate that as a time-to-market argument where you have the choice of either getting a working system now and upgrading it to the desired throughput once additional ports have gotten cheaper, or waiting until a "more efficient" system is developed that can do the same work with just one port :-)

sleinen avatar Sep 07 '16 13:09 sleinen

Relatedly: Nathan Owens pointed out to me via Twitter that the sexy Broadcom Tomahawk 32x100G switches only do line-rate with >= 250B packets. Seems to be confirmed on ipspace.net.

lukego avatar Sep 08 '16 09:09 lukego

As far as other switches go, Mellanox Spectrum can do line-rate at all packet sizes. Based on their "independent" testing, it seems Broadcom's spec is not 100% accurate, see page 11: http://www.mellanox.com/related-docs/products/tolly-report-performance-evaluation-2016-march.pdf

I haven't seen a number for Cavium Xpliant.

virtuallynathan avatar Sep 08 '16 15:09 virtuallynathan

I don't think you should go out of your way to support what seems to be a bad NIC, i.e. if it requires you to move packets to certain places, thus decreasing performance in Snabb, it's a bad move.

I want to be able to get wirespeed performance out of this by asking a vendor to produce a fast NIC and then just throwing more cores at it. If someone doesn't need wirespeed they can buy a bad/cheaper NIC (seemingly like this Mellanox) and use fewer cores.

Most importantly the decision on pps/bps should be with the end-user :)

The first 10G NICs I used didn't do much more than 5Gbps. I think it's too early in the life of 100G NICs to draw conclusions on general trends.

plajjan avatar Sep 08 '16 15:09 plajjan

Here are some public performance numbers from Mellanox: https://www.mellanox.com/blog/2016/06/performance-beyond-numbers-stephen-curry-style-server-io/

The headlines there are line-rate 64B with 25G NIC and 74.4 Mpps max on 100G. (I am told they have squeezed a bit more than this on the 100G but I haven't found a published account of that.)

Note that there are two different ASICs: "ConnectX-4" (100G) and "ConnectX-4 Lx" (10G/25G/40G/50G). If you needed more silicon horsepower per 100G, for example to do line-rate with 64B packets, maybe combining 4x25G NICs would be a viable option? (Is that likely to cause interop issues with 100G ports on switches/routers in practice?)
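
For the arithmetic behind that idea (assuming the usual 24 bytes of per-packet overhead on the wire, i.e. CRC + preamble + inter-frame gap):

def line_rate_mpps(gbps, payload_bytes=60):
    # 60B payload + 4B CRC = 64B frame, plus 20B preamble/gap = 84B on the wire.
    return gbps * 1e9 / ((payload_bytes + 24) * 8) / 1e6

print(line_rate_mpps(25))      # ~37.2 Mpps per 25G port with 64B frames
print(4 * line_rate_mpps(25))  # ~148.8 Mpps aggregate, i.e. 100GbE 64B line rate

(The 74.4 Mpps figure quoted above is exactly half of that 148.8 Mpps line rate.)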

lukego avatar Sep 08 '16 18:09 lukego

I tested the ConnectX-4 with every packet size 60..1500 and at both 3.5 GHz and 2.0 GHz.

rplot01

Whaddayareckon?

lukego avatar Sep 08 '16 20:09 lukego

Interesting graph. Is it some fixed buffer size that leads to the plateaus?

plajjan avatar Sep 09 '16 08:09 plajjan

Good question. It looks like the size of each packet is effectively being rounded up to a multiple of 64. I wonder what would cause this?

Suspects to eliminate:

  • Software.
  • DMA.
  • Ethernet MAC/PHY.

lukego avatar Sep 09 '16 09:09 lukego

DMA/PCIe

I would really like to extend our PMU support to also track "uncore" counters like PCIe/RAM/NUMA activity. This way we could include all of those values in the data sets.

Meanwhile I created a little table by hand. This shows the PCIe activity on both sides of the first four distinct plateaus.

Mpps  PacketSize   PCIeRdCur (M)  DRd (M)  PCIeRead (GB)  PCIeWrite (GB)
37    190          1323           355      107            0.082
37    250          1656           356      128            0.082

30    260          1592           277      120            0.066
30    316          1591           284      120            0.066

25    320          1311           232       99            0.054
25    380          1529           232      112            0.054

21    384          1313           199       97            0.046
21    440          1512           204      110            0.046

This is based on me snipping bits from the output of the Intel Performance Counter Monitor tool. I found some discussion of its output here.

Here is a very preliminary idea of how I am interpreting these columns:

  • Mpps: Approximate packet rate of the plateau.
  • PacketSize: Bytes per packet (+CRC).
  • PCIeRdCur (M): Millions of 64B cache lines fetched from memory via PCIe.
  • DRd (M): Millions of 64B cache lines fetched from L3 cache via DDIO.
  • PCIeRead (GB) and PCIeWrite (GB): Total data read by NIC / written by NIC over PCIe. (Docs seem to say that this is Gigabytes but the numbers only make sense to me as Gigabits.)
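
One quick consistency check on this interpretation: treating the GB column as 10^9 bytes, PCIeRead lines up almost exactly with (PCIeRdCur + DRd) * 64B on every row, so at least these three columns agree with each other:

# Rows from the table above: (PacketSize, PCIeRdCur (M), DRd (M), PCIeRead (GB)).
rows = [
    (190, 1323, 355, 107), (250, 1656, 356, 128),
    (260, 1592, 277, 120), (316, 1591, 284, 120),
    (320, 1311, 232,  99), (380, 1529, 232, 112),
    (384, 1313, 199,  97), (440, 1512, 204, 110),
]
for size, rdcur, drd, pcie_read in rows:
    predicted = (rdcur + drd) * 1e6 * 64 / 1e9   # cache lines -> 10^9 bytes
    print(f"{size}B: predicted {predicted:.0f} vs reported {pcie_read}")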

How to interpret this? In principle it seems tempting to blame the "64B-wide plateau" issue on DMA if it is fetching data in 64B cache lines. Trouble is that then I would expect to see the same level of PCIe traffic for both sides of the plateau -- and probably with PCIe bandwidth maxed out at 128Gbps (PCIe 3.0 x16 slot). However, in most cases it seems like PCIe bandwidth is not maxed out and the right-hand side of the plateau is transferring more data.

So: no smoking gun from looking at PCIe performance counters.

Ethernet MAC/PHY

I have never really looked closely at the internals of Layer-2 and Layer-1 on Ethernet. Just a few observations from reading Wikipedia though: 100GbE uses 64b/66b encoding, with the 64-bit blocks distributed across four lanes.

So as a wild guess it seems possible that 100GbE would show some effects at 32-byte granularity (64 bits * 4 channels) based on the physical transport. However, this would only be 1/2 of the plateau size, and I don't know whether this 64-bit/4-channel grouping is visible in the MAC layer or is just an internal detail of the physical layer.

I am running a test on a 10G ConnectX-4 NIC now just out of curiosity. If this showed plateaus with 1/4 the width then it may be reasonable to wonder if the issue is PHY related (10GbE also uses 64b/66b but via only one channel).

lukego avatar Sep 09 '16 11:09 lukego

Probably want to look at the L3 miss rate and/or DDR Rd counters as it steps.

Gen3 x16 will max out ~ 112Gbps after the encoding overhead in practice.

fmadio avatar Sep 09 '16 11:09 fmadio

I don't think 64b/66b has anything to do with this. That's just avoiding certain bit patterns on the wire and happens real close to the wire, nor do I think it's related to the AUI interface (which I assume you are referring to).

Doesn't the NIC copy packets from RAM to some little circular transmit buffer just before it sends them out? Is that buffer carved up in 64 byte slices?

plajjan avatar Sep 09 '16 11:09 plajjan

Yeah, 64/66 encoding is not connected to this at all; there's absolutely no flow control when you're at that level, it's 103.125 Gbps or 0 Gbps with nothing in between.

There should be some wide FIFOs before transferring to the CMAC and down the wire, but even then it should be at least a 64B-wide (read: 512-bit) interface, which means 512b x say 250 MHz -> 128 Gbps. More importantly that would affect the PPS rate, which even at 128B packets would clock in at 125 Mpps (2 clocks @ 250 MHz). My money is on L3/LLC or UnCore or QPI cache/request counts.

fmadio avatar Sep 09 '16 12:09 fmadio

@fmadio Yes, this sounds like the most promising line of inquiry now: Can we explain the performance here, including the plateau every 64B, in terms of the way the memory subsystem is serving DMA requests. And if the memory subsystem is the bottleneck then can we improve its performance e.g. by serving more requests from L3 cache rather than DRAM.

Time for me to read the Intel Uncore Performance Monitoring Reference Manual...

lukego avatar Sep 09 '16 19:09 lukego

Yup, it's probably QPI / L3 / DDR somewhere, somehow. Assuming the Tx payloads are unique memory locations, the plateau is the PCIe requestor hitting a 64B line somewhere, and the drop is the additional latency to fetch the next line, probably Uncore -> QPI -> LLC/L3. Note that the PCIe EP on the Uncore does not do any prefetching such as the CPU's DCU streamer, thus it's a cold hard miss... back to the fun days of CPUs with no cache!

If you really want to dig into it I suggest getting a PCIe sniffer, but those things are damn expensive :(

fmadio avatar Sep 10 '16 02:09 fmadio

@fmadio Great thoughts, keep 'em coming :).

I added a couple of modeled lines based on your suggestions:

  • Max 100GbE showing the theoretical maximum packet rate, based on the notion that the NIC will always transmit at 100Gbps & the MAC will add 24 bytes of per-packet overhead (CRC + Preamble + Gap).
  • Max PCIe/MLX4 showing the expected PCIe bandwidth limit, based on the notion that PCIe is transferring cache lines at 112Gbps and ConnectX-4 has one cache line per packet of overhead (64B transmit descriptor).
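
For reference, here is a sketch of those two model curves as just described (treating the NIC as always transmitting at 100 Gbps with 24B of MAC overhead per packet, and PCIe as moving whole 64B cache lines at 112 Gbps plus one descriptor line per packet):

import math

def max_100gbe_mpps(size):
    # Wire model: 100 Gbps, plus 24B per packet (CRC + preamble + inter-frame gap).
    return 100e9 / ((size + 24) * 8) / 1e6

def max_pcie_mlx4_mpps(size):
    # PCIe model: ~112 Gbps of 64B cache lines, plus one extra line per packet
    # for the transmit descriptor.
    lines = math.ceil(size / 64) + 1
    return 112e9 / (lines * 64 * 8) / 1e6

for size in (60, 128, 256, 512, 1500):
    print(size, round(max_100gbe_mpps(size), 1), round(max_pcie_mlx4_mpps(size), 1))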

Here is how it looks (click to zoom):

rplot02

This looks in line with the theory of a memory/uncore bottleneck:

  • Cache line granularity explains the width and alignments of the plateaus.
  • The performance curve would only smooth out where it reaches line rate, but it never does.
  • Cache lines are not being delivered to the NIC fast enough to keep the transmitter busy.

One more interesting perspective is to change the Y-axis from Mpps to % of line rate:

rplot03

Looks to me like:

  • We are delivering ~80 Gbps of packet-data cache lines to the NIC.
  • If the packet size is a multiple of 64B then all the transferred data can be sent onto the wire, but otherwise part of the last cache line is not used and throughput drops.
  • There are a couple of sweet-spots around 256B where throughput reaches 85% of line rate. Perhaps more of the cache lines were served from L3 cache vs RAM here.
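
As a rough cross-check (assuming, hypothetically, that the first four plateaus correspond to 4-7 cache lines transferred per packet), a flat ~80 Gbps cache-line budget lands within a few Mpps of the measured plateau rates:

def plateau_mpps(lines, gbps=80):
    # Packets per second if 'gbps' of whole 64B cache lines is the bottleneck.
    return gbps * 1e9 / 8 / (lines * 64) / 1e6

for lines, measured in ((4, 37), (5, 30), (6, 25), (7, 21)):
    print(f"{lines} lines/packet: model {plateau_mpps(lines):.1f} Mpps, measured ~{measured} Mpps")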

So the next step is to work out how to keep the PCIe pipe full with cache lines and break the 80G bottleneck.

lukego avatar Sep 10 '16 17:09 lukego

Cool, one thing I totally forgot is that 112Gbps is PCIe posted-write bandwidth. As our capture device is focused on writes to DDR I have not tested what the max DDR read bandwidth would be; it's quite possible the system runs out of PCIe tags, at which point peak read bandwidth would suffer.

Probably the only way to prefetch data into the L3 is via the CPU, but that assumes the problem is an L3 / DDR miss and not something else. Would be interesting if you limit the Tx buffer addresses to be < total L3 size, e.g. is the problem an L3 -> DDR miss or something else?

fmadio avatar Sep 11 '16 02:09 fmadio

Also, for the Max PCIe/MLX4 green line: looks like you're off by one 64B line somehow?

fmadio avatar Sep 11 '16 02:09 fmadio

This is an absolutely fascinating problem. Can't put it down :).

@fmadio Great info! So on the receive path the NIC uses "posted" (fire and forget) PCIe operations to write packet data to memory but on the transmit path it uses "non-posted" (request/reply) operations to read packet data from memory. So the receive path is like UDP but the transmit path is more like TCP where performance can be constrained by protocol issues (analogous to window size, etc).

I am particularly intrigued by the idea of "running out of PCIe tags." If I understand correctly the number of PCIe tags determines the maximum number of parallel requests. I found one PCIe primer saying that the typical number of PCIe tags is 32 (but can be extended up to 2048).

Now I am thinking about bandwidth delay products. If we know how much PCIe bandwidth we have (~220M 64B cache-lines per second for 112Gbps) and we know how many requests we can make in parallel (32 cache lines) then we can calculate the point at which latency will impact PCIe throughput:

delay  =  parallel / bandwidth  =  32 / 220M per sec  =  146 nanoseconds

So the maximum (average) latency we could tolerate for PCIe-rate would be 146 nanoseconds per cache line under these assumptions.

Could this be the truth? (Perhaps with slightly tweaked constants?) Is there a way to check without a hardware PCIe sniffer?

I made a related visualization. This shows nanoseconds per packet (Y-axis) based on payload-only packet size in cache lines (X-axis). The black line is the actual measurements (same data set as before). The blue line is a linear model that seems to fit the data very well.

rplot05

The slope of the line says that each extra cache line of data costs an extra 6.6 nanoseconds. If we assumed that 32 reads are being made in parallel then the actual latency would be 211 nanoseconds. Comparing this with the calculated limit of 146 nanoseconds for PCIe line rate we would expect to achieve around 70% of PCIe line rate.
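
The arithmetic behind those two numbers, for the record (assuming 112 Gbps of 64B cache lines and 32 outstanding reads):

line_rate = 112e9 / 8 / 64            # ~219M cache lines per second over PCIe
tags = 32                             # assumed number of parallel read requests

latency_budget = tags / line_rate     # max average latency that still fills the pipe
implied_latency = tags * 6.6e-9       # 6.6 ns/line slope * 32 requests in flight

print(latency_budget * 1e9)                # ~146 ns
print(implied_latency * 1e9)               # ~211 ns
print(latency_budget / implied_latency)    # ~0.69, i.e. ~70% of PCIe line rate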

This is a fairly elaborate model but it seems worth investigating because the numbers all seem to align fairly well to me. If this were the case then it would have major implications i.e. that the reason for all this fussing about L3 cache and DDIO is due to under-dimensioned PCIe protocol resources on the NIC creating artificially tight latency bounds on the CPU.

(Relatedly: The Intel FM10K uses two PCIe x8 slots ("bifurcation") instead of one PCIe x16 slot. This seemed awkward to me initially but now I wonder if it was sound engineering to provision additional PCIe protocol/silicon resources that are needed to achieve 100G line rate in practice? This would put things into a different light.)

lukego avatar Sep 11 '16 12:09 lukego

Do I understand correctly that these are full duplex receive and transmit tests, and that they are being limited by the transmit side because of the non-posted semantics of the way the NIC is using the PCIe bus?

wingo avatar Sep 11 '16 12:09 wingo

No and maybe, in that order ;-). This is transmit-only (packetblaster) and this root cause is not confirmed yet, just the idea du jour.

Some more details of the setup over in #1007.

lukego avatar Sep 11 '16 12:09 lukego

The NIC probably has to use non-posted requests here - a read request needs a reply to get the data - but maybe it needs to make twice as many requests in parallel to achieve robust performance.

lukego avatar Sep 11 '16 13:09 lukego

@wingo A direct analogy here is if your computer couldn't take advantage of your fast internet connection because it had an old operating system that advertises a small TCP window. Then you would only reach the advertised speed if latency is low e.g. downloading from a mirror at your ISP. Over longer distances there would not be enough packets in flight to keep the pipe full.

Anyway, just a theory, fun if it were true...

lukego avatar Sep 11 '16 13:09 lukego

A few things.

  1. Pretty much all devices support "PCIe Extended Tags" which add a few more bits so you can have a lot more transactions in flight at any one time. E.g. think about GPUs reading crap from system memory... nvidia, intel & co have a lot of smart ppl working on this.

  2. In practice you'll run out of PCIe credits first. This is a flow control / throttling mechanism that allows the PCIe UnCore to throttle the data rate, so the UnCore never drops a request. For both Posted & Non-Posted requests, it gets split further into credits for headers and credits for data.

  3. Latency is closer to 500ns RTT last time I checked, putting it at half that going one way. Keep in mind for non-posted reads from system DDR it's a request to the PCIe UnCore, then a response, so full RTT is more appropriate. Of course these are fully pipelined requests so 211ns sounds close.

For 100G packet capture we don't care about latency much, just maximum throughput, thus I haven't dug around there much. We'll add full nano-accurate 100G line-rate PCAP replay in a few months, at which point latency and maximum non-posted read bandwidth become important.

  4. All of this is pretty easy to test with an FPGA. Problem is I don't have time to mess around with this at the moment.

fmadio avatar Sep 12 '16 00:09 fmadio

@fmadio Thanks for the info! I am struck that "networks are networks" and all these PCIe knobs seem to have direct analogies in TCP. "Extended tags" is window scaling, "credits" is advertised window, bandwidth*delay=parallel constraint is the same. Sensible defaults change over time too e.g. you probably don't want to use Windows XP default TCP settings for a 1 Gbps home internet connection. (Sorry, I am always overdoing analogies.)

So based on the info from @fmadio it sounds like my theory from the weekend may not actually fit the constants but let's go down the PCIe rabbit hole and find out anyway.

I have poked around the PCIe specification and found a bunch of tunables but no performance breakthrough yet.

Turns out that lspci can tell us a lot about how the device is set up:

# lspci -s 03:00.0 -vvvv
...
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
        MaxPayload 256 bytes, MaxReadReq 512 bytes
...

Observations:

  • ExtTag (PCIe extended tags) is supported by the device in DevCap.
  • ExtTag is however disabled in DevCtl.
  • More interesting-looking tunables show up in DevCap/DevCtl: MaxPayload and MaxReadReq.

I have tried a few different settings (e.g. ExtTag+ and MaxPayload=512 and MaxReadReq=4096) but I have not observed any impact on throughput.
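
For reference, here is a hypothetical little helper (assuming the device address above, 03:00.0) that just greps the lspci dump for the two ExtTag flags, so it is easy to see whether a changed setting actually sticks:

import re
import subprocess

def exttag_flags(bdf="03:00.0"):
    # lspci -vvv prints ExtTag twice: first under DevCap (supported),
    # then under DevCtl (actually enabled). Needs root for the full dump.
    out = subprocess.run(["lspci", "-s", bdf, "-vvv"],
                         capture_output=True, text=True, check=True).stdout
    return re.findall(r"ExtTag([+-])", out)

supported, enabled = exttag_flags()[:2]
print("ExtTag supported:", supported == "+")
print("ExtTag enabled: ", enabled == "+")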

I would like to check if we are running out of "credits" and that is limiting parallelism. I suppose this depends on how much buffer space the processor uncore is making available to the device. Guess the first place to look is the CPU datasheet.

I suppose that it would be handy to have a PCIe sniffer at this point. Then we could simply observe the number of parallel requests that are in flight. I wonder if there is open source Verilog/VHDL code for a PCIe sniffer? I could easily be tempted to buy a generic FPGA for this kind of activity but a single-purpose PCIe sniffer seems like overkill. Anyway - I reckon we will be able to extract enough information from the uncore performance counters in practice.

BTW lspci continues with more parameters that may also include something relevant:

DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
         Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-

lukego avatar Sep 12 '16 10:09 lukego

Just a note about the server that I am using for testing here (lugano-3.snabb.co):

  • CPU: E5-1650v3 (6 cores @ 3.5 GHz, 15MB L3 cache)
  • NIC: 2 x ConnectX4 100GbE (each in a separate PCIe 3.0 x16 slot)
  • RAM: 4 x 8GB DDR4 (2133 MHz)

It could be that we learn interesting things by testing both 100G ports in parallel. Just briefly, I tested with 60B and 1500B packets. In both cases the traffic is split around 50/50 between ports. At 1500B I see an aggregate 10.75 Mpps (well above the single-port rate of ~6.3 Mpps) and at 60B I see an aggregate 76.2 Mpps (only modestly above the single-port rate of 68 Mpps).
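
For reference, the scaling factors implied by those numbers:

# Aggregate dual-port rate vs. the single-port rate quoted above.
for size, dual, single in ((1500, 10.75, 6.3), (60, 76.2, 68.0)):
    print(f"{size}B: {dual / single:.2f}x over one port")
# 1500B scales ~1.71x with the second port; 60B only ~1.12x.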

lukego avatar Sep 12 '16 10:09 lukego

HDL these days is almost entirely packet based; all the flow control and processing inside those fancy ASICs are mostly packet based. So all the same algos are there, different names and formats, but a packet is still a packet regardless of whether it contains a TCP header or a QPI header.

Surprised the device shows up as x16; means you've got a PLX chip there somewhere acting as a bridge. It should be 2 separate and distinct PCIe devices.

You can't just make a PCIe sniffer; a bridge would be easier. You realize an oscilloscope capable of sampling PCIe3 signals will cost $100-$500K? Those things are damn expensive. A PCIe sniffer will "only" cost a meager $100K+ USD.

On the FPGA side, monitoring the credits is pretty trivial. I forget if the Intel PCM kit has anything about PCIe credits or monitoring Uncore PCIe FIFO sizes. It's probably there somewhere, so if you find something it would be very cool to share.

fmadio avatar Sep 12 '16 12:09 fmadio

@fmadio ucevent has a mouth-watering list of events and metrics. I am working out which ones are actually supported on my processor and trying to untangle CPU/QPI/PCIe ambiguities. Do any happen to catch your eye? This area is obscure enough that googling doesn't yield much :-).

lukego avatar Sep 12 '16 12:09 lukego

That is incredibly shiny.

kbara avatar Sep 12 '16 13:09 kbara

wow, I'm blinded by the shininess, very cool.

R2PCIe.* looks interesting; would have to read the manual to work out what each actually means.

fmadio avatar Sep 12 '16 13:09 fmadio

Coming full circle here for a moment, the actionable thing is that I want to decide how to write the general purpose transmit/receive routines for the driver:

  1. Optimize for CPU-efficiency.
  2. Optimize for PCIe-efficiency (w/ extra work for the CPU).
  3. Something else e.g. complicated heuristics, knobs, etc.

The default choice seems to be (1). However this may not really provide more than ~70G of dependable bandwidth. It would be nice to have more than this with a 100G NIC.

If the source of this limit could be clearly identified then it may be reasonable to work around it in software e.g. with extra memory accesses to ensure that the TX path is always served from L3 cache. However, without a clear picture this could easily do more harm than good, e.g. by taking cache resources away from the receive path that I have not benchmarked yet.

Mellanox's own code seems to be more along the lines of (3) but the motivations are not completely transparent to me. I am told that they can achieve up to 84 Mpps with 64B packets but I am not sure what this means e.g. if performance drops steeply when switching from 64B to 65B packets. (The utility of 64B packet benchmarks is when they show "worst case scenario" performance but in my tests so far this seems more like an unrepresentatively easy workload for the NIC which may be limited by I/O transfers rather than per-packet processing.)

lukego avatar Sep 13 '16 10:09 lukego