
Approximate conversion of bmv2 P4 performance to hardware performance

RithvikChuppala opened this issue 11 months ago · 4 comments

I know that bmv2's performance is not production-grade and that there are a lot of hardware-dependent factors, but is there some approximate conversion factor or methodology to gauge the relative performance of packet programs going from bmv2's software switch to a production hardware switch? Something like clock cycles, CPU utilization, etc.?

RithvikChuppala · Mar 13 '24 20:03

It really depends upon the production switch you have in mind.

For example, Tofino's hardware architecture is such that at a basic introductory level you can say it has the following performance model:

  • If your P4 program fits into one pass, it operates at X billion packets per second of throughput, guaranteed
  • If your P4 program does not fit into one pass, it operates at 0 packets per second of throughput, guaranteed

Now of course you can get more nuanced than that by allowing P4 programs that explicitly recirculate packets, which gives you other operating points like this:

  • If your P4 program fits into K passes, it operates at X billion packets per second, but with only a fraction 1/K of the ports usable for traffic, the rest being dedicated as recirculation ports
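As a back-of-the-envelope illustration of that model, here is a minimal sketch (the pipeline rate and port count are made-up placeholder numbers, not figures from any datasheet):

```python
# Toy model of a run-to-completion pipeline that handles K passes by
# dedicating a share of ports (and bandwidth) to recirculation.
# All constants are illustrative assumptions, not vendor numbers.

PIPELINE_RATE_BPPS = 2.0   # assumed pipeline capacity, billion packets/sec
TOTAL_PORTS = 32           # assumed number of front-panel ports

def effective_capacity(passes_needed: int):
    """Return (usable ports, aggregate throughput in billion pkts/sec)."""
    if passes_needed < 1:
        raise ValueError("a program needs at least one pass")
    # Each additional pass re-consumes pipeline bandwidth, so only 1/K of
    # the ports can carry external traffic; the rest recirculate.
    usable_ports = TOTAL_PORTS // passes_needed
    throughput = PIPELINE_RATE_BPPS / passes_needed
    return usable_ports, throughput

for k in (1, 2, 4):
    ports, bpps = effective_capacity(k)
    print(f"{k} pass(es): {ports} usable ports, ~{bpps:.2f} Bpps aggregate")
```

The point is that the curve is a step function: you either fit within the budget at full rate, or you pay an integer-factor penalty.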

There are other hardware architectures where the performance will degrade more gradually than that, if you go "a little bit over" the budget of what can be done at X billion packets per second.

Some will have caches between the packet processing core and DRAM, and then cache hit rates play a huge part in the throughput and latency.
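To give a feel for how much the hit rate matters, here is a minimal sketch (the hit/miss latencies and lookups per packet are assumed placeholder values, not from any device):

```python
# Toy estimate of per-packet lookup latency for a design with an on-chip
# cache in front of DRAM. All constants are illustrative assumptions.

HIT_LATENCY_NS = 5.0      # assumed cache-hit latency per lookup
MISS_PENALTY_NS = 100.0   # assumed extra cost of going out to DRAM
LOOKUPS_PER_PACKET = 4    # assumed table lookups per packet

def per_packet_lookup_latency_ns(hit_rate: float) -> float:
    """Average memory-access time per lookup, scaled by lookups per packet."""
    avg_access = HIT_LATENCY_NS + (1.0 - hit_rate) * MISS_PENALTY_NS
    return LOOKUPS_PER_PACKET * avg_access

for rate in (0.99, 0.90, 0.50):
    print(f"hit rate {rate:.0%}: ~{per_packet_lookup_latency_ns(rate):.0f} ns per packet")
```

Even the gap between a 99% and a 90% hit rate changes the per-packet budget noticeably, which is part of why the honest answer stays "it depends".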

Sorry I can't give you a more specific answer, but if you dive at least a bit into two different-enough hardware architectures, you will start to see more of the reasons that "it depends" is the correct answer.

jafingerhut · Mar 13 '24 21:03

Thanks for the quick reply!

For my use case, I'm implementing packet-processing functionality to perform tunneling (stripping tunnel headers, adding new egress headers, etc.). I aim to show that executing this packet-processing functionality in a programmable switch improves throughput and latency compared to the normal software-based approach.
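For concreteness, the software-based decapsulation step looks roughly like this (a minimal Scapy sketch, assuming VXLAN as the tunnel format purely for illustration; the real encapsulation may differ):

```python
# Minimal sketch of the software baseline: strip a VXLAN tunnel header stack
# and recover the inner frame. VXLAN is an assumption made for illustration.
from scapy.all import Ether, IP, UDP
from scapy.layers.vxlan import VXLAN

def decapsulate(outer):
    """Strip the outer Ether/IP/UDP/VXLAN headers; return the inner frame."""
    if VXLAN not in outer:
        raise ValueError("not a VXLAN-encapsulated frame")
    return outer[VXLAN].payload  # Scapy binds the VXLAN payload to Ether

# Build an example encapsulated frame, then strip it.
inner = Ether(dst="aa:bb:cc:dd:ee:ff") / IP(dst="10.0.0.2") / UDP(dport=1234)
outer = Ether() / IP(dst="192.0.2.1") / UDP(dport=4789) / VXLAN(vni=42) / inner
print(decapsulate(outer).summary())
```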

However, since bmv2 isn't an accurate representation of hardware performance, what proxy metric for ideal hardware performance do you think makes the most sense?

RithvikChuppala · Mar 13 '24 21:03

If you can, the best approach by far is to implement it and measure the relevant performance metrics on a real hardware device.

If for some reason that is not possible, then the next best thing is to learn about some hardware device in enough detail that you can make a good educated guess at what the performance metrics would be.

jafingerhut · Mar 13 '24 23:03

This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 180 days.

github-actions[bot] · Sep 10 '24 00:09