
Benchmarking feedback/notes

Open · MarcoPolo opened this issue 2 years ago • 4 comments

I spent a bit of time looking at some parts of the benchmarking setup and have a few notes and comments:

  • I think we're using iperf wrong. We are using the data from the sender, but we should be looking at the receiver. Notice how the Bitrate column differs between sender and receiver in this example:
$ iperf -c 127.0.0.1 -u -b 10g
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 50191 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-1.00   sec  1.16 GBytes  10.0 Gbits/sec  38132  
[  5]   1.00-2.00   sec  1.16 GBytes  10.0 Gbits/sec  38161  
[  5]   2.00-3.00   sec  1.16 GBytes  9.99 Gbits/sec  38106  
[  5]   3.00-4.00   sec  1.17 GBytes  10.0 Gbits/sec  38184  
[  5]   4.00-5.00   sec  1.16 GBytes  10.0 Gbits/sec  38151  
[  5]   5.00-6.00   sec  1.16 GBytes  10.0 Gbits/sec  38143  
[  5]   6.00-7.00   sec  1.16 GBytes  9.99 Gbits/sec  38114  
[  5]   7.00-8.00   sec  1.16 GBytes  10.0 Gbits/sec  38165  
[  5]   8.00-9.00   sec  1.16 GBytes  9.99 Gbits/sec  38140  
[  5]   9.00-10.00  sec  1.16 GBytes  10.0 Gbits/sec  38169  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  11.6 GBytes  10.0 Gbits/sec  0.000 ms  0/381465 (0%)  sender
[  5]   0.00-10.00  sec  8.43 GBytes  7.24 Gbits/sec  0.023 ms  105024/381411 (28%)  receiver

We need to use the bitrate on the receiver side. The sender can push as much data as it wants, but for these measurements we care about the data that was actually received. Look at the difference here: https://github.com/libp2p/test-plans/actions/runs/5466146370/jobs/9950640038#step:12:29 (a small sketch of pulling the receiver-side number out of iperf3's JSON output is included after this list).

  • The hypothetical max for this use case should be 50% of the instance bandwidth (3.4 Gbps), according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html. I think it's worth linking this doc somewhere.
  • The "local" vs "remote" backends are a bit confusing. These are both running on AWS hardware. Could we consolidate them or rename them? I would suggest alternate names, but I don't really understand them.
  • What's the AMI of the short-lived module? It doesn't seem to be set, and I can't find the default.
  • Should we make sure to set the MTU to 1500? (This might not be the default)
  • Do we need to bump the UDP send window as well? I'm not sure, but it might be fine since quic-go doesn't complain about it. Any insight here @marten-seemann?
  • Can we add comments around the AMI IDs to describe them? It wasn't clear that these were the Amazon Linux AMIs.
    • Maybe include this one-liner:
aws ec2 describe-images \
    --image-ids ami-06e46074ae430fba6 \
    --query "Images[*].Description[]" \
    --output text \
    --region us-east-1
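
On the receiver-vs-sender point above, here is a minimal sketch of pulling both totals out of iperf3's machine-readable output so the receiver-side number is the one that ends up in the results. This is an assumption-laden sketch, not what the test plans do today: it assumes iperf3 is invoked with --json, jq is installed, the field names are the ones iperf3 emits for TCP runs, and $SERVER_IP is a placeholder:

# Sketch only: run a 10 s TCP test and dump the JSON summary.
iperf3 -c "$SERVER_IP" -t 10 --json > result.json

# Print the sender-side total first, then the receiver-side total.
# The receiver-side bits_per_second is the number we should report.
jq '.end.sum_sent.bits_per_second, .end.sum_received.bits_per_second' result.json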

MarcoPolo · Jul 05 '23 15:07

cc @mxinden

MarcoPolo · Jul 05 '23 15:07

Regarding iperf: Thank you for digging into this, @MarcoPolo!

It would be really useful to have an iperf run on TCP for comparison, as I asked for in my review last month. UDP and TCP shouldn't differ by much.

  • Do we need to bump the UDP send window as well? I'm not sure, but it might be fine since quic-go doesn't complain about it. Any insight here @marten-seemann?

It certainly won't hurt. quic-go forces an increase in buffer size (thanks to your PR: https://github.com/quic-go/quic-go/pull/3804) if run with sufficient permissions. I'm not sure if iperf does the same. The cost of running two sysctl commands during setup seems low, and we'll probably get a more reproducible setup. I'd recommend setting them to 10 MB each. We might need to tweak TCP window sizes (flow control, congestion control?) as well, depending on the iperf / TCP result.
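
For reference, a sketch of what those two setup commands could look like on a Linux host, assuming we settle on roughly 10 MB for the maximum socket buffer sizes (the exact values, and whether to also raise the defaults, are assumptions to be confirmed):

# Sketch: raise the maximum receive and send socket buffer sizes to ~10 MiB.
sudo sysctl -w net.core.rmem_max=10485760
sudo sysctl -w net.core.wmem_max=10485760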

marten-seemann · Jul 05 '23 17:07

Thank you @MarcoPolo!

I am sorry for the delay. I am currently focusing on https://github.com/protocol/bifrost-infra/issues/2622. I have not forgotten about this issue.

mxinden · Jul 11 '23 07:07

Documenting progress thus far:

We need to use the bitrate on the receiver side. The sender can push as much data as it wants, but for these measurements we care about the data that was actually received. Look at the difference here: https://github.com/libp2p/test-plans/actions/runs/5466146370/jobs/9950640038#step:12:29

Good call-out. Thank you. Addressed in https://github.com/libp2p/test-plans/pull/241.

The hypothetical max for this use case should be 50% of the instance bandwidth (3.4 Gbps), according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html. I think it's worth linking this doc somewhere.

I think the relevant limit in our case is the single-flow limit of 5 Gbit/s. That is also what we see in https://github.com/libp2p/test-plans/pull/276.

The "local" vs "remote" backends are a bit confusing. These are both running on AWS hardware. Could we consolidate them or rename them? I would suggest alternate names, but I don't really understand them.

I don't have an opinion on these names. If someone can come up with better names, please post here and I will change them. Until then, I'm treating this as low priority.

What's the AMI of the short-lived module? It doesn't seem to be set, and I can't find the default.

https://github.com/libp2p/test-plans/blob/483c19ca2ce48e340f76070d5b3116869dd8a0df/perf/terraform/configs/local/terraform.tf#L64-L88

Do we need to bump the UDP send window as well? I'm not sure, but it might be fine since quic-go doesn't complain about it. Any insight here @marten-seemann?

:+1: done in https://github.com/libp2p/test-plans/pull/254.

mxinden · Aug 31 '23 15:08