quinn icon indicating copy to clipboard operation
quinn copied to clipboard

feat: Faster UDP/IO on Apple platforms

Open larseggert opened this issue 1 year ago • 23 comments

This uses Apple's private sendmsg_x and recvmsg_x system calls for multi-packet UDP I/O.

CC @mxinden

larseggert avatar Sep 20 '24 10:09 larseggert

Is there interest in seeing TX support via sendmsg_x?

We found there wasn't much performance benefit, and was considerable difficulty taking advantage of, sendmmsg-style batching. IIRC the _x functions on macOS have more to offer than that, though. Will this unblock segmentation offload or other incidental optimizations?

Ralith avatar Sep 20 '24 19:09 Ralith

Bench on main:

test large_data_10_streams  ... bench:  27,558,791 ns/iter (+/- 13,459,810) = 380 MB/s
test large_data_1_stream    ... bench:  24,324,266 ns/iter (+/- 19,219,937) = 43 MB/s
test small_data_100_streams ... bench:  19,437,900 ns/iter (+/- 20,065,941)
test small_data_1_stream    ... bench:  11,465,128 ns/iter (+/- 8,699,934)

Bench with this PR:

test large_data_10_streams  ... bench:  28,829,216 ns/iter (+/- 15,924,956) = 363 MB/s
test large_data_1_stream    ... bench:  14,354,999 ns/iter (+/- 20,039,122) = 73 MB/s
test small_data_100_streams ... bench:  14,061,741 ns/iter (+/- 17,311,517)
test small_data_1_stream    ... bench:  19,194,441 ns/iter (+/- 5,012,070)

Surprised that large_data_10_streams and small_data_1_stream are slower...

larseggert avatar Sep 23 '24 06:09 larseggert

Those tests tend to be extremely noisy, as the huge variance suggests. A targeted quinn-udp benchmark might be more useful.

Ralith avatar Sep 23 '24 06:09 Ralith

We've also found on neqo that multi-packet RX without multi-packet TX has limited benefits, since the RX batch size will be very small.

larseggert avatar Sep 23 '24 06:09 larseggert

I added sendmsg_x support, mostly to see what the performance difference would be. But it seems that none of the benches or tests call send with a Transmit struct where segment_size is not None?

larseggert avatar Sep 23 '24 07:09 larseggert

A targeted quinn-udp benchmark might be more useful.

How about using the throughput.rs benchmark @larseggert?

https://github.com/quinn-rs/quinn/blob/main/quinn-udp/benches/throughput.rs

mxinden avatar Sep 23 '24 09:09 mxinden

With @mxinden's benchmark. Baseline:

gso_true/throughput     time:   [58.076 ms 58.230 ms 58.387 ms]
                        thrpt:  [171.27 MiB/s 171.73 MiB/s 172.19 MiB/s]

Only sendmsg_x:

gso_true/throughput     time:   [15.143 ms 15.189 ms 15.236 ms]
                        thrpt:  [656.35 MiB/s 658.36 MiB/s 660.37 MiB/s]
                 change:
                        time:   [-74.028% -73.915% -73.808%] (p = 0.00 < 0.05)
                        thrpt:  [+281.80% +283.36% +285.04%]
                        Performance has improved.

Both sendmsg_x and recvmsg_x:

gso_true/throughput     time:   [12.632 ms 12.682 ms 12.731 ms]
                        thrpt:  [785.46 MiB/s 788.53 MiB/s 791.61 MiB/s]
                 change:
                        time:   [-78.321% -78.221% -78.112%] (p = 0.00 < 0.05)
                        thrpt:  [+356.88% +359.16% +361.27%]
                        Performance has improved.

Both sendmsg_x and recvmsg_x with BATCH_SIZE of 64:

gso_true/throughput     time:   [11.640 ms 11.682 ms 11.725 ms]
                        thrpt:  [852.85 MiB/s 856.00 MiB/s 859.07 MiB/s]
                 change:
                        time:   [-80.030% -79.938% -79.844%] (p = 0.00 < 0.05)
                        thrpt:  [+396.13% +398.45% +400.75%]
                        Performance has improved.

larseggert avatar Sep 23 '24 12:09 larseggert

No. They are the equivalent of the mmsg Linux calls. AFAIK Apple doesn't have GSO/GRO via the socket interface.

larseggert avatar Sep 24 '24 04:09 larseggert

Are you waiting on anything from me on this?

larseggert avatar Oct 01 '24 16:10 larseggert

@Ralith can you do another round on this one?

djc avatar Oct 09 '24 14:10 djc

Once @mxinden's fix to the bench is in, I will rebase and squash this PR.

larseggert avatar Oct 10 '24 05:10 larseggert

Hey guys 👋, is there any chance Apple won't approve apps that are using those private syscalls in the App Store? They are notorious for doing so and even de-listing apps for using anything "undocumented". See 2.5.1 here: https://developer.apple.com/app-store/review/guidelines/

See one of such cases here: https://9to5mac.com/2019/11/04/electron-app-rejections/

How they will find out? Apple employs automated tools to scan apps for the usage of private APIs. If sendmsg_x and recvmsg_x are detected, the app is at risk of being flagged.

AndrewDryga avatar Oct 10 '24 17:10 AndrewDryga

The use of the private syscalls is now behind a non-default feature.

larseggert avatar Oct 10 '24 17:10 larseggert

@larseggert should we add a big fat warning saying that if you enable this flag you will violate Apple ToS so it's only should be enabled if app is not distributed via App Store (or notarized for EU)?

AndrewDryga avatar Oct 10 '24 17:10 AndrewDryga

I have no opinion here - we don't distribute via the App Store.

Could you make a suggestion on what you think would be good to add?

larseggert avatar Oct 10 '24 17:10 larseggert

The use of the private syscalls is now behind a non-default feature.

Thank you for making it optional! It can be an issue to toggle code-paths like these with cargo features because they get aggregated across the entire dependency tree. Simply adding another dependency that also happens to depend on quinn-udp can thus activate this silently. (As a rule of thumb, cargo features should only add new APIs, not modify existing ones so this doesn't happen.)

For some prior art, curve25519-dalek had to solve similar challenges: https://github.com/dalek-cryptography/curve25519-dalek/tree/main/curve25519-dalek#backends

It looks like it is already abstracted away pretty well. Could we change this to a "plain" rustc cfg that needs to be set at build-time? I am happy for it to be opt-out. Most apps using this probably aren't in the app store and toggling it off is a one-liner in the build process. Ultimately, I got no opinion though on opt-in/opt-out.

thomaseizinger avatar Oct 10 '24 20:10 thomaseizinger

Simply adding another dependency that also happens to depend on quinn-udp can thus activate this silently.

Ignoring the case where the original activation is unintentional, if that other dependency can use quinn-udp with sendmsg_x and recvmsg_x, I don't see a reason why other consumers of quinn-udp within the same binary, shouldn't use sendmsg_x and recvmsg_x. In other words, I don't think this needs to be optional per consumer of quinn-udp, but instead only optional for the entire binary. Thus I think a feature flag suffices.

Could we change this to a "plain" rustc cfg that needs to be set at build-time?

I might be missing something. Doesn't the same concern apply both to plain cargo features and rustc cfgs? Aren't rustc cfgs also global, as in apply to the entire binary?

If so, I would opt for the, in my eyes, more idiomatic cargo feature instead of the less intuitive rustc cfg.

mxinden avatar Oct 11 '24 08:10 mxinden

Could we change this to a "plain" rustc cfg that needs to be set at build-time?

I might be missing something. Doesn't the same concern apply both to plain cargo features and rustc cfgs? Aren't rustc cfgs also global, as in apply to the entire binary?

If so, I would opt for the, in my eyes, more idiomatic cargo feature instead of the less intuitive rustc cfg.

The cfg flag is set at build time. Cargo features are propagated through the dependency graph so that if you add a dependency to your binary that has a feature-enabled dependency on one of your dependencies, that feature now also gets enabled for your project. This can "silently" happen without you noticing, so it's a bit of a footgun.

djc avatar Oct 11 '24 08:10 djc

The cfg flag is set at build time.

I don't understand why this is relevant. Both cargo features and rustc cfgs happen at build time, i.e. not at runtime.

Cargo features are propagated through the dependency graph so that if you add a dependency to your binary that has a feature-enabled dependency on one of your dependencies, that feature now also gets enabled for your project.

Correct. My argument above is, that this isn't a problem here. I argue that if it is safe for one consumer of quinn-udp to use sendmsg_x and recvmsg_x, it is safe for all consumers of quinn-udp within the same binary to use sendmsg_x and recvmsg_x.

Can you think of a scenario where that is not the case?

This can "silently" happen without you noticing, so it's a bit of a footgun.

In this particular case, if one doesn't trust one's dependency to make wise choices on their quinn-udp feature selection, I don't think one should use that dependency in the first place.

I am happy for it to be opt-out.

I suggest for it to be opt-in for now, given that we have little to no experience with these APIs, nor are we aware of any other applications using these APIs.

mxinden avatar Oct 11 '24 08:10 mxinden

The cfg flag is set at build time.

I don't understand why this is relevant. Both cargo features and rustc cfgs happen at build time, i.e. not at runtime.

Because cfg flags are set from the command/environment where the end artifact is built, whereas Cargo features can be set by other dependencies which often aren't under your control.

Cargo features are propagated through the dependency graph so that if you add a dependency to your binary that has a feature-enabled dependency on one of your dependencies, that feature now also gets enabled for your project.

Correct. My argument above is, that this isn't a problem here. I argue that if it is safe for one consumer of quinn-udp to use sendmsg_x and recvmsg_x, it is safe for all consumers of quinn-udp within the same binary to use sendmsg_x and recvmsg_x.

Can you think of a scenario where that is not the case?

This can "silently" happen without you noticing, so it's a bit of a footgun.

In this particular case, if one doesn't trust one's dependency to make wise choices on their quinn-udp feature selection, I don't think one should use that dependency in the first place.

I don't think we're going to solve the crates.io supply chain issues here, and I think not everyone is as careful as Mozilla as vetting their dependencies (and any updates happening to them over time). As such, IMO it is a best practice that additional functionality enabled via features should additionally be toggled by API choices. On the other hand, (a) I'm not sure how ergonomic that would be in this case, (b) I don't like the ergonomics of cfg as an alternative.

I am happy for it to be opt-out.

I suggest for it to be opt-in for now, given that we have little to no experience with these APIs, nor are we aware of any other applications using these APIs.

This seems reasonable to me.

djc avatar Oct 11 '24 09:10 djc

The cfg flag is set at build time.

I don't understand why this is relevant. Both cargo features and rustc cfgs happen at build time, i.e. not at runtime.

Because cfg flags are set from the command/environment where the end artifact is built, whereas Cargo features can be set by other dependencies which often aren't under your control.

Thank you for expanding on this. That makes sense.

mxinden avatar Oct 11 '24 09:10 mxinden

See 4f86c5058740a9295b74f87c8cab6206a60e9971. Let me know if I misunderstood something.

With this change, how will CI cover this code? AFAIK cargo hack only iterates over feature permutations.

larseggert avatar Oct 11 '24 10:10 larseggert

See 4f86c50. Let me know if I misunderstood something.

Discussed this with @Ralith privately, we feel that the use of cfg has a negative impact on the ergonomics and the chance of issues arising from silent feature enabling is pretty limited in this case, so would prefer to stick with a feature for this for now.

djc avatar Oct 14 '24 08:10 djc

Sorry it took a while to get the benchmark PR merged, can you rebase on top of main?

djc avatar Oct 24 '24 10:10 djc

Shouldn't the features CI workflow have included fast-apple-datapath? I don't see it in the logs. (Or I am missing something.)

larseggert avatar Oct 24 '24 12:10 larseggert

Should be fine I believe:

running `cargo check --no-default-features --features direct-log,fast-apple-datapath,log,tracing` on quinn-udp (1754/1772)
running `cargo check --no-default-features --features default` on quinn-udp (1755/1772)
running `cargo check --no-default-features --features direct-log` on quinn-udp (1756/1772)
running `cargo check --no-default-features --features default,direct-log` on quinn-udp (1757/1772)
running `cargo check --no-default-features --features fast-apple-datapath` on quinn-udp (1758/1772)
running `cargo check --no-default-features --features default,fast-apple-datapath` on quinn-udp (1759/1772)
running `cargo check --no-default-features --features direct-log,fast-apple-datapath` on quinn-udp (1760/1772)
running `cargo check --no-default-features --features default,direct-log,fast-apple-datapath` on quinn-udp (1761/1772)
running `cargo check --no-default-features --features log` on quinn-udp (1762/1772)
running `cargo check --no-default-features --features direct-log,log` on quinn-udp (1763/1772)
running `cargo check --no-default-features --features fast-apple-datapath,log` on quinn-udp (1764/1772)
running `cargo check --no-default-features --features direct-log,fast-apple-datapath,log` on quinn-udp (1765/1772)
running `cargo check --no-default-features --features tracing` on quinn-udp (1766/1772)
running `cargo check --no-default-features --features direct-log,tracing` on quinn-udp (1767/1772)
running `cargo check --no-default-features --features fast-apple-datapath,tracing` on quinn-udp (1768/1772)
running `cargo check --no-default-features --features direct-log,fast-apple-datapath,tracing` on quinn-udp (1769/1772)
running `cargo check --no-default-features --features log,tracing` on quinn-udp (1770/1772)
running `cargo check --no-default-features --features direct-log,log,tracing` on quinn-udp (1771/1772)
running `cargo check --no-default-features --features fast-apple-datapath,log,tracing` on quinn-udp (1772/1772)

mxinden avatar Oct 24 '24 12:10 mxinden

For those interested in using the fast-apple-datapath feature:

  • All MacOS Firefox Nightly users are now using it.
  • We see ~11% performance improvement on our localhost CPU bound throughput benchmark.
  • Every other (50th percentile) QUIC UDP recvmsg_x call reads 2 datagrams.
  • We are soon rolling it out to all Firefox Beta users. Once that succeeds it will make it into the Firefox release channel.

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_received/explore?aggType=avg&os=Darwin&timeHorizon=WEEK&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

mxinden avatar Mar 09 '25 15:03 mxinden