quinn
quinn copied to clipboard
feat: Faster UDP/IO on Apple platforms
This uses Apple's private sendmsg_x and recvmsg_x system calls for multi-packet UDP I/O.
CC @mxinden
Is there interest in seeing TX support via sendmsg_x?
We found there wasn't much performance benefit, and was considerable difficulty taking advantage of, sendmmsg-style batching. IIRC the _x functions on macOS have more to offer than that, though. Will this unblock segmentation offload or other incidental optimizations?
Bench on main:
test large_data_10_streams ... bench: 27,558,791 ns/iter (+/- 13,459,810) = 380 MB/s
test large_data_1_stream ... bench: 24,324,266 ns/iter (+/- 19,219,937) = 43 MB/s
test small_data_100_streams ... bench: 19,437,900 ns/iter (+/- 20,065,941)
test small_data_1_stream ... bench: 11,465,128 ns/iter (+/- 8,699,934)
Bench with this PR:
test large_data_10_streams ... bench: 28,829,216 ns/iter (+/- 15,924,956) = 363 MB/s
test large_data_1_stream ... bench: 14,354,999 ns/iter (+/- 20,039,122) = 73 MB/s
test small_data_100_streams ... bench: 14,061,741 ns/iter (+/- 17,311,517)
test small_data_1_stream ... bench: 19,194,441 ns/iter (+/- 5,012,070)
Surprised that large_data_10_streams and small_data_1_stream are slower...
Those tests tend to be extremely noisy, as the huge variance suggests. A targeted quinn-udp benchmark might be more useful.
We've also found on neqo that multi-packet RX without multi-packet TX has limited benefits, since the RX batch size will be very small.
I added sendmsg_x support, mostly to see what the performance difference would be. But it seems that none of the benches or tests call send with a Transmit struct where segment_size is not None?
A targeted quinn-udp benchmark might be more useful.
How about using the throughput.rs benchmark @larseggert?
https://github.com/quinn-rs/quinn/blob/main/quinn-udp/benches/throughput.rs
With @mxinden's benchmark. Baseline:
gso_true/throughput time: [58.076 ms 58.230 ms 58.387 ms]
thrpt: [171.27 MiB/s 171.73 MiB/s 172.19 MiB/s]
Only sendmsg_x:
gso_true/throughput time: [15.143 ms 15.189 ms 15.236 ms]
thrpt: [656.35 MiB/s 658.36 MiB/s 660.37 MiB/s]
change:
time: [-74.028% -73.915% -73.808%] (p = 0.00 < 0.05)
thrpt: [+281.80% +283.36% +285.04%]
Performance has improved.
Both sendmsg_x and recvmsg_x:
gso_true/throughput time: [12.632 ms 12.682 ms 12.731 ms]
thrpt: [785.46 MiB/s 788.53 MiB/s 791.61 MiB/s]
change:
time: [-78.321% -78.221% -78.112%] (p = 0.00 < 0.05)
thrpt: [+356.88% +359.16% +361.27%]
Performance has improved.
Both sendmsg_x and recvmsg_x with BATCH_SIZE of 64:
gso_true/throughput time: [11.640 ms 11.682 ms 11.725 ms]
thrpt: [852.85 MiB/s 856.00 MiB/s 859.07 MiB/s]
change:
time: [-80.030% -79.938% -79.844%] (p = 0.00 < 0.05)
thrpt: [+396.13% +398.45% +400.75%]
Performance has improved.
No. They are the equivalent of the mmsg Linux calls. AFAIK Apple doesn't have GSO/GRO via the socket interface.
Are you waiting on anything from me on this?
@Ralith can you do another round on this one?
Once @mxinden's fix to the bench is in, I will rebase and squash this PR.
Hey guys 👋, is there any chance Apple won't approve apps that are using those private syscalls in the App Store? They are notorious for doing so and even de-listing apps for using anything "undocumented". See 2.5.1 here: https://developer.apple.com/app-store/review/guidelines/
See one of such cases here: https://9to5mac.com/2019/11/04/electron-app-rejections/
How they will find out? Apple employs automated tools to scan apps for the usage of private APIs. If sendmsg_x and recvmsg_x are detected, the app is at risk of being flagged.
The use of the private syscalls is now behind a non-default feature.
@larseggert should we add a big fat warning saying that if you enable this flag you will violate Apple ToS so it's only should be enabled if app is not distributed via App Store (or notarized for EU)?
I have no opinion here - we don't distribute via the App Store.
Could you make a suggestion on what you think would be good to add?
The use of the private syscalls is now behind a non-default feature.
Thank you for making it optional! It can be an issue to toggle code-paths like these with cargo features because they get aggregated across the entire dependency tree. Simply adding another dependency that also happens to depend on quinn-udp can thus activate this silently. (As a rule of thumb, cargo features should only add new APIs, not modify existing ones so this doesn't happen.)
For some prior art, curve25519-dalek had to solve similar challenges: https://github.com/dalek-cryptography/curve25519-dalek/tree/main/curve25519-dalek#backends
It looks like it is already abstracted away pretty well. Could we change this to a "plain" rustc cfg that needs to be set at build-time? I am happy for it to be opt-out. Most apps using this probably aren't in the app store and toggling it off is a one-liner in the build process. Ultimately, I got no opinion though on opt-in/opt-out.
Simply adding another dependency that also happens to depend on
quinn-udpcan thus activate this silently.
Ignoring the case where the original activation is unintentional, if that other dependency can use quinn-udp with sendmsg_x and recvmsg_x, I don't see a reason why other consumers of quinn-udp within the same binary, shouldn't use sendmsg_x and recvmsg_x. In other words, I don't think this needs to be optional per consumer of quinn-udp, but instead only optional for the entire binary. Thus I think a feature flag suffices.
Could we change this to a "plain" rustc cfg that needs to be set at build-time?
I might be missing something. Doesn't the same concern apply both to plain cargo features and rustc cfgs? Aren't rustc cfgs also global, as in apply to the entire binary?
If so, I would opt for the, in my eyes, more idiomatic cargo feature instead of the less intuitive rustc cfg.
Could we change this to a "plain" rustc cfg that needs to be set at build-time?
I might be missing something. Doesn't the same concern apply both to plain cargo features and rustc cfgs? Aren't rustc cfgs also global, as in apply to the entire binary?
If so, I would opt for the, in my eyes, more idiomatic cargo feature instead of the less intuitive rustc cfg.
The cfg flag is set at build time. Cargo features are propagated through the dependency graph so that if you add a dependency to your binary that has a feature-enabled dependency on one of your dependencies, that feature now also gets enabled for your project. This can "silently" happen without you noticing, so it's a bit of a footgun.
The
cfgflag is set at build time.
I don't understand why this is relevant. Both cargo features and rustc cfgs happen at build time, i.e. not at runtime.
Cargo features are propagated through the dependency graph so that if you add a dependency to your binary that has a feature-enabled dependency on one of your dependencies, that feature now also gets enabled for your project.
Correct. My argument above is, that this isn't a problem here. I argue that if it is safe for one consumer of quinn-udp to use sendmsg_x and recvmsg_x, it is safe for all consumers of quinn-udp within the same binary to use sendmsg_x and recvmsg_x.
Can you think of a scenario where that is not the case?
This can "silently" happen without you noticing, so it's a bit of a footgun.
In this particular case, if one doesn't trust one's dependency to make wise choices on their quinn-udp feature selection, I don't think one should use that dependency in the first place.
I am happy for it to be opt-out.
I suggest for it to be opt-in for now, given that we have little to no experience with these APIs, nor are we aware of any other applications using these APIs.
The
cfgflag is set at build time.I don't understand why this is relevant. Both cargo features and rustc cfgs happen at build time, i.e. not at runtime.
Because cfg flags are set from the command/environment where the end artifact is built, whereas Cargo features can be set by other dependencies which often aren't under your control.
Cargo features are propagated through the dependency graph so that if you add a dependency to your binary that has a feature-enabled dependency on one of your dependencies, that feature now also gets enabled for your project.
Correct. My argument above is, that this isn't a problem here. I argue that if it is safe for one consumer of
quinn-udpto usesendmsg_xandrecvmsg_x, it is safe for all consumers ofquinn-udpwithin the same binary to usesendmsg_xandrecvmsg_x.Can you think of a scenario where that is not the case?
This can "silently" happen without you noticing, so it's a bit of a footgun.
In this particular case, if one doesn't trust one's dependency to make wise choices on their
quinn-udpfeature selection, I don't think one should use that dependency in the first place.
I don't think we're going to solve the crates.io supply chain issues here, and I think not everyone is as careful as Mozilla as vetting their dependencies (and any updates happening to them over time). As such, IMO it is a best practice that additional functionality enabled via features should additionally be toggled by API choices. On the other hand, (a) I'm not sure how ergonomic that would be in this case, (b) I don't like the ergonomics of cfg as an alternative.
I am happy for it to be opt-out.
I suggest for it to be opt-in for now, given that we have little to no experience with these APIs, nor are we aware of any other applications using these APIs.
This seems reasonable to me.
The
cfgflag is set at build time.I don't understand why this is relevant. Both cargo features and rustc cfgs happen at build time, i.e. not at runtime.
Because
cfgflags are set from the command/environment where the end artifact is built, whereas Cargo features can be set by other dependencies which often aren't under your control.
Thank you for expanding on this. That makes sense.
See 4f86c5058740a9295b74f87c8cab6206a60e9971. Let me know if I misunderstood something.
With this change, how will CI cover this code? AFAIK cargo hack only iterates over feature permutations.
See 4f86c50. Let me know if I misunderstood something.
Discussed this with @Ralith privately, we feel that the use of cfg has a negative impact on the ergonomics and the chance of issues arising from silent feature enabling is pretty limited in this case, so would prefer to stick with a feature for this for now.
Sorry it took a while to get the benchmark PR merged, can you rebase on top of main?
Shouldn't the features CI workflow have included fast-apple-datapath? I don't see it in the logs. (Or I am missing something.)
Should be fine I believe:
running `cargo check --no-default-features --features direct-log,fast-apple-datapath,log,tracing` on quinn-udp (1754/1772)
running `cargo check --no-default-features --features default` on quinn-udp (1755/1772)
running `cargo check --no-default-features --features direct-log` on quinn-udp (1756/1772)
running `cargo check --no-default-features --features default,direct-log` on quinn-udp (1757/1772)
running `cargo check --no-default-features --features fast-apple-datapath` on quinn-udp (1758/1772)
running `cargo check --no-default-features --features default,fast-apple-datapath` on quinn-udp (1759/1772)
running `cargo check --no-default-features --features direct-log,fast-apple-datapath` on quinn-udp (1760/1772)
running `cargo check --no-default-features --features default,direct-log,fast-apple-datapath` on quinn-udp (1761/1772)
running `cargo check --no-default-features --features log` on quinn-udp (1762/1772)
running `cargo check --no-default-features --features direct-log,log` on quinn-udp (1763/1772)
running `cargo check --no-default-features --features fast-apple-datapath,log` on quinn-udp (1764/1772)
running `cargo check --no-default-features --features direct-log,fast-apple-datapath,log` on quinn-udp (1765/1772)
running `cargo check --no-default-features --features tracing` on quinn-udp (1766/1772)
running `cargo check --no-default-features --features direct-log,tracing` on quinn-udp (1767/1772)
running `cargo check --no-default-features --features fast-apple-datapath,tracing` on quinn-udp (1768/1772)
running `cargo check --no-default-features --features direct-log,fast-apple-datapath,tracing` on quinn-udp (1769/1772)
running `cargo check --no-default-features --features log,tracing` on quinn-udp (1770/1772)
running `cargo check --no-default-features --features direct-log,log,tracing` on quinn-udp (1771/1772)
running `cargo check --no-default-features --features fast-apple-datapath,log,tracing` on quinn-udp (1772/1772)
For those interested in using the fast-apple-datapath feature:
- All MacOS Firefox Nightly users are now using it.
- We see ~11% performance improvement on our localhost CPU bound throughput benchmark.
- Every other (50th percentile) QUIC UDP
recvmsg_xcall reads 2 datagrams. - We are soon rolling it out to all Firefox Beta users. Once that succeeds it will make it into the Firefox release channel.
https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_received/explore?aggType=avg&os=Darwin&timeHorizon=WEEK&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D