Deprecate XorBytes
Now that we no longer support Go 1.19, we might as well deprecate XorBytes, remove the ARM-32 assembler version, and use the stdlib version instead.
go.mod | 2 +-
utils/xor/{xor_generic.go => xor.go} | 4 +-
utils/xor/xor_arm.go | 60 ------------------
utils/xor/xor_arm.s | 116 -----------------------------------
utils/xor/xor_old.go | 77 -----------------------
5 files changed, 3 insertions(+), 256 deletions(-)
This should not change anything on architectures other than 32-bit ARM. The only concern is that Go does not implement assembler code on 32-bit ARM (see https://go-review.googlesource.com/c/go/+/409394, which @adriancable never finished). The XorBytes benchmarks look pretty horrible (note that only the aligned versions are relevant to Pion):
name old time/op new time/op delta
XORAligned/8Bytes 107ns ± 0% 307ns ± 0% +187.59% (p=0.008 n=5+5)
XORAligned/128Bytes 148ns ± 0% 504ns ± 0% +239.53% (p=0.008 n=5+5)
XORAligned/2048Bytes 1.35µs ± 0% 3.43µs ± 0% +153.87% (p=0.016 n=5+4)
XORAligned/32768Bytes 22.1µs ± 1% 58.7µs ± 1% +165.86% (p=0.008 n=5+5)
XORUnalignedDst/8Bytes 116ns ± 0% 227ns ± 0% +95.78% (p=0.008 n=5+5)
XORUnalignedDst/128Bytes 164ns ± 0% 972ns ± 0% +494.43% (p=0.008 n=5+5)
XORUnalignedDst/2048Bytes 1.98µs ± 0% 12.68µs ± 0% +541.67% (p=0.016 n=4+5)
XORUnalignedDst/32768Bytes 32.7µs ± 1% 209.6µs ± 1% +540.19% (p=0.008 n=5+5)
name old speed new speed delta
XORAligned/8Bytes 75.0MB/s ± 0% 26.1MB/s ± 0% -65.23% (p=0.008 n=5+5)
XORAligned/128Bytes 862MB/s ± 0% 254MB/s ± 0% -70.55% (p=0.008 n=5+5)
XORAligned/2048Bytes 1.52GB/s ± 0% 0.60GB/s ± 0% -60.61% (p=0.016 n=5+4)
XORAligned/32768Bytes 1.49GB/s ± 1% 0.56GB/s ± 1% -62.39% (p=0.008 n=5+5)
XORUnalignedDst/8Bytes 69.2MB/s ± 0% 35.3MB/s ± 0% -48.97% (p=0.016 n=4+5)
XORUnalignedDst/128Bytes 782MB/s ± 0% 132MB/s ± 0% -83.18% (p=0.008 n=5+5)
XORUnalignedDst/2048Bytes 1.04GB/s ± 0% 0.16GB/s ± 0% -84.42% (p=0.016 n=4+5)
XORUnalignedDst/32768Bytes 1.00GB/s ± 1% 0.16GB/s ± 1% -84.38% (p=0.008 n=5+5)
However, 32-bit ARM does not implement hardware crypto, so once you factor in the cost of encryption, the difference is just 10% for CTR (which is seldom negotiated nowadays), and, as expected, no difference for GCM (which is used by modern browsers):
name old time/op new time/op delta
GenerateCounter 175ns ± 0% 192ns ± 0% +9.91% (p=0.008 n=5+5)
EncryptRTP/CTR-100 28.8µs ± 1% 30.7µs ± 0% +6.47% (p=0.008 n=5+5)
EncryptRTP/CTR-1000 132µs ± 1% 147µs ± 1% +11.35% (p=0.008 n=5+5)
EncryptRTP/GCM-100 35.9µs ± 5% 35.6µs ± 1% ~ (p=0.841 n=5+5)
EncryptRTP/GCM-1000 214µs ± 1% 213µs ± 0% ~ (p=0.095 n=5+5)
EncryptRTPInPlace/CTR-100 26.8µs ± 1% 28.4µs ± 0% +5.70% (p=0.008 n=5+5)
EncryptRTPInPlace/CTR-1000 120µs ± 2% 133µs ± 0% +10.85% (p=0.008 n=5+5)
EncryptRTPInPlace/GCM-100 33.5µs ± 1% 33.6µs ± 1% ~ (p=0.151 n=5+5)
EncryptRTPInPlace/GCM-1000 199µs ± 1% 199µs ± 1% ~ (p=0.421 n=5+5)
DecryptRTP/CTR 18.7µs ± 1% 20.1µs ± 1% +7.55% (p=0.008 n=5+5)
DecryptRTP/GCM 16.8µs ± 0% 17.1µs ± 2% +2.10% (p=0.008 n=5+5)
Write/CTR-100 31.4µs ± 0% 33.5µs ± 1% +6.75% (p=0.008 n=5+5)
Write/CTR-1000 124µs ± 1% 138µs ± 1% +11.53% (p=0.008 n=5+5)
Write/GCM-100 37.5µs ± 1% 38.0µs ± 0% +1.41% (p=0.008 n=5+5)
Write/GCM-1000 203µs ± 0% 204µs ± 0% +0.24% (p=0.032 n=5+5)
WriteRTP/CTR-100 27.1µs ± 2% 29.6µs ± 1% +9.23% (p=0.008 n=5+5)
WriteRTP/CTR-1000 120µs ± 1% 134µs ± 1% +12.38% (p=0.008 n=5+5)
WriteRTP/GCM-100 34.0µs ± 1% 33.8µs ± 1% ~ (p=0.222 n=5+5)
WriteRTP/GCM-1000 200µs ± 1% 198µs ± 0% -0.65% (p=0.016 n=5+4)
name old speed new speed delta
EncryptRTP/CTR-100 3.88MB/s ± 1% 3.65MB/s ± 0% -6.08% (p=0.008 n=5+5)
EncryptRTP/CTR-1000 7.66MB/s ± 1% 6.88MB/s ± 1% -10.19% (p=0.008 n=5+5)
EncryptRTP/GCM-100 3.12MB/s ± 5% 3.15MB/s ± 1% ~ (p=0.754 n=5+5)
EncryptRTP/GCM-1000 4.72MB/s ± 1% 4.75MB/s ± 0% ~ (p=0.206 n=5+5)
EncryptRTPInPlace/CTR-100 4.17MB/s ± 1% 3.95MB/s ± 0% -5.37% (p=0.008 n=5+5)
EncryptRTPInPlace/CTR-1000 8.47MB/s ± 0% 7.61MB/s ± 0% -10.13% (p=0.016 n=4+5)
EncryptRTPInPlace/GCM-100 3.35MB/s ± 1% 3.33MB/s ± 1% ~ (p=0.214 n=5+5)
EncryptRTPInPlace/GCM-1000 5.09MB/s ± 1% 5.08MB/s ± 1% ~ (p=0.452 n=5+5)
DecryptRTP/CTR 1.50MB/s ± 1% 1.39MB/s ± 0% -7.21% (p=0.000 n=5+4)
DecryptRTP/GCM 2.03MB/s ± 0% 1.98MB/s ± 2% -2.07% (p=0.008 n=5+5)
Write/CTR-100 3.57MB/s ± 0% 3.34MB/s ± 1% -6.44% (p=0.016 n=4+5)
Write/CTR-1000 8.17MB/s ± 1% 7.33MB/s ± 1% -10.35% (p=0.008 n=5+5)
Write/GCM-100 2.99MB/s ± 1% 2.95MB/s ± 0% -1.40% (p=0.016 n=5+5)
Write/GCM-1000 4.98MB/s ± 0% 4.97MB/s ± 0% -0.28% (p=0.048 n=5+5)
WriteRTP/CTR-100 4.13MB/s ± 2% 3.78MB/s ± 1% -8.48% (p=0.008 n=5+5)
WriteRTP/CTR-1000 8.46MB/s ± 0% 7.53MB/s ± 1% -11.01% (p=0.008 n=5+5)
WriteRTP/GCM-100 3.30MB/s ± 0% 3.31MB/s ± 1% ~ (p=0.183 n=5+5)
WriteRTP/GCM-1000 5.07MB/s ± 1% 5.10MB/s ± 0% ~ (p=0.079 n=5+4)
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 82.99%. Comparing base (
e8ac9ae) to head (e62d789).
Additional details and impacted files
@@ Coverage Diff @@
## master #315 +/- ##
==========================================
- Coverage 83.03% 82.99% -0.04%
==========================================
Files 39 39
Lines 2729 2729
==========================================
- Hits 2266 2265 -1
- Misses 335 336 +1
Partials 128 128
| Flag | Coverage Δ | |
|---|---|---|
| go | 82.90% <ø> (-0.04%) |
:arrow_down: |
| wasm | 65.02% <ø> (ø) |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@jech - it's a shame that the Go team weren't interested in picking up my PR for XorBytes. It just stalled after I completed the submission and as far as I can tell it was never reviewed. I would be happy to hear any suggestions you may have to get it picked up.
Our application (which runs on 32-bit ARM, still very common for embedded systems) is connecting to IoT cameras, many of which are old and most of which use CTR. This PR would decrease performance in those cases and we would have to switch to a fork of pion with optimized XorBytes manually add back in. I'm generally against performance regressions so I don't think this PR should be merged.
Please check the benchmarks, Adrian.
The patch causes a 10% regression in WriteRTP. Assuming your code spends 40% of its CPU doing WriteRTP, this is a 4% regression to your code's performance, which should be negligible.
Perhaps you could share some benchmarks of your application that show whether the change is significant?
@jech - my code actually spends a lot less than 40% CPU on WriteRTP (this is used only for the camera talkback function), but it spends almost all of its CPU time on ReadRTP of which most of DecryptRTP. The WebRTC receiver is more or less a tight ReadRTP loop and on our platform (Allwinner H3) a single camera consumes 14.0% of one core in processing. We have a budget of 1.5 cores to support as many cameras as we can (the rest is needed for non-WebRTC processing functions), which means we support up to 10 cameras streaming in parallel, consuming a total of 1.4 cores.
Without the ARM32 implemention of XorBytes, single camera CPU consumption increases to 15.2% (which matches your own benchmark closely enough), which means we need to drop the number of supported cameras from 10 to 9 to meet the total budget for WebRTC processing of 1.5 cores. In practice we wouldn't actually make things less capable in a product 'update' so we would need to maintain a separate pion fork with the XorBytes optimization back in. This is just extra work.
In summary, this PR causes a performance regression in circumstances which can be meaningful for some applications (like ours) and doesn't add any benefit, so I don't see a reason to merge it.
Thank you @jech! I should start a list of all the things that got added and we hope to remove someday. I completely forgot about this.
@adriancable as long as you need this it will not be removed! I don't want to see you on a fork.
a single camera consumes 14.0% of one core
[...]
Without the ARM32 implemention of XorBytes, single camera CPU consumption increases to 15.2%
These figures make sense.
Adrian, may I encourage you to push https://go-review.googlesource.com/c/go/+/409394 again? It would be great if we could remove some code from Pion.
@jech - yes, I will happily give it another go. I would love to see this merged into Go - that is where it really should live, rather than Pion.
doesn't add any benefito
The xor_asm.s file contains assembler code which, as far as I know, has never been properly reviewed by anyone other than the author (please correct me if I'm wrong). If there is a security issue in that code, it will endanger the whole Pion community, not just your application.
Moving this code into the Go stdlib, will ensure that it gets reviewed. Failing that, removing it altogether is the safe way out.
Hi @jech - the code in xor_asm.s (as something written specifically for Pion) has been reviewed in much the same way as all code written for the Pion code base has been. In general though, I would expect that Pion PRs get fewer eyes on them than Go PRs, which is somewhat inevitable given that we have a much smaller contributor base than Go has, but still want to maintain reasonable development velocity.
Security issues are definitely a valid concern but I don't think apply more to this piece of code in Pion than other contributions, beyond the fact I suppose that being written in ARM32 assembly language makes it a little more 'esoteric' than most contributions.