transport Deprecate XorBytes

Now that we no longer support Go 1.19, we might as well deprecate XorBytes, remove the ARM-32 assembler version, and use the stdlib version instead.

 go.mod                               |   2 +-
 utils/xor/{xor_generic.go => xor.go} |   4 +-
 utils/xor/xor_arm.go                 |  60 ------------------
 utils/xor/xor_arm.s                  | 116 -----------------------------------
 utils/xor/xor_old.go                 |  77 -----------------------
 5 files changed, 3 insertions(+), 256 deletions(-)

This should not change anything on architectures other than 32-bit ARM. The only concern is that Go does not implement assembler code on 32-bit ARM (see https://go-review.googlesource.com/c/go/+/409394, which @adriancable never finished). The XorBytes benchmarks look pretty horrible (note that only the aligned versions are relevant to Pion):

name                        old time/op    new time/op    delta
XORAligned/8Bytes              107ns ± 0%     307ns ± 0%  +187.59%  (p=0.008 n=5+5)
XORAligned/128Bytes            148ns ± 0%     504ns ± 0%  +239.53%  (p=0.008 n=5+5)
XORAligned/2048Bytes          1.35µs ± 0%    3.43µs ± 0%  +153.87%  (p=0.016 n=5+4)
XORAligned/32768Bytes         22.1µs ± 1%    58.7µs ± 1%  +165.86%  (p=0.008 n=5+5)
XORUnalignedDst/8Bytes         116ns ± 0%     227ns ± 0%   +95.78%  (p=0.008 n=5+5)
XORUnalignedDst/128Bytes       164ns ± 0%     972ns ± 0%  +494.43%  (p=0.008 n=5+5)
XORUnalignedDst/2048Bytes     1.98µs ± 0%   12.68µs ± 0%  +541.67%  (p=0.016 n=4+5)
XORUnalignedDst/32768Bytes    32.7µs ± 1%   209.6µs ± 1%  +540.19%  (p=0.008 n=5+5)

name                        old speed      new speed      delta
XORAligned/8Bytes           75.0MB/s ± 0%  26.1MB/s ± 0%   -65.23%  (p=0.008 n=5+5)
XORAligned/128Bytes          862MB/s ± 0%   254MB/s ± 0%   -70.55%  (p=0.008 n=5+5)
XORAligned/2048Bytes        1.52GB/s ± 0%  0.60GB/s ± 0%   -60.61%  (p=0.016 n=5+4)
XORAligned/32768Bytes       1.49GB/s ± 1%  0.56GB/s ± 1%   -62.39%  (p=0.008 n=5+5)
XORUnalignedDst/8Bytes      69.2MB/s ± 0%  35.3MB/s ± 0%   -48.97%  (p=0.016 n=4+5)
XORUnalignedDst/128Bytes     782MB/s ± 0%   132MB/s ± 0%   -83.18%  (p=0.008 n=5+5)
XORUnalignedDst/2048Bytes   1.04GB/s ± 0%  0.16GB/s ± 0%   -84.42%  (p=0.016 n=4+5)
XORUnalignedDst/32768Bytes  1.00GB/s ± 1%  0.16GB/s ± 1%   -84.38%  (p=0.008 n=5+5)

However, 32-bit ARM does not implement hardware crypto, so once you factor in the cost of encryption, the difference is just 10% for CTR (which is seldom negotiated nowadays), and, as expected, no difference for GCM (which is used by modern browsers):

name                        old time/op    new time/op    delta
GenerateCounter                175ns ± 0%     192ns ± 0%   +9.91%  (p=0.008 n=5+5)
EncryptRTP/CTR-100            28.8µs ± 1%    30.7µs ± 0%   +6.47%  (p=0.008 n=5+5)
EncryptRTP/CTR-1000            132µs ± 1%     147µs ± 1%  +11.35%  (p=0.008 n=5+5)
EncryptRTP/GCM-100            35.9µs ± 5%    35.6µs ± 1%     ~     (p=0.841 n=5+5)
EncryptRTP/GCM-1000            214µs ± 1%     213µs ± 0%     ~     (p=0.095 n=5+5)
EncryptRTPInPlace/CTR-100     26.8µs ± 1%    28.4µs ± 0%   +5.70%  (p=0.008 n=5+5)
EncryptRTPInPlace/CTR-1000     120µs ± 2%     133µs ± 0%  +10.85%  (p=0.008 n=5+5)
EncryptRTPInPlace/GCM-100     33.5µs ± 1%    33.6µs ± 1%     ~     (p=0.151 n=5+5)
EncryptRTPInPlace/GCM-1000     199µs ± 1%     199µs ± 1%     ~     (p=0.421 n=5+5)
DecryptRTP/CTR                18.7µs ± 1%    20.1µs ± 1%   +7.55%  (p=0.008 n=5+5)
DecryptRTP/GCM                16.8µs ± 0%    17.1µs ± 2%   +2.10%  (p=0.008 n=5+5)
Write/CTR-100                 31.4µs ± 0%    33.5µs ± 1%   +6.75%  (p=0.008 n=5+5)
Write/CTR-1000                 124µs ± 1%     138µs ± 1%  +11.53%  (p=0.008 n=5+5)
Write/GCM-100                 37.5µs ± 1%    38.0µs ± 0%   +1.41%  (p=0.008 n=5+5)
Write/GCM-1000                 203µs ± 0%     204µs ± 0%   +0.24%  (p=0.032 n=5+5)
WriteRTP/CTR-100              27.1µs ± 2%    29.6µs ± 1%   +9.23%  (p=0.008 n=5+5)
WriteRTP/CTR-1000              120µs ± 1%     134µs ± 1%  +12.38%  (p=0.008 n=5+5)
WriteRTP/GCM-100              34.0µs ± 1%    33.8µs ± 1%     ~     (p=0.222 n=5+5)
WriteRTP/GCM-1000              200µs ± 1%     198µs ± 0%   -0.65%  (p=0.016 n=5+4)

name                        old speed      new speed      delta
EncryptRTP/CTR-100          3.88MB/s ± 1%  3.65MB/s ± 0%   -6.08%  (p=0.008 n=5+5)
EncryptRTP/CTR-1000         7.66MB/s ± 1%  6.88MB/s ± 1%  -10.19%  (p=0.008 n=5+5)
EncryptRTP/GCM-100          3.12MB/s ± 5%  3.15MB/s ± 1%     ~     (p=0.754 n=5+5)
EncryptRTP/GCM-1000         4.72MB/s ± 1%  4.75MB/s ± 0%     ~     (p=0.206 n=5+5)
EncryptRTPInPlace/CTR-100   4.17MB/s ± 1%  3.95MB/s ± 0%   -5.37%  (p=0.008 n=5+5)
EncryptRTPInPlace/CTR-1000  8.47MB/s ± 0%  7.61MB/s ± 0%  -10.13%  (p=0.016 n=4+5)
EncryptRTPInPlace/GCM-100   3.35MB/s ± 1%  3.33MB/s ± 1%     ~     (p=0.214 n=5+5)
EncryptRTPInPlace/GCM-1000  5.09MB/s ± 1%  5.08MB/s ± 1%     ~     (p=0.452 n=5+5)
DecryptRTP/CTR              1.50MB/s ± 1%  1.39MB/s ± 0%   -7.21%  (p=0.000 n=5+4)
DecryptRTP/GCM              2.03MB/s ± 0%  1.98MB/s ± 2%   -2.07%  (p=0.008 n=5+5)
Write/CTR-100               3.57MB/s ± 0%  3.34MB/s ± 1%   -6.44%  (p=0.016 n=4+5)
Write/CTR-1000              8.17MB/s ± 1%  7.33MB/s ± 1%  -10.35%  (p=0.008 n=5+5)
Write/GCM-100               2.99MB/s ± 1%  2.95MB/s ± 0%   -1.40%  (p=0.016 n=5+5)
Write/GCM-1000              4.98MB/s ± 0%  4.97MB/s ± 0%   -0.28%  (p=0.048 n=5+5)
WriteRTP/CTR-100            4.13MB/s ± 2%  3.78MB/s ± 1%   -8.48%  (p=0.008 n=5+5)
WriteRTP/CTR-1000           8.46MB/s ± 0%  7.53MB/s ± 1%  -11.01%  (p=0.008 n=5+5)
WriteRTP/GCM-100            3.30MB/s ± 0%  3.31MB/s ± 1%     ~     (p=0.183 n=5+5)
WriteRTP/GCM-1000           5.07MB/s ± 1%  5.10MB/s ± 0%     ~     (p=0.079 n=5+4)

Sep 20 '24 09:09 jech

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 82.99%. Comparing base (e8ac9ae) to head (e62d789).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #315      +/-   ##
==========================================
- Coverage   83.03%   82.99%   -0.04%     
==========================================
  Files          39       39              
  Lines        2729     2729              
==========================================
- Hits         2266     2265       -1     
- Misses        335      336       +1     
  Partials      128      128

Flag	Coverage Δ
go	`82.90% <ø> (-0.04%)`	:arrow_down:
wasm	`65.02% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Sep 20 '24 09:09 codecov[bot]

@jech - it's a shame that the Go team weren't interested in picking up my PR for XorBytes. It just stalled after I completed the submission and as far as I can tell it was never reviewed. I would be happy to hear any suggestions you may have to get it picked up.

Our application (which runs on 32-bit ARM, still very common for embedded systems) is connecting to IoT cameras, many of which are old and most of which use CTR. This PR would decrease performance in those cases and we would have to switch to a fork of pion with optimized XorBytes manually add back in. I'm generally against performance regressions so I don't think this PR should be merged.

Sep 20 '24 13:09 adriancable

Please check the benchmarks, Adrian.

The patch causes a 10% regression in WriteRTP. Assuming your code spends 40% of its CPU doing WriteRTP, this is a 4% regression to your code's performance, which should be negligible.

Perhaps you could share some benchmarks of your application that show whether the change is significant?

Sep 20 '24 15:09 jech

@jech - my code actually spends a lot less than 40% CPU on WriteRTP (this is used only for the camera talkback function), but it spends almost all of its CPU time on ReadRTP of which most of DecryptRTP. The WebRTC receiver is more or less a tight ReadRTP loop and on our platform (Allwinner H3) a single camera consumes 14.0% of one core in processing. We have a budget of 1.5 cores to support as many cameras as we can (the rest is needed for non-WebRTC processing functions), which means we support up to 10 cameras streaming in parallel, consuming a total of 1.4 cores.

Without the ARM32 implemention of XorBytes, single camera CPU consumption increases to 15.2% (which matches your own benchmark closely enough), which means we need to drop the number of supported cameras from 10 to 9 to meet the total budget for WebRTC processing of 1.5 cores. In practice we wouldn't actually make things less capable in a product 'update' so we would need to maintain a separate pion fork with the XorBytes optimization back in. This is just extra work.

In summary, this PR causes a performance regression in circumstances which can be meaningful for some applications (like ours) and doesn't add any benefit, so I don't see a reason to merge it.

Sep 20 '24 15:09 adriancable

Thank you @jech! I should start a list of all the things that got added and we hope to remove someday. I completely forgot about this.

@adriancable as long as you need this it will not be removed! I don't want to see you on a fork.

Sep 20 '24 15:09 Sean-Der

a single camera consumes 14.0% of one core

[...]

Without the ARM32 implemention of XorBytes, single camera CPU consumption increases to 15.2%

These figures make sense.

Adrian, may I encourage you to push https://go-review.googlesource.com/c/go/+/409394 again? It would be great if we could remove some code from Pion.

Sep 20 '24 15:09 jech

@jech - yes, I will happily give it another go. I would love to see this merged into Go - that is where it really should live, rather than Pion.

Sep 20 '24 16:09 adriancable

doesn't add any benefito

The xor_asm.s file contains assembler code which, as far as I know, has never been properly reviewed by anyone other than the author (please correct me if I'm wrong). If there is a security issue in that code, it will endanger the whole Pion community, not just your application.

Moving this code into the Go stdlib, will ensure that it gets reviewed. Failing that, removing it altogether is the safe way out.

Sep 20 '24 16:09 jech

Hi @jech - the code in xor_asm.s (as something written specifically for Pion) has been reviewed in much the same way as all code written for the Pion code base has been. In general though, I would expect that Pion PRs get fewer eyes on them than Go PRs, which is somewhat inevitable given that we have a much smaller contributor base than Go has, but still want to maintain reasonable development velocity.

Security issues are definitely a valid concern but I don't think apply more to this piece of code in Pion than other contributions, beyond the fact I suppose that being written in ARM32 assembly language makes it a little more 'esoteric' than most contributions.

Sep 20 '24 17:09 adriancable