Replace `b64url` with Erlang/OTP stdlib `base64`

Open Benjamin-Philip opened this issue 4 weeks ago • 5 comments

Presently, url-safe base64 encoding is handled by the b64url NIF at src/b64url. However, support for RFC 4648 compliant url-safe encoding was added to Erlang stdlib's base64 in Erlang/OTP 26.0. Additionally, encoding was made up to 4 times faster thanks to the JIT compiler work that was merged in the same release.
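
For reference, here is a minimal sketch of the stdlib API in question (assuming OTP 26+; b64url emits unpadded output, so padding => false should match it byte for byte):

%% OTP 26+: url-safe base64 straight from the stdlib. b64url produces
%% unpadded output, so padding => false gives compatible results.
Encoded = base64:encode(<<"foo bar">>, #{mode => urlsafe, padding => false}),
%% Encoded =:= <<"Zm9vIGJhcg">>
<<"foo bar">> = base64:decode(Encoded, #{mode => urlsafe, padding => false}).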

Benchmarking base64 and b64url with benchee[^1], using the following benchmark, we find that the built-in base64 is faster:

Mix.install([:benchee, {:b64url, github: "apache/couchdb", sparse: "src/b64url/"}])

defmodule B64Bench do
  def main do
    # CLI arguments: worker count, min/max binary size (bytes),
    # duration (s), and the number of random binaries to generate
    [workers, min_size, max_size, duration, entries] =
      Enum.map(System.argv(), &String.to_integer/1)

    # Pre-generate the inputs so generation cost stays outside the
    # timed runs
    bytes =
      1..entries
      |> Enum.to_list()
      |> Enum.map(fn _ ->
        :crypto.strong_rand_bytes(min_size + :rand.uniform(max_size - min_size))
      end)

    Benchee.run(
      %{
        "b64url" => fn input -> process(input, &:b64url.encode/1, &:b64url.decode/1) end,
        "base64 (standard) + re" => fn input ->
          process(
            input,
            fn url ->
              url = :erlang.iolist_to_binary(:re.replace(:base64.encode(url), "=+$", ""))
              url = :erlang.iolist_to_binary(:re.replace(url, "/", "_", [:global]))
              :erlang.iolist_to_binary(:re.replace(url, "\\+", "-", [:global]))
            end,
            fn url64 ->
              url64 = :erlang.iolist_to_binary(url64)
              url64 = :erlang.iolist_to_binary(:re.replace(url64, "-", "+", [:global]))
              url64 = :erlang.iolist_to_binary(:re.replace(url64, "_", "/", [:global]))

              # 61 is ?=, the ASCII code of the padding character "="
              padding =
                :erlang.list_to_binary(
                  :lists.duplicate(rem(4 - rem(:erlang.size(url64), 4), 4), 61)
                )

              :base64.decode(<<url64::binary, padding::binary>>)
            end
          )
        end,
        "base64 (urlsafe)" => fn input ->
          process(
            input,
            &:base64.encode(&1, %{mode: :urlsafe}),
            &:base64.decode(&1, %{mode: :urlsafe})
          )
        end
      },
      parallel: workers,
      time: duration,
      inputs: %{"generated" => bytes}
    )

    IO.inspect(:erlang.byte_size(Enum.join(bytes)), label: "Total size (B)")
  end

  def process(bytes, encode, decode) do
    Enum.each(bytes, fn bin -> decode.(encode.(bin)) end)
  end
end

B64Bench.main()

$ elixir b64_bench.exs 4 10 100 60 100
Operating System: Linux
CPU Information: 12th Gen Intel(R) Core(TM) i7-1255U
Number of Available Cores: 12
Available memory: 15.31 GB
Elixir 1.18.4
Erlang 28.0.2
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 1 min
memory time: 0 ns
reduction time: 0 ns
parallel: 4
inputs: generated
Estimated total run time: 3 min 6 s
Excluding outliers: false

Benchmarking b64url with input generated ...
Benchmarking base64 (standard) + re with input generated ...
Benchmarking base64 (urlsafe) with input generated ...
Calculating statistics...
Formatting results...

##### With input generated #####
Name                             ips        average  deviation         median         99th %
base64 (urlsafe)              8.18 K      122.31 μs    ±44.53%      110.85 μs      296.40 μs
b64url                        6.18 K      161.88 μs    ±56.17%      139.65 μs      609.83 μs
base64 (standard) + re        0.74 K     1345.47 μs    ±32.12%     1076.60 μs     2221.58 μs

Comparison: 
base64 (urlsafe)              8.18 K
b64url                        6.18 K - 1.32x slower +39.57 μs
base64 (standard) + re        0.74 K - 11.00x slower +1223.16 μs
Total size (B): 5491

Therefore I propose we drop b64url in favour of the stdlib functions. This has the following benefits:

  • Less code to maintain
  • (marginally) better performance
  • Enhanced safety (by way of eliminating an NIF)

[^1]: I found updating the existing benchmarks to compare 3 or more implementations tedious. The new benchmark's arguments are similar to the previous benchmark's, with the exception of an extra parameter, entries, which is the number of random binaries to encode. The ips results are directly proportional to the previous bps and can be converted to bps by multiplying by the total size (e.g. 8.18 K ips × 5491 B ≈ 44.9 MB/s for the urlsafe case).

Benjamin-Philip avatar Dec 02 '25 14:12 Benjamin-Philip

A good idea to investigate @Benjamin-Philip

With the built-in script I still see that the NIF is faster, even with the JIT. Without the JIT it's even more pronounced. Maybe we could do a length limit: if the input is < 100 bytes we use the built-ins, otherwise the NIF? Pretty sure there is a way to check if we're running a jitted emulator build, too.
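
(One way to do that check, as a minimal sketch: erlang:system_info(emu_flavor) reports the emulator flavor on OTP 24+. The strategy atoms here are hypothetical placeholders.)

%% erlang:system_info(emu_flavor) returns 'jit' on a JIT-enabled
%% emulator and 'emu' on the interpreter (available since OTP 24).
Strategy =
    case erlang:system_info(emu_flavor) of
        jit -> stdlib_base64;
        emu -> b64url_nif
    end.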

I patched the built-in script like this:

--- i/src/b64url/test/benchmark.escript
+++ w/src/b64url/test/benchmark.escript
@@ -116,25 +116,18 @@ run_worker(St, Started) ->

 do_round_trip(St) ->
     Size = St#st.minsize + rand:uniform(St#st.maxsize - St#st.minsize),
-    Data = crypto:strong_rand_bytes(Size),
+    Data = rand:bytes(Size),
     Encoded = (St#st.module):encode(Data),
     Data = (St#st.module):decode(Encoded),
     St#st{total_bytes=St#st.total_bytes+Size}.


 encode(Url) ->
-    Url1 = iolist_to_binary(re:replace(base64:encode(Url), "=+$", "")),
-    Url2 = iolist_to_binary(re:replace(Url1, "/", "_", [global])),
-    iolist_to_binary(re:replace(Url2, "\\+", "-", [global])).
+    base64:encode(Url, #{mode => urlsafe, padding => false}).


 decode(Url64) ->
-    Url1 = re:replace(iolist_to_binary(Url64), "-", "+", [global]),
-    Url2 = iolist_to_binary(
-        re:replace(iolist_to_binary(Url1), "_", "/", [global])
-    ),
-    Padding = list_to_binary(lists:duplicate((4 - size(Url2) rem 4) rem 4, $=)),
-    base64:decode(<<Url2/binary, Padding/binary>>).
+    base64:decode(Url64, #{mode => urlsafe, padding => false}).

We probably don't need to do the re replacement in this part; I think we only handle the urlsafe version. We also don't use padding:

url64 = :erlang.iolist_to_binary(:re.replace(url64, "-", "+", [:global]))
url64 = :erlang.iolist_to_binary(:re.replace(url64, "_", "/", [:global]))

nickva avatar Dec 02 '25 21:12 nickva

First things first, I absolutely don’t care at all about the switch to stdlib, but I do feel required to refute one point in particular:

  • Enhanced safety (by way of eliminating an NIF)

The last bug fix in b64url near as I can tell was this commit from almost exactly twelve years ago. Which means that at least in terms of bugs per LoC, b64url has been safer than the Erlang VM itself for more than a decade.

This PR comment is brought to you by @nickva making a joke and me feeling a bit cheeky. I haven’t spent much (any) time thinking about this PR in terms of actual performance, so I’ll leave that to y’all. I’ll even +1 the b64url removal if that’s the decision!

davisp avatar Dec 02 '25 23:12 davisp

I patched the built-in script like this:

I left the regex implementation for completeness, but since we don't need to bench the old version, replacing it with the url-safe version works too.

Pretty sure there is a way to check if we're running a jitted emulator build too

I'm not too familiar with how you package and distribute couchdb. I believe that the JIT is included in the default build, so it shouldn't be a problem in most deployments.

Maybe we could do a length limit, if it's < 100 bytes we use the built-ins, otherwise use the nif?

Are you getting different results in this range? Typically how large are the bytes you encode?

With the built-in script I still see that the nif is faster even with the jit.

Could you share the parameters you benchmarked with so that I can investigate?

Benjamin-Philip avatar Dec 03 '25 03:12 Benjamin-Philip

Looking through the built-in benchmark, I noticed a few flaws:

Firstly, the overhead of generating the random bytes and (some of) the benchmarking logic is included in the benchmark. This was mentioned in the Readme.

Now, the overhead itself is fine if it is constant - the difference in bytes per second is the speed difference. However, crypto:strong_rand_bytes/1 takes longer as the requested size N increases. Even if N were constant, we don't know whether the deviation in the time it takes is significant. This is the second flaw.

Finally, we're benchmarking the two implementations with two different random inputs. Is this still an apples-to-apples comparison? Maybe a better approach would be to benchmark with a common random input?

I'm not sure if my benchmark suffers from some of the same flaws, but maybe moving from handwritten benchmarks to some benchmarking framework would be better?

Benjamin-Philip avatar Dec 03 '25 07:12 Benjamin-Philip

Could you share the parameters you benchmarked with so that I can investigate?

I am using

./test/benchmark.escript 1 100 10000 60
nif :     4803827309 bytes /  60 seconds =    80063788.48 bps
erl :     3093572433 bytes /  60 seconds =    51559540.55 bps

On an Intel i5, Ubuntu, with Erlang 26 compiled with the JIT:

Erlang/OTP 26 [erts-14.2.5.11] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit:ns]

 cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 142
model name	: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
...

Firstly, the overhead of generating the random bytes and (some of) the benchmarking logic is included in the benchmark. This was mentioned in the Readme.

Now, the overhead itself is fine if it is constant - the difference in bytes per second is the speed difference. However, crypto:strong_rand_bytes/1 takes longer as the requested size N increases. Even if N were constant, we don't know whether the deviation in the time it takes is significant. This is the second flaw.

Yeah, good point, that should be taken out of the timing loop. Pre-generate a few and then select between them, or just go through some length ranges individually (benchmark with lengths 10, 100, 1000, 10000). Or, if we stick with random, use a common seed value; see the sketch below.
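
(A minimal sketch of the common-seed idea, assuming we stay with the Erlang escript; the algorithm and seed tuple are arbitrary choices.)

%% Seed the process-local PRNG explicitly so both implementations are
%% benchmarked against exactly the same "random" bytes across runs.
%% rand:bytes/1 is available since OTP 24.
rand:seed(exsss, {101, 102, 103}),
Data = rand:bytes(1024).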

Finally, we're benchmarking the two implementations with two random different inputs. Is this still an apples to apples comparison? Maybe a better approach would be to benchmark with a common random input?

I'm not sure if my benchmark suffers from some of the same flaws, but maybe moving from handwritten benchmarks to some benchmarking framework would be better?

Well, the benchmarking script is already there, so we just updated it to use the latest built-in functions. But something like benchee, or even better erlperf, is nice too.

Are you getting different results in this range? Typically how large are the bytes you encode?

Yeah, the ranges I am looking at are 100 - 10000. A place where b64 encoding would matter is when emitting changes feed sequences. For a Q=64, N=1 db the sequence string is about 1200 bytes; for Q=2 it would be closer to 120 bytes, and such. This is also used for attachment encoding when attachments are inline in the document body (as opposed to uploaded directly).

nickva avatar Dec 04 '25 03:12 nickva

I looked into this, and yes, b64url is faster on sizes above 100 bytes.

Benchmark

I made some improvements in this patch (changes explained in the commit msg).

Applying the patch:

git checkout -b b64url-bench
git am ~/Downloads/0001-Remove-generation-overhead-from-b64url-benchmark.patch

and then benchmarking for different sizes:

cd src/b64url
alias bench="ERL_LIBS=_build/default/lib/b64url/ ./test/benchmark.escript"

for power in $(seq 1 3); do
    bench 1 $((10 ** power)) $((10 ** (power + 1))) 60 100
done

I finally get:

Workers: 1, MinSize: 10, MaxSize: 100, Duration: 60, SampleSize: 100
erl :     4752923375 bytes /  60 seconds =    79215389.58 bps
nif :     3055668280 bytes /  60 seconds =    50927804.67 bps
1.5554448125501372 times slower
Workers: 1, MinSize: 100, MaxSize: 1000, Duration: 60, SampleSize: 100
nif :    19462006451 bytes /  60 seconds =   324366774.18 bps
erl :     8901825341 bytes /  60 seconds =   148363755.68 bps
2.186293901022968 times slower
Workers: 1, MinSize: 1000, MaxSize: 10000, Duration: 60, SampleSize: 100
nif :    29760976505 bytes /  60 seconds =   496016275.08 bps
erl :    11500940039 bytes /  60 seconds =   191682333.98 bps
2.5876994753541642 times slower

As you can see, the Erlang version becomes progressively slower relative to the NIF as the size increases (its advantage peaks at around 50 bytes), and the difference is significant (up to about 2.6x).

If you calculate the difference in performance on 10,000 bytes, you get a sub-millisecond value: 10,000 B takes about 0.052 ms at 191.7 MB/s (erl) versus about 0.020 ms at 496.0 MB/s (nif), a difference of roughly 0.032 ms.

Additionally, if you compare that last range as the number of parallel workers increases, the difference decreases to just 1.87x:

for power in $(seq 1 3); do
    bench $((10 ** power)) 1000 10000 60 100
done

Workers: 10, MinSize: 1000, MaxSize: 10000, Duration: 60, SampleSize: 100
nif :   114522454433 bytes /  60 seconds =  1908707573.88 bps
erl :    51406532861 bytes /  60 seconds =   856775547.68 bps
2.2277801683817393 times slower
Workers: 100, MinSize: 1000, MaxSize: 10000, Duration: 60, SampleSize: 100
nif :   100150195411 bytes /  60 seconds =  1669169923.52 bps
erl :    46881336505 bytes /  60 seconds =   781355608.42 bps
2.136248726618934 times slower
Workers: 1000, MinSize: 1000, MaxSize: 10000, Duration: 60, SampleSize: 100
nif :    83513632230 bytes /  60 seconds =  1391893870.50 bps
erl :    44569748335 bytes /  60 seconds =   742829138.92 bps
1.873773924014238 times slower

For reference this is my environment:

$ inxi -MSC      
System:
  Host: rivendell Kernel: 6.17.11-200.fc42.x86_64 arch: x86_64 bits: 64
  Desktop: GNOME v: 48.7 Distro: Fedora Linux 42 (Workstation Edition)
Machine:
  Type: Laptop System: LENOVO product: 21C1S0SM00 v: ThinkPad L14 Gen 3
    serial: <superuser required>
  Mobo: LENOVO model: 21C1S0SM00 serial: <superuser required> UEFI: LENOVO
    v: R1XET54W (1.36 ) date: 07/01/2024
CPU:
  Info: 10-core (2-mt/8-st) model: 12th Gen Intel Core i7-1255U bits: 64
    type: MST AMCP cache: L2: 6.5 MiB
  Speed (MHz): avg: 4481 min/max: 400/4700:3500 cores: 1: 4481 2: 4481
    3: 4481 4: 4481 5: 4481 6: 4481 7: 4481 8: 4481 9: 4481 10: 4481 11: 4481
    12: 4481
$ erl -s erlang halt

Erlang/OTP 28 [erts-16.0.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]

Conclusion

The point I'm trying to make is that, in my mind, a 0.032 ms overhead on the worst-case input is still competitive. In the real world, this difference might be even smaller.

Ultimately, it comes down to whether the sub-millisecond overhead is an acceptable tradeoff for eliminating an entire submodule (and complying better with Erlang standard practice). I'm not familiar with which components are impacted by b64url's performance, and I don't know how performance sensitive CouchDB's users are.

You might also conclude that removing b64url doesn't make any meaningful improvement to the maintenance workload since b64url is so rarely updated.

Moving Forward

Moving forward, I see 3 options:

Option 1 - Completely replace b64url

We completely replace b64url if we're not too performance sensitive, and the tradeoff is worthwhile.

Option 2 - Keep everything as is

We make no changes if we are performance sensitive, and revisit this later when stdlib performance improves.

Option 3 - Replace b64url for data less than 100 bytes

If we're hyper performance sensitive, we handle binaries smaller than 100 bytes with base64 to take advantage of the speed-up on small binaries, as sketched below.
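
A minimal sketch of what such a wrapper could look like (the module name and the exact 100-byte cutoff are hypothetical and would need to be confirmed by benchmarking):

-module(b64url_hybrid).
-export([encode/1, decode/1]).

%% Hypothetical cutoff below which the stdlib beats the NIF on a
%% JIT-enabled emulator (see the benchmarks above).
-define(STDLIB_CUTOFF, 100).

encode(Bin) when byte_size(Bin) < ?STDLIB_CUTOFF ->
    base64:encode(Bin, #{mode => urlsafe, padding => false});
encode(Bin) ->
    b64url:encode(Bin).

decode(Bin) when byte_size(Bin) < ?STDLIB_CUTOFF ->
    base64:decode(Bin, #{mode => urlsafe, padding => false});
decode(Bin) ->
    b64url:decode(Bin).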

Benjamin-Philip avatar Dec 15 '25 18:12 Benjamin-Philip

@Benjamin-Philip thanks for additional benchmarking and confirming the results.

I liked the idea of generating the benchmark data separately, so I did that and replaced the script with a call to erlperf in https://github.com/apache/couchdb/pull/5824. Also, thanks for the suggestion to use a proper benchmarking suite!

Yeah, it seems hardly worth it to add a separate clause for < 100 bytes. I think that might speed up the changes feeds, but only for something like a q=1&n=1 database. With larger q values (q>4) we'd be over 100 bytes already, and it's almost not worth having an extra case in the code just for it. For inline attachments, base-64 binaries could be arbitrarily large, so there the NIF speed becomes more important. When decoding a 10 MB binary we're talking about 20 ms vs 60 ms, which may be noticeable by users.

You might also conclude that removing b64url doesn't make any meaningful improvement to the maintenance workload since b64url is so rarely updated.

b64url does seem to make a difference with larger binaries, and, like @davisp mentioned, the b64url NIF has worked rock solid for more than a decade. To replace it we'd want the OTP version to be pretty close to the NIF's performance (it could even be a tiny bit slower), but so far the gap is more than a tiny bit. Maybe it's worth benchmarking again after a few more OTP releases; for now, I think that means we'd go with Option 2.

nickva avatar Dec 16 '25 09:12 nickva