ada Build the string directly instead of copying and then pruning tabs or newlines

Following the comment below, I add a new function called get_ascii_tab_or_newline_removed to build the string directly instead of copying and then pruning tabs or newlines.

https://github.com/ada-url/ada/blob/d8f77a1d7c7928fa67ddca8ff6116bf3758f083f/src/url.cpp#L500-L502

Note:

Due to NRVO in C++ compilers, returning a local variable back to caller should be effective as pass pointers. (neither copy ctor nor move ctor will be called)
If we add something like inline or always_inline attribute to a function, it is better to put the function definition to the header file instead of source file. Since all inline optimizitions cannot take effect if the definition cannot be reached inside the current translation unit (except link-time optimization enabled). (the definition may be in another translation unit if we put them in a source file)

Oct 13 '23 15:10 PragmaTwice

Thanks.

Please consider running with ADA_BENCHMARKS=ON (e.g., cmake -B build -D ADA_BENCHMARKS=ON && cmake --build build) and then execute ./build/benchmarks/benchdata before and after your changes.

We need some evidence that the change is beneficial (meaning that the code runs faster).

Oct 13 '23 16:10 lemire

Please consider running with ADA_BENCHMARKS=ON (e.g., cmake -B build -D ADA_BENCHMARKS=ON && cmake --build build) and then execute ./build/benchmarks/benchdata before and after your changes.

Sure. Here is the benchmark result:

Before:

Loading /home/twice/projects/ada/dependencies/.cache/url-dataset/out.txt
2023-10-14T10:41:21+09:00
Running ./benchmarks/benchdata
Run on (20 X 2918.4 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.05, 0.64, 0.40
ada spec: Ada follows whatwg/url
bad urls: ---------------------
ada---count of bad URLs       26
whatwg---count of bad URLs    26
-------------------------------

bytes/URL: 86.859205
curl : OMITTED
input bytes: 8688092
number of URLs: 100025
performance counters: No privileged access (sudo may help).
zuri : OMITTED
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
BasicBench_AdaURL_href              23837537 ns     23837653 ns           30 speed=364.469M/s time/byte=2.74372ns time/url=238.317ns url/s=4.19609M/s
BasicBench_AdaURL_aggregator_href   15135664 ns     15135570 ns           47 speed=574.018M/s time/byte=1.74211ns time/url=151.318ns url/s=6.6086M/s
BasicBench_whatwg                   41866413 ns     41866288 ns           16 speed=207.52M/s time/byte=4.81881ns time/url=418.558ns url/s=2.38915M/s

After:

Loading /home/twice/projects/ada/dependencies/.cache/url-dataset/out.txt
2023-10-14T10:41:26+09:00
Running ./benchmarks/benchdata-new
Run on (20 X 2918.4 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.04, 0.65, 0.40
ada spec: Ada follows whatwg/url
bad urls: ---------------------
ada---count of bad URLs       26
whatwg---count of bad URLs    26
-------------------------------

bytes/URL: 86.859205
curl : OMITTED
input bytes: 8688092
number of URLs: 100025
performance counters: No privileged access (sudo may help).
zuri : OMITTED
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
BasicBench_AdaURL_href              23819007 ns     23819062 ns           29 speed=364.754M/s time/byte=2.74158ns time/url=238.131ns url/s=4.19937M/s
BasicBench_AdaURL_aggregator_href   14740361 ns     14740285 ns           47 speed=589.411M/s time/byte=1.69661ns time/url=147.366ns url/s=6.78583M/s
BasicBench_whatwg                   40722269 ns     40722100 ns           17 speed=213.351M/s time/byte=4.68712ns time/url=407.119ns url/s=2.45628M/s

The benchmark results are unstable (and only for x86), so I think you can run on your side if you are interested : )

Oct 14 '23 01:10 PragmaTwice

@anonrig I will assess this soon. I recommend waiting before we include it in a release. (Not because I think it is bad, but because we want to make sure we document the benefits.)

Oct 15 '23 18:10 lemire

It provides a different output for me. If you run the script with sudo, our benchmarks enable performance counters, which provide much better output.

Before:

BasicBench_AdaURL_aggregator_href   17023800 ns     17022634 ns           41 GHz=3.22958 cycle/byte=6.28781 cycles/url=546.155 instructions/byte=28.6539 instructions/cycle=4.55705 instructions/ns=14.7173 instructions/url=2.48885k ns/url=169.11 speed=510.385M/s time/byte=1.95931ns time/url=170.184ns url/s=5.876M/s

After:

BasicBench_AdaURL_aggregator_href   17118766 ns     17118780 ns           41 GHz=3.22878 cycle/byte=6.3365 cycles/url=550.383 instructions/byte=28.6567 instructions/cycle=4.52248 instructions/ns=14.6021 instructions/url=2.4891k ns/url=170.462 speed=507.518M/s time/byte=1.97037ns time/url=171.145ns url/s=5.843M/s```

Oct 15 '23 18:10 anonrig

I am also getting so-so results with GCC 12 and a recent Intel compiler.

Before:

BasicBench_AdaURL_aggregator_href   22788603 ns     22752369 ns           31 GHz=3.18841 cycle/byte=8.375 cycles/url=727.446 instructions/byte=25.1422 instructions/cycle=3.00205 instructions/ns=9.57178 instructions/url=2.18383k ns/url=228.153 speed=381.854M/s time/byte=2.6188ns time/url=227.467ns url/s=4.39625M/s

After:

BasicBench_AdaURL_aggregator_href   23516792 ns     23485895 ns           30 GHz=3.18849 cycle/byte=8.83776 cycles/url=767.641 instructions/byte=25.1306 instructions/cycle=2.84355 instructions/ns=9.06662 instructions/url=2.18282k ns/url=240.754 speed=369.928M/s time/byte=2.70323ns time/url=234.8ns url/s=4.25894M/s

As you can see, you save about 0.01 instructions per byte (which is zero) and the performance is down. So this PR does not seem to improve the performance.

Oct 15 '23 20:10 lemire

When I was trying to improve can_parse I tried doing optimizations like here, especially allocation of strings and stuff, got same results, either worse performance or almost no difference

I think the biggest performance improvement would be achieved by changing stuff on line 499 tmp_buffer=input and I think the comment relates to that piece of code, as it copies chars, so what has to be changed IMO is tmp_buffer creation/initialization

Apr 30 '24 20:04 darowny

ada ada copied to clipboard

Build the string directly instead of copying and then pruning tabs or newlines

ada
ada copied to clipboard