ada
ada copied to clipboard
Build the string directly instead of copying and then pruning tabs or newlines
Following the comment below, I add a new function called get_ascii_tab_or_newline_removed to build the string directly instead of copying and then pruning tabs or newlines.
https://github.com/ada-url/ada/blob/d8f77a1d7c7928fa67ddca8ff6116bf3758f083f/src/url.cpp#L500-L502
Note:
- Due to NRVO in C++ compilers, returning a local variable back to caller should be effective as pass pointers. (neither copy ctor nor move ctor will be called)
- If we add something like
inlineoralways_inlineattribute to a function, it is better to put the function definition to the header file instead of source file. Since all inline optimizitions cannot take effect if the definition cannot be reached inside the current translation unit (except link-time optimization enabled). (the definition may be in another translation unit if we put them in a source file)
Thanks.
Please consider running with ADA_BENCHMARKS=ON (e.g., cmake -B build -D ADA_BENCHMARKS=ON && cmake --build build) and then execute ./build/benchmarks/benchdata before and after your changes.
We need some evidence that the change is beneficial (meaning that the code runs faster).
Please consider running with
ADA_BENCHMARKS=ON(e.g.,cmake -B build -D ADA_BENCHMARKS=ON && cmake --build build) and then execute./build/benchmarks/benchdatabefore and after your changes.
Sure. Here is the benchmark result:
Before:
Loading /home/twice/projects/ada/dependencies/.cache/url-dataset/out.txt
2023-10-14T10:41:21+09:00
Running ./benchmarks/benchdata
Run on (20 X 2918.4 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 1280 KiB (x10)
L3 Unified 24576 KiB (x1)
Load Average: 1.05, 0.64, 0.40
ada spec: Ada follows whatwg/url
bad urls: ---------------------
ada---count of bad URLs 26
whatwg---count of bad URLs 26
-------------------------------
bytes/URL: 86.859205
curl : OMITTED
input bytes: 8688092
number of URLs: 100025
performance counters: No privileged access (sudo may help).
zuri : OMITTED
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
BasicBench_AdaURL_href 23837537 ns 23837653 ns 30 speed=364.469M/s time/byte=2.74372ns time/url=238.317ns url/s=4.19609M/s
BasicBench_AdaURL_aggregator_href 15135664 ns 15135570 ns 47 speed=574.018M/s time/byte=1.74211ns time/url=151.318ns url/s=6.6086M/s
BasicBench_whatwg 41866413 ns 41866288 ns 16 speed=207.52M/s time/byte=4.81881ns time/url=418.558ns url/s=2.38915M/s
After:
Loading /home/twice/projects/ada/dependencies/.cache/url-dataset/out.txt
2023-10-14T10:41:26+09:00
Running ./benchmarks/benchdata-new
Run on (20 X 2918.4 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 1280 KiB (x10)
L3 Unified 24576 KiB (x1)
Load Average: 1.04, 0.65, 0.40
ada spec: Ada follows whatwg/url
bad urls: ---------------------
ada---count of bad URLs 26
whatwg---count of bad URLs 26
-------------------------------
bytes/URL: 86.859205
curl : OMITTED
input bytes: 8688092
number of URLs: 100025
performance counters: No privileged access (sudo may help).
zuri : OMITTED
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
BasicBench_AdaURL_href 23819007 ns 23819062 ns 29 speed=364.754M/s time/byte=2.74158ns time/url=238.131ns url/s=4.19937M/s
BasicBench_AdaURL_aggregator_href 14740361 ns 14740285 ns 47 speed=589.411M/s time/byte=1.69661ns time/url=147.366ns url/s=6.78583M/s
BasicBench_whatwg 40722269 ns 40722100 ns 17 speed=213.351M/s time/byte=4.68712ns time/url=407.119ns url/s=2.45628M/s
The benchmark results are unstable (and only for x86), so I think you can run on your side if you are interested : )
@anonrig I will assess this soon. I recommend waiting before we include it in a release. (Not because I think it is bad, but because we want to make sure we document the benefits.)
It provides a different output for me. If you run the script with sudo, our benchmarks enable performance counters, which provide much better output.
Before:
BasicBench_AdaURL_aggregator_href 17023800 ns 17022634 ns 41 GHz=3.22958 cycle/byte=6.28781 cycles/url=546.155 instructions/byte=28.6539 instructions/cycle=4.55705 instructions/ns=14.7173 instructions/url=2.48885k ns/url=169.11 speed=510.385M/s time/byte=1.95931ns time/url=170.184ns url/s=5.876M/s
After:
BasicBench_AdaURL_aggregator_href 17118766 ns 17118780 ns 41 GHz=3.22878 cycle/byte=6.3365 cycles/url=550.383 instructions/byte=28.6567 instructions/cycle=4.52248 instructions/ns=14.6021 instructions/url=2.4891k ns/url=170.462 speed=507.518M/s time/byte=1.97037ns time/url=171.145ns url/s=5.843M/s```
I am also getting so-so results with GCC 12 and a recent Intel compiler.
Before:
BasicBench_AdaURL_aggregator_href 22788603 ns 22752369 ns 31 GHz=3.18841 cycle/byte=8.375 cycles/url=727.446 instructions/byte=25.1422 instructions/cycle=3.00205 instructions/ns=9.57178 instructions/url=2.18383k ns/url=228.153 speed=381.854M/s time/byte=2.6188ns time/url=227.467ns url/s=4.39625M/s
After:
BasicBench_AdaURL_aggregator_href 23516792 ns 23485895 ns 30 GHz=3.18849 cycle/byte=8.83776 cycles/url=767.641 instructions/byte=25.1306 instructions/cycle=2.84355 instructions/ns=9.06662 instructions/url=2.18282k ns/url=240.754 speed=369.928M/s time/byte=2.70323ns time/url=234.8ns url/s=4.25894M/s
As you can see, you save about 0.01 instructions per byte (which is zero) and the performance is down. So this PR does not seem to improve the performance.
When I was trying to improve can_parse I tried doing optimizations like here, especially allocation of strings and stuff, got same results, either worse performance or almost no difference
I think the biggest performance improvement would be achieved by changing stuff on line 499 tmp_buffer=input
and I think the comment relates to that piece of code, as it copies chars, so what has to be changed IMO is tmp_buffer creation/initialization