sd
Benchmark is wrong
Issue
In the benchmark, I noticed that sd is run without the -p option, so it never prints the result to stdout, while the sed commands print everything to stdout.
Also, once sd "(\w+)" "$1$1" dump.json >/dev/null is run, every word in the file is deleted. This happens because the shell expands $1 inside double quotes to an empty string before sd sees it, and sd replaces in place, so each match is overwritten with nothing.
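The quoting problem can be reproduced without sd at all. The sketch below assumes a POSIX shell with no positional parameters set (as in a fresh interactive session); the variable names are only for illustration.

```shell
#!/bin/sh
# Clear positional parameters so that $1 is unset, as in a fresh shell.
set --

double="$1$1"      # double quotes: the shell expands $1 (here: to nothing)
single='$1$1'      # single quotes: the text stays literal
escaped="\$1\$1"   # escaped dollars inside double quotes also stay literal

printf 'double=[%s]\n'  "$double"    # prints double=[]
printf 'single=[%s]\n'  "$single"    # prints single=[$1$1]
printf 'escaped=[%s]\n' "$escaped"   # prints escaped=[$1$1]
```

Only the second and third forms hand the capture-group reference to the program unchanged.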
Experiment
Here is my run on a small test file:
➜ ~/tmp echo '{"hello": "world"}' > test.txt
➜ ~/tmp cat test.txt
{"hello": "world"}
➜ ~/tmp hyperfine 'sd "(\w+)" "$1$1" test.txt'
Benchmark #1: sd "(\w+)" "$1$1" test.txt
Time (mean ± σ): 6.7 ms ± 1.1 ms [User: 3.2 ms, System: 1.8 ms]
Range (min … max): 5.6 ms … 12.2 ms 245 runs
➜ ~/tmp cat test.txt
{"": ""}
Please pay attention to the second cat output.
This is why almost every run of sd (except the first one) is so fast: the first run deletes every word, so later runs find nothing to replace and only read the file.
The following command should be used to compete with sed:
hyperfine 'sd -p "(\w+)" "\$1\$1" test.txt > /dev/null'
Please note the escaped groups \$1 and the preview option -p.
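One way to catch this kind of mistake is to checksum the input before and after a candidate command. This is a minimal sketch: input_unchanged is a hypothetical helper, and the in-place edit is simulated by truncating a temporary file rather than by calling sd.

```shell
#!/bin/sh
# Hypothetical helper: run a command and succeed only if FILE is left untouched.
# Usage: input_unchanged FILE COMMAND [ARGS...]
input_unchanged() {
    file=$1
    shift
    before=$(cksum < "$file")
    "$@" > /dev/null 2>&1
    after=$(cksum < "$file")
    [ "$before" = "$after" ]
}

tmp=$(mktemp)
printf '{"hello": "world"}\n' > "$tmp"

# A command that only reads the file leaves it intact.
if input_unchanged "$tmp" cat "$tmp"; then ro=unchanged; else ro=modified; fi

# Stand-in for an in-place edit: truncate the file.
if input_unchanged "$tmp" sh -c ': > "$1"' sh "$tmp"; then rw=unchanged; else rw=modified; fi

printf 'read-only: %s\nin-place:  %s\n' "$ro" "$rw"
rm -f "$tmp"
```

With sd installed, the same check could wrap the real command, for example input_unchanged test.txt sd -p '(\w+)' '$1$1' test.txt.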
Experiment Results
Here are my results for a 120 MB file
➜ ~/tmp l dump.json
.rw-r--r--@ 120M sergey 2 Aug 22:20 dump.json
➜ ~/tmp hyperfine \
'sed -E "s:(\w+):\1\1:g" dump.json >/dev/null' \
"sed 's:\(\w\+\):\1\1:g' dump.json >/dev/null" \
'sd -p "(\w+)" "\$1\$1" dump.json >/dev/null'
Benchmark #1: sed -E "s:(\w+):\1\1:g" dump.json >/dev/null
Time (mean ± σ): 5.724 s ± 0.056 s [User: 5.489 s, System: 0.146 s]
Range (min … max): 5.656 s … 5.849 s 10 runs
Benchmark #2: sed 's:\(\w\+\):\1\1:g' dump.json >/dev/null
Time (mean ± σ): 2.614 s ± 0.034 s [User: 2.493 s, System: 0.084 s]
Range (min … max): 2.569 s … 2.676 s 10 runs
Benchmark #3: sd -p "(\w+)" "\$1\$1" dump.json >/dev/null
Time (mean ± σ): 12.590 s ± 0.216 s [User: 12.087 s, System: 0.303 s]
Range (min … max): 12.403 s … 13.150 s 10 runs
Summary
'sed 's:\(\w\+\):\1\1:g' dump.json >/dev/null' ran
2.19 ± 0.04 times faster than 'sed -E "s:(\w+):\1\1:g" dump.json >/dev/null'
4.82 ± 0.10 times faster than 'sd -p "(\w+)" "\$1\$1" dump.json >/dev/null'
➜ ~/tmp l dump.json
.rw-r--r--@ 120M sergey 2 Aug 22:20 dump.json
Thoughts
~Even if we fixed the benchmark, I do think that we are capped with pipe throughput.~
UPD: OK, apparently the pipe is not the problem.
Platform
MBP 2015, 2.7 GHz Intel Core i5
So, an important update!
Even my benchmark above is broken: sed on macOS is not the same as sed on Linux. Therefore, I switched to gsed (GNU sed):
➜ ~/tmp hyperfine \
'gsed -E "s:(\w+):\1\1:g" dump.json >/dev/null' \
"gsed 's:\(\w\+\):\1\1:g' dump.json >/dev/null" \
'sd -p "(\w+)" "\$1\$1" dump.json >/dev/null'
Benchmark #1: gsed -E "s:(\w+):\1\1:g" dump.json >/dev/null
Time (mean ± σ): 39.251 s ± 2.217 s [User: 37.303 s, System: 0.765 s]
Range (min … max): 37.511 s … 43.916 s 10 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark #2: gsed 's:\(\w\+\):\1\1:g' dump.json >/dev/null
Time (mean ± σ): 37.544 s ± 0.723 s [User: 36.282 s, System: 0.594 s]
Range (min … max): 36.911 s … 38.991 s 10 runs
Benchmark #3: sd -p "(\w+)" "\$1\$1" dump.json >/dev/null
Time (mean ± σ): 12.599 s ± 0.183 s [User: 12.076 s, System: 0.307 s]
Range (min … max): 12.430 s … 12.940 s 10 runs
Summary
'sd -p "(\w+)" "\$1\$1" dump.json >/dev/null' ran
2.98 ± 0.07 times faster than 'gsed 's:\(\w\+\):\1\1:g' dump.json >/dev/null'
3.12 ± 0.18 times faster than 'gsed -E "s:(\w+):\1\1:g" dump.json >/dev/null'
sd is about 3 times faster than gsed, but that is still well short of the advertised 11x.
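The ratios in the summary follow directly from the mean times reported above. A small sketch that recomputes them, with the values copied by hand from the gsed run (ratio is a hypothetical helper name):

```shell
#!/bin/sh
# Mean wall-clock times in seconds, copied from the hyperfine output above.
gsed_bre=37.544   # gsed 's:\(\w\+\):\1\1:g'
gsed_ere=39.251   # gsed -E "s:(\w+):\1\1:g"
sd_mean=12.599    # sd -p "(\w+)" "\$1\$1"

# Hypothetical helper: print a / b rounded to two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'; }

vs_bre=$(ratio "$gsed_bre" "$sd_mean")
vs_ere=$(ratio "$gsed_ere" "$sd_mean")

printf 'sd vs gsed (basic regex):    %sx\n' "$vs_bre"   # 2.98x
printf 'sd vs gsed (extended regex): %sx\n' "$vs_ere"   # 3.12x
```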
Hello @qezz
In the benchmark, I noticed that sd is run without the -p option, so it never prints the result to stdout, while the sed commands print everything to stdout.
The command for the benchmark was written before the -p option was introduced.
I'm glad you tried to replicate the results. I will investigate potential performance regressions as soon as I get some free time.
I tried to replicate the results as well, even at commit https://github.com/chmln/sd/commit/324fd1c132a5c63212e43497496bd106b9cb57b3, where the benchmarks were added to README.md, but without success: I can't reach the advertised 11x either. As @qezz already mentioned, it seems the benchmark is wrong:
Also, once sd "(\w+)" "$1$1" dump.json >/dev/null is run, every word in the file is deleted. This happens because the shell expands $1 inside double quotes to an empty string before sd sees it, and sd replaces in place, so each match is overwritten with nothing.
My benchmark for the commit https://github.com/chmln/sd/commit/324fd1c132a5c63212e43497496bd106b9cb57b3:
Benchmark #1: sed -i -E "s:(\w+):\1\1:g" dump.json
Time (mean ± σ): 7.791 s ± 0.076 s [User: 7.583 s, System: 0.166 s]
Range (min … max): 7.723 s … 7.935 s 10 runs
Benchmark #2: sed -i 's:\(\w\+\):\1\1:g' dump.json
Time (mean ± σ): 7.877 s ± 0.157 s [User: 7.672 s, System: 0.160 s]
Range (min … max): 7.712 s … 8.121 s 10 runs
Benchmark #3: sd -i "(\w+)" "\$1\$1" dump.json
Time (mean ± σ): 4.292 s ± 0.040 s [User: 3.983 s, System: 0.271 s]
Range (min … max): 4.240 s … 4.372 s 10 runs
Summary
'sd -i "(\w+)" "\$1\$1" dump.json' ran
1.82 ± 0.02 times faster than 'sed -i -E "s:(\w+):\1\1:g" dump.json'
1.84 ± 0.04 times faster than 'sed -i 's:\(\w\+\):\1\1:g' dump.json'