Sed script libmode
I've been working on a version of the sed script that caches compiled files. In the process, I reorganized the sed script to
- be source-able as a library
- not depend on any
./files orcdcommands (by usingmake -Cand figuring out absolute paths names forgenerated_fileetc).
Besides being extensible and not requiring any hacks, I think the reorganized code is somewhat clearer, so I'm opening a pull request.
FWIW, I've benchmarked (results below) sed-cached with a simple sed command. It's about 15 times faster than compiling each time, but still twice as slow as native sed (only 75% slower on busybox sh with builtin applets). Yeah, it should be written in C not sh. But it was a fun as a proof of concept :)
I'm curious what you think of this. Perhaps sed-cached can make it to contrib/. Perhaps a C version would be even faster than sed, and then maybe sed-bin is not so useless after all.
hyperfine --export-markdown /tmp/sed-bin.md -w2 -L sed /S/sed-bin/sed,/S/sed-bin/sed-cached,'busybox sh /S/sed-bin/sed-cached',sed 'for i in $(seq 1 5); do {sed} s/:/_/g /etc/passwd; done
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
for i in $(seq 1 5); do /S/sed-bin/sed s/:/_/g /etc/passwd; done |
397.3 ± 8.2 | 383.1 | 408.5 | 33.07 ± 0.84 |
for i in $(seq 1 5); do /S/sed-bin/sed-cached s/:/_/g /etc/passwd; done |
25.5 ± 0.2 | 25.0 | 26.0 | 2.12 ± 0.03 |
for i in $(seq 1 5); do busybox sh /S/sed-bin/sed-cached s/:/_/g /etc/passwd; done |
21.4 ± 0.3 | 19.4 | 22.0 | 1.78 ± 0.04 |
for i in $(seq 1 5); do sed s/:/_/g /etc/passwd; done |
12.0 ± 0.2 | 11.1 | 12.6 | 1.00 |
Just to be clear, my sed-cached script is on a separate sed-cached branch; this pull request only includes changes to sed.
... and, finally, sed-busting performance for shell scripts, with the latest sed-cached used in library-mode. It's about 4x faster than calling sed in a regular shell (and 2x faster than calling the builtin sed, if running in a busybox with static applets)
To use this, one has to pre-define SED_DIR, SED_LIBMODE, source the sed and sed-cached scripts from the repo, and call the entrypoint function whenever desired. It takes care of computing a hash for the script and looking up a cached binary, or compiling one.
Notes at the end of https://github.com/kstr0k/sed-bin/blob/sed-cached/sed-cached
I really like the idea, let me look into this
Sure, have a look and let me know if you want to do things differently. I've converted this to a draft PR.
I've just noticed there's a bug (also present on lhoursquentin/sed-bin) whereby a malformed expression (./sed 's/:/_' </etc/passwd) just hangs.
I've just noticed there's a bug (also present on lhoursquentin/sed-bin) whereby a malformed expression (
./sed 's/:/_' </etc/passwd) just hangs.
Yeah unfortunately malformed scripts are not handled correctly for the most part, I have a small note in the README regarding this:
The translator does not handle invalid sed scripts, it will just generate invalid C code which will probably fail to compile, make sure you can run your script with an actual sed implementation before attempting to translate it.
Proper error checking in the translator is possible but it would require keeping a lot more state in memory. For instance the translator isn't even aware of curly brackets nesting depth, it just prints them right away and forgets about them.
In this specific case (s/:/_) though it cannot even finish translation an is trapped in an infinite loop, I just fixed it in commit 72f5a89404d866d0561f8168deeb0b26814fff50 since it was straightforward and clarified in the README that infinite loop is a possibility when trying to translate invalid scripts.
That would be easily checked using "if system_sed script </dev/null 2>/dev/null" — compared to invoking gcc, it's practically free. But, again, this needs design decisions (have a SED_REAL_SED var?)
All right, I had the chance to take a more in depth look, it's really nice!
I'm just wondering if it wouldn't even be easier to merge the two together and instead have a SED_CACHED env var to trigger the cached version, so that we can keep a single sed script (which could also be sourced), what do you think?
It's also worth noting that the cached version seems slower than the non-cached one when dealing with big sed scripts, I tried with par.sed and the cached version was roughly two times slower.
That would be easily checked using "if system_sed script </dev/null 2>/dev/null" — compared to invoking gcc, it's practically free. But, again, this needs design decisions (have a SED_REAL_SED var?)
My main concern with this approach is that some scripts can have side effects: creating and writing to files with the w command, reading files with r (which can remove data from a FIFO for instance), or with the GNU extensions even run a shell with s/.*/rm some-file/e. So I'd rather let the user handle this verification process manually to avoid any surprises.
All right, I had the chance to take a more in depth look, it's really nice! I'm just wondering if it wouldn't even be easier to merge the two together and instead have a
SED_CACHEDenv var to trigger the cached version, so that we can keep a singlesedscript (which could also be sourced), what do you think?
Thanks (it's even cleaner now). Sure; a few more changes will be needed to avoid namespace pollution (BIN). I don't know if any users rely on this "API".
It's also worth noting that the cached version seems slower than the non-cached one when dealing with big sed scripts, I tried with
par.sedand the cached version was roughly two times slower.
Interesting. It turns out shell string-truncation (e.g.: s='????'; s=${1#$s}; set -- "${1%"$s"}" "$s" to generate a truncated 4-char string + rest) is heavily dependent on the #/##, %/%% combination used (even though in this context #/## and %/%% are equivalent). FWIW I found out that dash, busybox sh (and even Bash) all favor # + %% (head truncation) and % + ## (tail truncation). However even the best tail-truncation is several times slower than head-.
It just so happened that the combination I was initially using (% / #) is one of the slowest. Like, orders of magnitude slower...
I've added a hyperfine example at the end of sed-cached to test all # / ## / % / %% combinations. I would drop script-tail hashing completely, except that's the "most unique" part (e.g. your par.sed starts with a bunch of comments).
My main concern with this approach is that some scripts can have side effects: creating and writing to files with the
wcommand, reading files withr(which can remove data from a FIFO for instance), or with the GNU extensions even run a shell withs/.*/rm some-file/e. So I'd rather let the user handle this verification process manually to avoid any surprises.
I was afraid that might be the case. Speaking of which, I wonder, would it be feasible to add a sandbox mode like GNU sed (disable e/r/w)?