frawk
frawk is 3 times slower than mawk when reading/writing to piped commands opened by the awk process.
Reading 2 input files from a pipe and writing to stdout:
❯ time gawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
print line1 "\t" line2
}
}
}' > /dev/null
real 0m17,143s
user 0m17,215s
sys 0m0,184s
❯ time gawk -b '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
print line1 "\t" line2
}
}
}' > /dev/null
real 0m17,341s
user 0m17,395s
sys 0m0,223s
❯ time mawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
print line1 "\t" line2
}
}
}' > /dev/null
real 0m7,453s
user 0m7,511s
sys 0m0,175s
❯ time frawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
print line1 "\t" line2
}
}
}' > /dev/null
real 0m20,671s
user 0m20,621s
sys 0m0,091s
Reading 2 input files from a pipe and no writing:
❯ time gawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
#print line1 "\t" line2
}
}
}' > /dev/null
real 0m12,966s
user 0m13,049s
sys 0m0,168s
❯ time gawk -b '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
#print line1 "\t" line2
}
}
}' > /dev/null
real 0m13,009s
user 0m13,075s
sys 0m0,181s
❯ time mawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
#print line1 "\t" line2
}
}
}' > /dev/null
real 0m4,260s
user 0m4,298s
sys 0m0,201s
❯ time frawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
#print line1 "\t" line2
}
}
}' > /dev/null
real 0m14,207s
user 0m14,169s
sys 0m0,023s
Reading 2 input files from a pipe and writing output to one pipe:
❯ time gawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
write_cmd = "cat > /dev/null";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
print line1 "\t" line2 | write_cmd
}
}
}'
real 0m38,797s
user 0m28,519s
sys 0m37,103s
❯ time gawk -b '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
write_cmd = "cat > /dev/null";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
print line1 "\t" line2 | write_cmd
}
}
}'
real 0m37,999s
user 0m27,634s
sys 0m36,721s
❯ time mawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
write_cmd = "cat > /dev/null";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
print line1 "\t" line2 | write_cmd
}
}
}'
real 0m8,415s
user 0m8,442s
sys 0m0,305s
❯ time frawk '
BEGIN {
read_cmd1 = "yes | head -n 100000000";
read_cmd2 = "yes | head -n 100000000";
write_cmd = "cat > /dev/null";
while ( (read_cmd1 | getline line1) > 0 ) {
if ( (read_cmd2 | getline line2) > 0 ) {
print line1 "\t" line2 | write_cmd
}
}
}'
real 0m28,593s
user 0m40,432s
sys 0m15,867s
In the last case frawk even uses more than 200% CPU when it needs to both read from and write to pipes opened by the awk process.
Fascinating! Thanks for filing this issue. I think mawk, frawk, and gawk may all be buffering their file IO a bit differently. I can try to take a look at what mawk is doing and compare it with the Rust standard library.
I assume mawk might not do line buffering in this case.
The original code I had actually decompresses gzipped files and writes out gzipped files via those *_cmd commands:
https://github.com/aertslab/single_cell_toolkit/blob/master/barcode_10x_scatac_fastqs.sh
I definitely think that line buffering on output was a big issue in the last benchmark. That's fixed in the latest commit; reading is still slower though.
Could CommandReader be used for reading from pipes to solve this issue? https://docs.rs/grep-cli/latest/grep_cli/struct.CommandReader.html
Feel free to try things out on that latest commit: I don't notice any improvement (and wrapping in a BufRead doesn't seem to help either, unfortunately).
Probably it is not related to reading from a pipe; it is just getline that is slow. When reading directly from a premade file (with getline) instead of from a piped filehandle, the slowdown is the same.
❯ time yes | head -n 100000000 | frawk '{ print $0 }' > /dev/null
real 0m8.219s
user 0m8.313s
sys 0m0.421s
❯ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 } }' > /dev/null
real 0m23.011s
user 0m23.014s
sys 0m0.491s
❯ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 > "/dev/null" } }'
real 0m26.739s
user 0m26.760s
sys 0m0.512s
❯ time yes | head -n 100000000 | frawk 'BEGIN { write_cmd = "cat > /dev/null"; while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 | write_cmd } }'
real 0m26.507s
user 0m26.519s
sys 0m0.564s
# Create file first.
❯ yes | head -n 100000000 > 100000000.txt
❯ time frawk 'BEGIN { while ( (getline line1 < "100000000.txt") > 0 ) { print line1 } }' > /dev/null
real 0m23.025s
user 0m22.778s
sys 0m0.148s
Also, now that CommandReader is used, it should be relatively straightforward to handle compressed text files automagically, if requested, by constructing a CommandReader with the correct decompression tool.
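A rough sketch of how that could look (the extension-to-command mapping and the helper names are hypothetical, not frawk's API):

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

/// Hypothetical helper: pick a decompression command from the file
/// extension, falling back to plain reads for unknown extensions.
fn decompress_cmd(path: &str) -> Option<(&'static str, &'static [&'static str])> {
    if path.ends_with(".gz") {
        Some(("gzip", &["-dc"]))
    } else if path.ends_with(".zst") {
        Some(("zstd", &["-dc"]))
    } else if path.ends_with(".bz2") {
        Some(("bzip2", &["-dc"]))
    } else {
        None
    }
}

/// Build a buffered reader for `path`, going through a decompression
/// process when the extension calls for one. (A real implementation
/// would keep the Child around to wait on it and report its exit status.)
fn open_reader(path: &str) -> std::io::Result<Box<dyn BufRead>> {
    match decompress_cmd(path) {
        Some((cmd, args)) => {
            let mut child = Command::new(cmd)
                .args(args)
                .arg(path)
                .stdout(Stdio::piped())
                .spawn()?;
            let stdout = child.stdout.take().expect("child stdout");
            Ok(Box::new(BufReader::new(stdout)))
        }
        None => Ok(Box::new(BufReader::new(std::fs::File::open(path)?))),
    }
}

fn main() {
    for path in ["reads.fastq.gz", "table.tsv.zst", "plain.txt"] {
        match decompress_cmd(path) {
            Some((cmd, args)) => println!("{} -> {} {}", path, cmd, args.join(" ")),
            None => println!("{} -> read directly", path),
        }
    }
}
```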