frawk icon indicating copy to clipboard operation
frawk copied to clipboard

frawk is 3 times slower than mawk when reading/writing to piped commands opened by the awk process.

Open ghuls opened this issue 1 year ago • 6 comments

frawk is 3 times slower than mawk when reading/writing to piped commands opened by the awk process.

Reading 2 input files from a pipe and writing to stdout:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


     while ( (read_cmd1 | getline line1) > 0 ) {
         if ( (read_cmd2 | getline line2) > 0 ) {
             print line1 "\t" line2
         }
     }
}' > /dev/null

real    0m17,143s
user    0m17,215s
sys     0m0,184s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


     while ( (read_cmd1 | getline line1) > 0 ) {
         if ( (read_cmd2 | getline line2) > 0 ) {
             print line1 "\t" line2
         }
     }
}' > /dev/null

real    0m17,341s
user    0m17,395s
sys     0m0,223s

❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";

    while ( (read_cmd1 | getline line1) > 0 ) {
         if ( (read_cmd2 | getline line2) > 0 ) {
             print line1 "\t" line2
         }
     }
}' > /dev/null

real    0m7,453s
user    0m7,511s
sys     0m0,175s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";

     while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m20,671s
user    0m20,621s
sys     0m0,091s

Reading 2 input files from a pipe and no writing:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m12,966s
user    0m13,049s
sys     0m0,168s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m13,009s
user    0m13,075s
sys     0m0,181s

❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m4,260s
user    0m4,298s
sys     0m0,201s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m14,207s
user    0m14,169s
sys     0m0,023s

Reading 2 input files from a pipe and writing output to one pipe:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m38,797s
user    0m28,519s
sys     0m37,103s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m37,999s
user    0m27,634s
sys     0m36,721s


❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m8,415s
user    0m8,442s
sys     0m0,305s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m28,593s
user    0m40,432s
sys     0m15,867s

In the last case frawk uses even 200% CPU when it needs to read and write to a pipe openen by the awk process.

ghuls avatar Feb 28 '23 14:02 ghuls

Fascinating! Thanks for filing this issue. I think mawk, frawk, and gawk may all be buffering their file IO a bit differently. I can try to take a look at what mawk is doing and compare it with the Rust standard library.

ezrosent avatar Mar 02 '23 07:03 ezrosent

I assume mawk might not do line buffering in this case.

The original code I had is actually decompressing gziiped files and writing out gzipped files via those *_cmd commands: https://github.com/aertslab/single_cell_toolkit/blob/master/barcode_10x_scatac_fastqs.sh

ghuls avatar Mar 02 '23 09:03 ghuls

I definitely think that line buffering on output was a big issue in the last benchmark. That's fixed in the latest commit; reading is still slower though.

ezrosent avatar Mar 21 '23 05:03 ezrosent

Could CommanReader be used for reading from pipes to solve this issue? https://docs.rs/grep-cli/latest/grep_cli/struct.CommandReader.html

ghuls avatar Jul 25 '23 09:07 ghuls

Feel free to try things out on that latest commit: I don't notice any improvement (and wrapping in a BufRead doesn't seem to help either, unfortunately).

ezrosent avatar Jul 26 '23 05:07 ezrosent

Probably it is not related to reading from a pipe, but just getline that is slow. When reading from a premade file directly (with getline) instead of a piped filehandle, the slowdown is the same.

❯  time yes | head -n 100000000 | frawk '{ print $0 }' > /dev/null

real    0m8.219s
user    0m8.313s
sys     0m0.421s


❯  time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 } }' > /dev/null

real    0m23.011s
user    0m23.014s
sys     0m0.491s


❯  time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 > "/dev/null" } }'

real    0m26.739s
user    0m26.760s
sys     0m0.512s


❯  time yes | head -n 100000000 | frawk 'BEGIN { write_cmd = "cat > /dev/null"; while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 | write_cmd } }'

real    0m26.507s
user    0m26.519s
sys     0m0.564s

# Create file first.
❯  yes | head -n 100000000 > 100000000.txt

❯  time frawk 'BEGIN { while ( (getline line1 < "100000000.txt") > 0 ) { print line1 } }' > /dev/null

real    0m23.025s
user    0m22.778s
sys     0m0.148s

Also now that CommandReader is used, it should be relatively straightforward to be able to handle compressed text files automagically if requested by constructing a CommandReader with the correct decompression tool.

ghuls avatar Jul 26 '23 09:07 ghuls