awk

Open JohannesBuchner opened this issue 4 years ago • 11 comments

awk is a small DSL that can parse text relatively quickly. It is installed by default on many Unix-like systems, requires little code, and is easy to integrate into shell script pipelines.

I placed some solutions for the groupby questions here: https://gist.github.com/JohannesBuchner/442e09b7c77c7150a4885c715eb17e6b Some of them may be correct.

mawk used to be faster than gawk; I am not sure whether the difference is still significant.

The solution to the median question relies on sorting, which can be parallelized. I am not sure whether there is a more elegant solution.
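
A minimal sketch of the sort-based idea, assuming the approach is to order rows by group and value and then take the middle of each group. It is written in Python for illustration and is not the code from the gist; the function name `median_by_group` and the `(group, value)` row layout are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def median_by_group(rows):
    """rows: iterable of (group, value) pairs; returns {group: median}."""
    result = {}
    # Sorting by (group, value) makes each group contiguous and ordered,
    # which is the step that an external, parallel sort could take over.
    for key, grp in groupby(sorted(rows), key=itemgetter(0)):
        values = [v for _, v in grp]
        mid = len(values) // 2
        if len(values) % 2:
            result[key] = values[mid]  # odd count: the middle element
        else:
            result[key] = (values[mid - 1] + values[mid]) / 2  # even count: mean of the two middle elements
    return result
```

Once the rows are sorted, only one group at a time has to be held in memory, which is why sorting first suits data larger than RAM.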

JohannesBuchner, Aug 21 '20 14:08

This should work OK for very large datasets, in particular those much larger than RAM.

JohannesBuchner, Aug 21 '20 15:08

Thank you, I will try it out. AFAIU it prints the result to stdout. What is the best way to capture it in an in-memory variable? Piping into a file on a RAM disk? Also, the last question should compute a count by group, not just a sum.

jangorecki, Aug 21 '20 15:08

I am not sure I understand; stdout is already in RAM. If you want to store the output in a Python program, subprocess.check_output is perhaps the easiest way.
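
A minimal sketch of the subprocess.check_output suggestion: the helper runs a command and returns everything it wrote to stdout as a Python string. The helper name `capture_stdout`, the awk program in the comment, and the file name `data.csv` are illustrative assumptions, not taken from the gist.

```python
import subprocess
import sys


def capture_stdout(cmd):
    """Run cmd (a list of arguments) and return its stdout as text."""
    # check_output raises CalledProcessError on a non-zero exit status.
    return subprocess.check_output(cmd, text=True)


if __name__ == "__main__":
    # Stand-in child process so the sketch runs without awk or a data file.
    result = capture_stdout([sys.executable, "-c", "print('id1,42')"])
    print(result, end="")
    # With awk (hypothetical program and file name) the call would look like:
    #   capture_stdout(["awk", "-F,", "{s[$1] += $2} END {for (k in s) print k, s[k]}", "data.csv"])
```

The whole output is held in memory at once, so this suits small aggregated results rather than row-level dumps.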

JohannesBuchner, Aug 21 '20 16:08

Updated the last command to include count.

JohannesBuchner, Aug 21 '20 16:08

For very large outputs, reading from a pipe (also possible with subprocess) may be useful, since it avoids holding the whole output in memory.
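
Reading from a pipe could look roughly like the sketch below, which yields the child's output one line at a time instead of collecting it first. The generator name `stream_lines` is hypothetical, and a Python child process stands in for awk so the example runs anywhere.

```python
import subprocess
import sys


def stream_lines(cmd):
    """Yield the command's stdout line by line instead of buffering it all."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


if __name__ == "__main__":
    child = [sys.executable, "-c", "for i in range(3): print(i)"]
    # Only one line is held in memory at a time while summing.
    print(sum(int(line) for line in stream_lines(child)))
```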

JohannesBuchner, Aug 21 '20 16:08

The problem is that printing to the console adds overhead, so redirecting the output to a file is preferable.

jangorecki, Aug 21 '20 16:08

Also, every command reads the data from disk, which is another overhead that should be reduced. Ideally, the data would be read once and all commands would then run in sequence, each producing an output file for its query.
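
One way to approximate "read once, write one file per query" with awk is a single program that accumulates several aggregates in one scan and redirects each result to its own file. The sketch below drives such a program from Python. The two queries, the column layout (group key in column 1, value in column 2, one header row), and the names `AWK_PROGRAM`, `run_queries_once`, `sum_out` and `cnt_out` are assumptions for illustration, not the benchmark's actual queries.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# One awk program that answers two illustrative queries in a single scan.
AWK_PROGRAM = r"""
NR > 1 { sum[$1] += $2; cnt[$1]++ }   # skip the header row, accumulate per group
END {
    for (k in sum) print k, sum[k] > sum_out   # query 1: sum by group
    for (k in cnt) print k, cnt[k] > cnt_out   # query 2: count by group
}
"""


def run_queries_once(data_path, sum_out, cnt_out):
    """Scan data_path once; write each query's result to its own file."""
    subprocess.run(
        ["awk", "-F,", "-v", f"sum_out={sum_out}", "-v", f"cnt_out={cnt_out}",
         AWK_PROGRAM, str(data_path)],
        check=True,
    )


if __name__ == "__main__" and shutil.which("awk"):
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "data.csv").write_text("id1,v1\na,1\nb,2\na,3\n")
        run_queries_once(tmp / "data.csv", tmp / "sum.out", tmp / "cnt.out")
        print(sorted((tmp / "sum.out").read_text().splitlines()))
        print(sorted((tmp / "cnt.out").read_text().splitlines()))
```

The trade-off is that the queries can no longer be timed individually, which may matter for a per-query benchmark.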

jangorecki, Aug 21 '20 17:08

OK, if you want to remove the I/O time, RAM disks are probably a good solution.

JohannesBuchner, Aug 21 '20 17:08

I am not sure whether you want to look at the output or not. If not, then you can pipe it to /dev/null, which will avoid the console printing overhead.
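
When the command is launched from Python rather than a shell, the same effect can be sketched with subprocess.DEVNULL. The helper name `time_discarding_output` is hypothetical, and the stand-in child process replaces the awk call so the example runs without a data file.

```python
import subprocess
import sys
import time


def time_discarding_output(cmd):
    """Run cmd with stdout sent to the null device; return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = time_discarding_output([sys.executable, "-c", "print('x' * 1000)"])
    print(f"{elapsed:.3f} s")
```

The child still formats and writes its output; only the cost of rendering it in a terminal is avoided.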

JohannesBuchner, Aug 21 '20 17:08

Any idea if this is the most recent version? https://github.com/ploxiln/mawk-2

jangorecki, Aug 26 '20 09:08

I simply installed the Ubuntu package, which is mawk 1.3.3.

JohannesBuchner, Aug 26 '20 10:08