count using much more memory than expected

Open jgarthur opened this issue 3 years ago • 1 comments

Hi, thanks for developing this great tool!

I was working on an application of mlr count -g var1,var2, where var1 and var2 are both strings and the input file is quite large (51GB uncompressed). I noticed the memory usage in htop grow until it exceeded the uncompressed input file size. Could there be a memory leak here?

I've reproduced the issue with a minimal example containing only 1 column:

# 76 MB input file with one column, 1M unique string values with a count of 10 each
$ wc -l test.csv
10000001 test.csv
$ head -n 5 test.csv
a
A0
A1
A2
A3
$ tail -n 5 test.csv
A999995
A999996
A999997
A999998
A999999

# 2.5 GB max RSS
$ /usr/bin/time mlr --csv count -g a -o count test.csv > test_out_mlr
31.31user 9.92system 0:12.40elapsed 332%CPU (0avgtext+0avgdata 2548744maxresident)k
0inputs+21272outputs (0major+618198minor)pagefaults 0swaps

# 1.9 MB max RSS (as reported; note that /usr/bin/time here wraps only cat, the first pipeline stage, not gawk)
$ /usr/bin/time cat test.csv | tail -n +2 | gawk '{c[$1] += 1} END {for (x in c) {print x "," c[x]}}' > test_out_awk
0.00user 0.25system 0:03.38elapsed 7%CPU (0avgtext+0avgdata 1896maxresident)k
0inputs+0outputs (0major+93minor)pagefaults 0swaps

# same results modulo header and sort order
$ diff <(sort test_out_awk) <(sort test_out_mlr)
1000000a1000001
> a,count
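For anyone wanting to reproduce the comparison, the sketch below generates a file shaped like test.csv and runs the same awk counting program with the timer attached to the awk stage. In `/usr/bin/time cat test.csv | tail ... | gawk ...`, the timer covers only `cat`, so the max RSS it prints is not gawk's. This is not the script originally used: the helper names `make_test_csv` and `count_with_awk` and the `TIME_CMD`/`AWK_CMD` overrides are assumptions of this sketch, and the file layout (header `a`, then `A0` through `A999999` cycled ten times) is inferred from the `wc -l`, `head`, and `tail` output above.

```shell
#!/bin/sh
# Hypothetical helper: write a one-column CSV with header "a" followed by
# the values A0..A(n-1), cycled `reps` times, matching the shape shown above.
make_test_csv() {
  n=$1; reps=$2; out=$3
  awk -v n="$n" -v reps="$reps" 'BEGIN {
    print "a"
    for (r = 0; r < reps; r++)
      for (i = 0; i < n; i++)
        print "A" i
  }' > "$out"
}

# Hypothetical helper: the same awk counting program as in the transcript,
# but with the timer placed on the awk stage so the reported max RSS is
# awk's own. TIME_CMD and AWK_CMD are assumptions of this sketch; they
# default to the tools used in the transcript.
count_with_awk() {
  tail -n +2 "$1" |
    ${TIME_CMD:-/usr/bin/time} ${AWK_CMD:-gawk} \
      '{c[$1] += 1} END {for (x in c) {print x "," c[x]}}'
}

# Usage:
#   make_test_csv 1000000 10 test.csv
#   count_with_awk test.csv > test_out_awk
```

The timer writes its report to stderr, so redirecting stdout to test_out_awk still captures only the counts.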

jgarthur avatar May 27 '22 20:05 jgarthur

Hi @jgarthur !!

Thanks for submitting this! :)

I think the test.csv result may be due in part to "baseline RSS" rather than a leak. There are three contributors I'm aware of: Go executables are statically linked; the entire Go runtime is included in that linkage; and the (dense, not sparse) LR(1) parser tables take up quite a bit of memory. The first two are intrinsic to Go; the third is due to my use of GOCC. A "someday" project would be to try out GOGLL and see if that helps.

I think the question of leak-or-no-leak depends on the number of unique var1,var2 pairs: if there are only a few, this sounds very leaky; if there are many, it might be hash-map overhead from tracking the per-group counts.
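One rough way to frame that question with the numbers already reported in this thread is to divide the max RSS by the number of unique keys and by the number of input records. This is only back-of-the-envelope arithmetic: it ignores the baseline RSS mentioned above, and the helper name `bytes_per_item` is made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: integer bytes of max RSS per item, given the max RSS
# in KB (as printed by /usr/bin/time) and an item count.
bytes_per_item() {
  echo $(( $1 * 1024 / $2 ))
}

# Figures from the test.csv run: 2548744 KB max RSS,
# 1,000,000 unique keys, 10,000,000 data records.
bytes_per_item 2548744 1000000    # per unique key: prints 2609
bytes_per_item 2548744 10000000   # per input record: prints 260
```

Roughly 2.6 KB per short unique key would seem like a lot for hash-map bookkeeping alone, which is one reason to separate baseline from data-dependent memory before drawing conclusions.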

I will try

mlr --csv head -n 100 then count -g a -o count test.csv
mlr --csv head -n 1000 then count -g a -o count test.csv
mlr --csv head -n 10000 then count -g a -o count test.csv
mlr --csv head -n 100000 then count -g a -o count test.csv
mlr --csv head -n 1000000 then count -g a -o count test.csv
...

etc to get a sense of what's baseline RSS and what's data-dependent.
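The sweep above could be scripted along these lines. This is a sketch under assumptions: it relies on GNU time's `-f` option with the `%M` specifier (maximum resident set size in KB), the helper name `sweep_commands` is hypothetical, and it only prints the commands so they can be inspected before being run. It also assumes the file name contains no spaces or shell metacharacters.

```shell
#!/bin/sh
# Hypothetical helper: print one timed mlr invocation per head size,
# for powers of ten from 100 up to max_n. Each command, when run, makes
# GNU time report "<head size> <max RSS in KB>" on stderr.
sweep_commands() {
  file=$1; max_n=$2; n=100
  while [ "$n" -le "$max_n" ]; do
    printf "/usr/bin/time -f '%s %%M' mlr --csv head -n %s then count -g a -o count %s > /dev/null\n" \
      "$n" "$n" "$file"
    n=$((n * 10))
  done
}

# Usage (inspect first, then run):
#   sweep_commands test.csv 10000000
#   sweep_commands test.csv 10000000 | sh
```

If the max RSS stays roughly flat across the small head sizes, that floor is a reasonable estimate of the baseline; growth beyond it is the data-dependent part.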

johnkerl avatar May 28 '22 01:05 johnkerl

Related to #1119

johnkerl avatar Nov 26 '22 16:11 johnkerl

I've done as much as I can on #1119; please re-open if this is still a blocking issue.

johnkerl avatar Mar 06 '23 05:03 johnkerl