Miller evaluates all records even when not needed
In the example below, only the first 5 records are needed, but the system call in put runs for every record, as the tmp file shows.
❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p put '$v = system("echo hello; echo err >> /tmp/1")' then head -n 5; nl /tmp/1
index v
1 hello
2 hello
3 hello
4 hello
5 hello
1 err
2 err
3 err
4 err
5 err
6 err
7 err
8 err
9 err
10 err
When head is moved ahead of put, it works fine.
❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p head -n 5 then put '$v = system("echo hello; echo err >> /tmp/1")' ; nl /tmp/1
index v
1 hello
2 hello
3 hello
4 hello
5 hello
1 err
2 err
3 err
4 err
5 err
It appears that each verb is run on all records before moving on to the next verb. Can Miller be made lazy? I understand this is not possible when stats or grouping is used, but for a simple case I thought it would work lazily.
There is indeed laziness and some early-out logic when head is in the verb list -- however there is some batching (default 500 rows at a time) which was necessary for performance in the port from C to Go ....
- https://github.com/johnkerl/miller/blob/main/README-dev.md
- https://github.com/johnkerl/miller/pull/779
If we're getting readahead of over 500 records then that's a bug though ...
(In C it was record-at-a-time lazy ... in Go it's 500-records-at-a-time lazy ....)
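To make the batching point concrete, here is a toy model in Python. It is only a sketch of batch-at-a-time laziness, not Miller's actual Go code: the names `read_batches`, `put_stage`, `head_stage`, and `run` are made up for illustration, and the list `side_effects` stands in for the `echo err >> /tmp/1` side effect from the example above.

```python
from itertools import islice


def read_batches(records, batch_size):
    """Lazily yield lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def put_stage(batches, side_effects):
    """Toy 'put': runs a side effect on every record of each batch it is given."""
    for batch in batches:
        out = []
        for rec in batch:
            side_effects.append(rec)  # stands in for system("... >> /tmp/1")
            out.append((rec, "hello"))
        yield out


def head_stage(batches, n):
    """Toy 'head -n': emits the first n records, then stops pulling upstream."""
    emitted = 0
    for batch in batches:
        for rec in batch:
            if emitted == n:
                return
            emitted += 1
            yield rec
        if emitted == n:
            return


def run(num_records, batch_size, n=5):
    """Return (records emitted by head, records evaluated by put)."""
    side_effects = []
    source = read_batches(range(1, num_records + 1), batch_size)
    out = list(head_stage(put_stage(source, side_effects), n))
    return len(out), len(side_effects)


print(run(10, 500))    # (5, 10): all 10 records fit in one batch, so put sees them all
print(run(10, 1))      # (5, 5): record-at-a-time laziness
print(run(2000, 500))  # (5, 500): put never sees more than the first batch
```

In this model the 10-record example falls entirely inside the first batch, which is why put touches all 10 records even though head only needs 5. With larger input, the work done by put is capped at one batch.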
OTOH this looks odd to me:
❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p head -n 5 then put '$v = system("echo hello; echo err >> /tmp/1")' ; nl /tmp/1
🤔 👀
Thanks for clarifying. Makes sense. I was running the commands below on my logs and found that it took a long time (11 seconds) when head was used after put, while the other order was instant. I think I should just move filter and head as early as possible.
❯ mlr --l2p --tz America/Toronto put '$ts = sec2localtime($ts); $cn = system(format("geoiplookup {} | grep Country", $request.remote_ip))' then filter '$status == 200' then flatten then cut -of ts,cn,request.remote_ip,request.uri then head caddy.log | wc -l
11
~/tmp/millerexp took 11s
❯ mlr --l2p --tz America/Toronto filter '$status == 200' then head then put '$ts = sec2localtime($ts); $cn = system(format("geoiplookup {} | grep Country", $request.remote_ip))' then filter '$status == 200' then flatten then cut -of ts,cn,request.remote_ip,request.uri caddy.log | wc -l
11
@balki this needs fixing for sure.