miller evaluates all records even when not needed

Open balki opened this issue 1 year ago • 5 comments

In the example below, only the first 5 records are needed, but system in put ran for all 10 records, as we can see in the tmp file.

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p  put '$v = system("echo hello; echo err >> /tmp/1")' then head -n 5; nl /tmp/1
index v
1     hello
2     hello
3     hello
4     hello
5     hello
     1  err
     2  err
     3  err
     4  err
     5  err
     6  err
     7  err
     8  err
     9  err
    10  err

When head is moved ahead of put, it works fine.

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p head -n 5 then put '$v = system("echo hello; echo err >> /tmp/1")' ; nl /tmp/1 
index v
1     hello
2     hello
3     hello
4     hello
5     hello
     1  err
     2  err
     3  err
     4  err
     5  err

It appears that each verb is run on all records before moving on to the next verb. Can Miller be made lazy? I understand this will not be possible when stats/grouping is used, but I thought the simple case would work lazily.

balki avatar Sep 19 '24 19:09 balki

There is indeed laziness and some early-out logic when head is in the verb list -- however there is some batching (default 500 rows at a time) which was necessary for performance in the port from C to Go ....

  • https://github.com/johnkerl/miller/blob/main/README-dev.md
  • https://github.com/johnkerl/miller/pull/779

If we're getting readahead of over 500 records then that's a bug though ...

johnkerl avatar Sep 19 '24 19:09 johnkerl

(In C it was record-at-a-time lazy ... in Go it's 500-records-at-a-time lazy ....)
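
To make the difference concrete, here is a minimal Python sketch of batch-at-a-time laziness. It is not Miller's actual implementation; the helper names `batches` and `put_then_head` are hypothetical, and the counter stands in for the `system(...)` side effect from the report. The only assumption taken from the discussion above is that a verb processes a whole batch before the next verb sees any of it.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of up to batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def put_then_head(records, batch_size, n):
    """Model 'put ... then head -n n' over batched input.

    Returns (output_rows, number_of_side_effect_calls).
    """
    calls = 0
    out = []
    for batch in batches(records, batch_size):
        # 'put' stage: runs on every record of the batch before 'head' sees any.
        processed = []
        for rec in batch:
            calls += 1  # stands in for system("echo hello; echo err >> /tmp/1")
            processed.append((rec, "hello"))
        # 'head' stage: keep only what is still needed, then stop reading input.
        out.extend(processed[: n - len(out)])
        if len(out) >= n:
            break
    return out, calls


if __name__ == "__main__":
    for size in (1, 500):
        rows, calls = put_then_head(range(1, 11), size, 5)
        print(f"batch_size={size}: {len(rows)} rows out, {calls} side-effect calls")
```

With 10 input records, a batch size of 1 gives 5 side-effect calls and a batch size of 500 gives 10, which matches the 10 lines seen in /tmp/1 in the report. In this model the over-evaluation is bounded by one batch, so a much larger input would still stop after the first 500 records.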

johnkerl avatar Sep 19 '24 19:09 johnkerl

OTOH this looks odd to me:

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p head -n 5 then put '$v = system("echo hello; echo err >> /tmp/1")' ; nl /tmp/1 

🤔 👀

johnkerl avatar Sep 19 '24 19:09 johnkerl

(In C it was record-at-a-time lazy ... in Go it's 500-records-at-a-time lazy ....)

Thanks for clarifying. Makes sense. I was running the command below on my logs and found it took a long time (11 seconds) when head was used after put, but the other way around was instant. I think I should just move filter and head as early as possible.

❯ mlr --l2p --tz America/Toronto put '$ts = sec2localtime($ts); $cn = system(format("geoiplookup {} | grep Country", $request.remote_ip))' then filter '$status == 200' then flatten then cut -of ts,cn,request.remote_ip,request.uri then head caddy.log | wc -l
11

~/tmp/millerexp took 11s
❯ mlr --l2p --tz America/Toronto filter '$status == 200' then head then put '$ts = sec2localtime($ts); $cn = system(format("geoiplookup {} | grep Country", $request.remote_ip))' then filter '$status == 200' then flatten then cut -of ts,cn,request.remote_ip,request.uri caddy.log | wc -l
11
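
The effect of the ordering can be sketched without Miller at all. The Python below is a rough model only: `count_lookups` and `slow_lookup` are hypothetical names, the log records are made up, and the pipeline is treated as fully eager (the worst case), which is not a claim about how Miller schedules verbs. It just counts how many times the expensive call would run under each ordering.

```python
def count_lookups(records, head_first, n=10):
    """Return (rows, lookups) for two verb orderings in a worst-case eager model."""
    lookups = 0

    def slow_lookup(rec):
        nonlocal lookups
        lookups += 1  # stands in for system("geoiplookup ...")
        return {**rec, "cn": "??"}

    def keep_ok(recs):
        # Models: filter '$status == 200'
        return [r for r in recs if r["status"] == 200]

    if head_first:
        # filter, then head, then the expensive put
        rows = [slow_lookup(r) for r in keep_ok(records)[:n]]
    else:
        # expensive put on everything, then filter, then head
        rows = keep_ok([slow_lookup(r) for r in records])[:n]
    return rows, lookups


if __name__ == "__main__":
    # Hypothetical log: 1000 records, every other one has status 200.
    log = [{"status": 200 if i % 2 == 0 else 404, "ip": f"10.0.0.{i % 256}"}
           for i in range(1000)]
    for head_first in (False, True):
        rows, lookups = count_lookups(log, head_first)
        print(f"head_first={head_first}: {len(rows)} rows, {lookups} lookups")
```

Both orderings return the same 10 rows, but the put-first ordering performs 1000 lookups in this model while the head-first ordering performs 10. Cheap, selective verbs placed early cap the work done by expensive ones regardless of how lazy the pipeline is.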

balki avatar Sep 19 '24 20:09 balki

it took a long time (11 seconds) when head was used after put but the other way was instant

@balki this needs fixing for sure.

johnkerl avatar Sep 22 '24 20:09 johnkerl