Miller: too many open files while splitting

I'm trying to filter my data by group. The first step is to split the data with the split verb and its -g option. However, I get the following error:

mlr: open split_{group id stuff}.pprint: too many open files

In bash, you can replicate this error with:

(echo a; seq 10000) | mlr --pprint split -g a

I get the following error, though the exact file it fails on is probably machine-dependent:

mlr: open split_1019.pprint: too many open files

Oct 17 '22 20:10 holmescharles

Hi @holmescharles !!

One option is to use ulimit to increase the number of open files: https://github.com/johnkerl/miller/issues/299
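As a rough sketch of the ulimit approach (not taken from the linked issue): the soft limit on open files can be raised for the current shell, but only up to the hard limit. The helper name `raise_nofile` and the value 20000 are assumptions chosen for illustration.

```shell
# Raise the soft open-files limit for the current shell, capped at the hard limit.
# raise_nofile is a hypothetical helper name, not part of Miller or the shell.
raise_nofile() {
  want=$1
  hard=$(ulimit -Hn)
  # An unprivileged process cannot raise the soft limit above the hard limit,
  # so cap the requested value.
  if [ "$hard" != "unlimited" ] && [ "$want" -gt "$hard" ]; then
    want=$hard
  fi
  ulimit -Sn "$want"
}

# Print the current soft and hard limits for reference.
echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
```

Running `raise_nofile 20000` and then the mlr command in the same shell should let the 10000-group reproduction above get further, provided the hard limit is high enough. The change only affects that shell and the processes it starts.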

What Miller really needs is a process-internal LRU cache of some sort so it wouldn't need to keep an open descriptor for every single file, but this is a development to-do ...

Oct 18 '22 03:10 johnkerl

FYI, the same limit (set with ulimit) is not unique to Miller; it also applies to other tools such as split and awk.

Incidentally, I've hit that limit before with awk, and the solution was to close unneeded files, freeing the file handles. Could Miller do that?
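For reference, a minimal sketch of that awk workaround, assuming whitespace-separated input grouped by its first field. The function name `split_by_first_field` and the `split_<key>.txt` naming are made up for this example, and this is plain awk, not Miller.

```shell
# Write each line to a file named after its first field, closing the file
# after every write so only one output descriptor is open at a time.
# split_by_first_field is a hypothetical name used only for this sketch.
split_by_first_field() {
  # ">>" appends, so reopening a file after close() does not truncate it.
  awk '{ out = "split_" $1 ".txt"; print >> out; close(out) }'
}
```

For example, `seq 10000 | split_by_first_field` writes 10000 files without holding them all open. The trade-off is an open and close per record, which is slow on large inputs, and because of the append mode, files left over from an earlier run should be removed first.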

Nov 02 '22 15:11 janxkoci

@janxkoci yes indeed, we are talking about the same thing! :)

Nov 02 '22 16:11 johnkerl