
Using Miller in a join prepipe

Open mfernandez-turnto opened this issue 4 years ago • 8 comments

I guess this is more of a question than an issue, but I don't know of another place to ask. BTW, Miller is the most wonderful thing that I've encountered in a long while; I use it every day.

Here is the thing: let's say I have two files that I want to join on a column named identifier. Let's call them file1 and file2.

    mlr --csv join -u --lp f2_ --rp f1_ -j identifier -f file2 file1

works great if neither file contains a list of values in identifier, which would require nest. If just one of them (for simplicity, let's say file1) has nested values in identifier, it is quite easy to fix this without creating additional files:

    mlr --csv nest --evar ',' -f identifier then join -u --lp f2_ --rp f1_ -j identifier -f file2 file1

But when both files have comma-separated lists in identifier, I have to resort to producing an intermediate file. I thought:

    mlr --csv nest --evar ',' -f identifier then join -u --prepipe 'mlr --csv nest --evar "," -f identifier' --lp f2_ --rp f1_ -j identifier -f file2 file1

could do what I wanted, but it did not work. Is there a standard way of doing that?

mfernandez-turnto avatar Jul 06 '20 23:07 mfernandez-turnto
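
For reference, the intermediate-file workaround mentioned in the question could look roughly like the sketch below. The sample data, the `workdir` directory, and the `file2.exploded` name are made up for illustration; only the `nest` and `join` invocations are taken from the commands above.

```shell
# Hypothetical sample inputs; both have comma-separated lists in "identifier".
workdir=$(mktemp -d)
printf 'identifier,x\n"a,b",1\nc,2\n' > "$workdir/file1"
printf 'identifier,y\n"b,c",10\nd,20\n' > "$workdir/file2"

# Step 1: explode the identifier lists of the left file into an intermediate file.
mlr --csv nest --evar ',' -f identifier "$workdir/file2" > "$workdir/file2.exploded"

# Step 2: explode the right file in the main chain, then join against the intermediate file.
joined=$(mlr --csv nest --evar ',' -f identifier \
  then join -u --lp f2_ --rp f1_ -j identifier \
    -f "$workdir/file2.exploded" "$workdir/file1")
printf '%s\n' "$joined"

# Remove the sample inputs and the intermediate file.
rm -rf "$workdir"
```

With this sample data, only the identifiers b and c appear on both sides after exploding, so only those should be paired.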

Miller is the most wonderful thing that I've encountered in a long while, I use it every day.

Thanks!!! :)

Is there a standard way of doing that?

No, I think not -- I never thought of this combination of nest and join. :^/

johnkerl avatar Jul 07 '20 02:07 johnkerl

Well, the temp file is OK. Still, I could not guess why I cannot use mlr itself as a prepipe. Is it because of the "<" redirect restriction? Just curious, if you know offhand.

mfernandez-turnto avatar Jul 07 '20 18:07 mfernandez-turnto

I have a similar use case and am currently working around the temporary file using process substitution:

    mlr --csv join -f <(mlr --csv cat left.csv) -j id right.csv

sonicdoe avatar Nov 05 '20 13:11 sonicdoe

@mfernandez-turnto @sonicdoe it looks like process substitution is the right thing to do -- ? I'm closing this out but please let me know if I'm mistaken and we can re-open -- thank you!

johnkerl avatar Jan 01 '22 00:01 johnkerl

I agree, especially because process substitution keeps all of Miller’s flexibility. Should we document this in Questions about joins, though?

sonicdoe avatar Jan 02 '22 14:01 sonicdoe
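
Applied to the original question, where both files need `nest` before the join, process substitution might be used as in the sketch below. It assumes bash (or zsh) and a platform where Miller can open the substituted path; the sample data and the `workdir` directory are hypothetical.

```shell
#!/usr/bin/env bash
# Process substitution is a bash/zsh feature, not plain POSIX sh.

# Hypothetical sample inputs; both have comma-separated lists in "identifier".
workdir=$(mktemp -d)
printf 'identifier,x\n"a,b",1\nc,2\n' > "$workdir/file1"
printf 'identifier,y\n"b,c",10\nd,20\n' > "$workdir/file2"

# The inner mlr explodes file2 and feeds it to join as the left file;
# the outer chain explodes file1 before it reaches the join.
joined=$(mlr --csv nest --evar ',' -f identifier \
  then join -u --lp f2_ --rp f1_ -j identifier \
    -f <(mlr --csv nest --evar ',' -f identifier "$workdir/file2") \
  "$workdir/file1")
printf '%s\n' "$joined"

# Remove the sample inputs.
rm -rf "$workdir"
```

No intermediate file is written, and the inner command can be any Miller chain, not only `nest`.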

Process substitution with mlr is not working for me on Windows with Cygwin.

    uname -r
    3.4.6-1.x86_64

    csvdb="$(printf "a,b,c\n1,2,3\n4,5,6\n7,8,9")"

    # https://miller.readthedocs.io/en/latest/streaming-and-memory/
    # Fully streaming verbs
    mlr --csv cat <(echo "$csvdb")
    mlr: open /proc/self/fd/11: The system cannot find the path specified..

    # Non-streaming, retaining all records
    mlr --csv unsparsify <(echo "$csvdb")
    mlr: open /proc/self/fd/11: The system cannot find the path specified..

    # Process substitution works with commands other than mlr
    cat <(echo "$csvdb")
    a,b,c
    1,2,3
    4,5,6
    7,8,9

    # Works as expected when using a real file
    printf "$csvdb" > ./csvdb.csv

    mlr --csv cat ./csvdb.csv
    a,b,c
    1,2,3
    4,5,6
    7,8,9

    mlr --csv unsparsify ./csvdb.csv
    a,b,c
    1,2,3
    4,5,6
    7,8,9

The same mlr commands work as expected for me if I switch to a Linux machine. Since process substitution works with everything except mlr, is there something different about the Windows build?

railgauge avatar Apr 20 '23 21:04 railgauge
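
The error message above names `/proc/self/fd/11`, which hints at the mechanism: the shell replaces `<(...)` with a file name, and the program then has to open that name itself. The sketch below prints the substituted name without involving Miller. The exact path varies by system and shell, and whether this explains the Windows behavior reported here is only a guess.

```shell
# Ask bash to print the argument a command receives in place of <(...).
# bash -c is used so the sketch also runs when the calling shell is plain sh.
substituted=$(bash -c 'echo <(true)')
printf '%s\n' "$substituted"

# cat can read the data because it opens that name through the same
# POSIX layer as the shell that created it.
contents=$(bash -c 'cat <(printf "a,b,c\n1,2,3\n")')
printf '%s\n' "$contents"
```

A program that resolves paths outside that layer would see a name it cannot find, which would match the "cannot find the path specified" error.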

@railgauge I'll check it out.

Windows is definitely different in many ways -- see also https://miller.readthedocs.io/en/latest/miller-on-windows/ -- but Cygwin smooths out many of those differences.

Can we first check, what's your mlr version output?

johnkerl avatar Apr 20 '23 21:04 johnkerl

Thanks! I have observed this behavior with mlr 6.7.0-dev (git clone https://github.com/johnkerl/miller) and mlr 6.7.0 compiled from source.

I only compiled from source because I had a separate issue: when I tried using Miller from choco install miller, or downloading the latest pre-compiled binary from GitHub with wget https://github.com/johnkerl/miller/releases/download/v6.7.0/miller-6.7.0-windows-amd64.zip via Cygwin zsh, I always got an error about a file not being found even though it does exist. It also doesn't seem to help if I specify the full file paths, e.g.:

    /cygdrive/c/Users/username/Downloads/tmp/mlr.exe --csv cat /cygdrive/c/Users/username/Downloads/tmp/csvdb.csv
    C:\Users\username\Downloads\tmp\mlr.exe :  The system cannot find the path specified.

    /cygdrive/c/ProgramData/chocolatey/bin/mlr.exe --csv cat ./csvdb.csv
    C:\ProgramData\chocolatey\lib\miller\tools\mlr.exe :  The system cannot find the path specified.

The Windows path includes C:\ProgramData\chocolatey\bin

This seems to be something about paths, since the error states it is looking for a Windows path rather than a Cygwin path. If I use the choco or pre-compiled binaries in PowerShell, they work with regular file input (also mlr 6.7.0), just not from bash/zsh/fish unless I compile from source.

In zsh I just tried $(cygpath -w ./mlr.exe) --csv cat ./csvdb.csv with the pre-compiled Miller, and this produced the expected output, so I think this supports the path theory. Still no luck getting process substitution to work.

railgauge avatar Apr 21 '23 00:04 railgauge
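
One possible way around the substituted path for single-input cases such as the `cat` test above is to pipe the data instead, since Miller reads standard input when no file name is given. This is an untested assumption for Cygwin with a Windows build, not something confirmed in this thread; the `csvdb` variable mirrors the one used above.

```shell
# Same sample data as in the report above.
csvdb="$(printf 'a,b,c\n1,2,3\n4,5,6\n7,8,9')"

# With no file argument, mlr reads standard input, so no /proc or /dev/fd
# path has to be opened by mlr itself.
piped=$(printf '%s\n' "$csvdb" | mlr --csv cat)
printf '%s\n' "$piped"
```

This does not cover the left file of a join, which is still given by name with `-f`, so a temporary file may remain necessary there.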