miller
Problem merging multiple files containing a common field: columns are lost in the process.
I have the following files and want to merge them by their unixTime field:
timestamp.csv
i2c-02.timestamp;unixTime;minValue;averageValue;maxValue
i2c-02.timestamp;2025-03-20 18:00:31.40;61230.000;61230.000;61230.000
i2c-02.timestamp;2025-03-20 18:00:34.10;61233.000;61233.000;61233.000
i2c-02.timestamp;2025-03-20 18:00:34.20;61233.000;61233.000;61233.000
i2c-02.timestamp;2025-03-20 18:00:34.30;61233.000;61233.000;61233.000
nodeid.csv
i2c-02.nodeid;unixTime;minValue;averageValue;maxValue
i2c-02.nodeid;2025-03-20 18:00:31.40;132.000;132.000;132.000
i2c-02.nodeid;2025-03-20 18:00:34.10;133.000;133.000;133.000
i2c-02.nodeid;2025-03-20 18:00:34.20;133.000;133.000;133.000
i2c-02.nodeid;2025-03-20 18:00:34.30;133.000;133.000;133.000
variable.csv
i2c-02.variable;unixTime;minValue;averageValue;maxValue
i2c-02.variable;2025-03-20 18:00:31.40;4.000;4.000;4.000
i2c-02.variable;2025-03-20 18:00:34.10;1.000;1.000;1.000
i2c-02.variable;2025-03-20 18:00:34.20;2.000;2.000;2.000
i2c-02.variable;2025-03-20 18:00:34.30;3.000;3.000;3.000
value.csv
i2c-02.value;unixTime;minValue;averageValue;maxValue
i2c-02.value;2025-03-20 18:00:31.40;1.767;1.767;1.767
i2c-02.value;2025-03-20 18:00:34.10;4.772;4.772;4.772
i2c-02.value;2025-03-20 18:00:34.20;17.029;17.029;17.029
i2c-02.value;2025-03-20 18:00:34.30;48.031;48.031;48.031
You'll notice that minValue, averageValue and maxValue are always identical within each file, so I can keep only one of them. I would like to obtain the following merged.csv (note the comma separator this time):
unixTime,timestamp,nodeid,variable,value
2025-03-20 18:00:31.40,61230.000,132.000,4.000,1.767
2025-03-20 18:00:34.10,61233.000,133.000,1.000,4.772
2025-03-20 18:00:34.20,61233.000,133.000,2.000,17.029
2025-03-20 18:00:34.30,61233.000,133.000,3.000,48.031
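The claim that the three stats columns always agree can be checked with mlr filter before dropping two of them; a self-contained sketch (the sample rows are recreated inline so it runs standalone, but in practice it would be pointed at the real file):

```shell
# Recreate two sample rows inline so the check runs standalone;
# point it at the real timestamp.csv in practice.
cat > timestamp.csv <<'EOF'
i2c-02.timestamp;unixTime;minValue;averageValue;maxValue
i2c-02.timestamp;2025-03-20 18:00:31.40;61230.000;61230.000;61230.000
i2c-02.timestamp;2025-03-20 18:00:34.10;61233.000;61233.000;61233.000
EOF

# Prints nothing when the three stats columns agree on every row,
# confirming that keeping a single one of them is safe.
mlr --csv --ifs ';' filter \
  '$minValue != $averageValue || $averageValue != $maxValue' \
  timestamp.csv
```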
I feel like I have tried everything for several hours, and I will spare you the different strategies I've attempted. From the manual, I think the following command should work (except for the removal of the unwanted fields; I expect to get all fields in a single file), yet the output is incorrect:
$ mlr --csv --fs ";" join -j unixTime -f timestamp.csv \
then join -j unixTime -f nodeid.csv \
then join -j unixTime -f variable.csv value.csv
unixTime;i2c-02.variable;minValue;averageValue;maxValue;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:31.40;i2c-02.variable;1.767;1.767;1.767;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:34.10;i2c-02.variable;4.772;4.772;4.772;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:34.20;i2c-02.variable;17.029;17.029;17.029;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:34.30;i2c-02.variable;48.031;48.031;48.031;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:34.40;i2c-02.variable;1.977;1.977;1.977;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:34.70;i2c-02.variable;17.200;17.200;17.200;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:34.80;i2c-02.variable;17.140;17.140;17.140;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:34.90;i2c-02.variable;17.459;17.459;17.459;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
2025-03-20 18:00:35.00;i2c-02.variable;17.140;17.140;17.140;i2c-02.nodeid;i2c-02.timestamp;i2c-02.value
I'm afraid the first field, with the periods in its values, might be the issue, but I'm not sure; I tried cutting it from each file using mlr, but failed. Or is it because the other fields have the same names across files?
What am I doing wrong? I would be very grateful for any pointers. My alternative would be awk, but I am even worse with it, and it would require more complex Bash scripting to join the files in pairs and then shift the key field and the fields to keep.
I am using mlr 6.13.0.
Or is it because other fields have the same names between files?
OK, so this is definitely the reason; I can see it if I rename the fields manually. Is there a way to work around this with mlr, without manually editing files or creating temporary files?
After many trials and errors, this seems to be what I want:
cat value.csv | mlr --csvlite --fs ";" --ofs "," \
rename 'averageValue,value' then \
join -f variable.csv -j unixTime then \
rename 'averageValue,variable' then \
join -f nodeid.csv -j unixTime then \
rename 'averageValue,nodeid' then \
join -f timestamp.csv -j unixTime then \
rename 'averageValue,timestamp' then \
join -f value.csv -j unixTime then \
cut -f unixTime,timestamp,nodeid,variable,value
unixTime,timestamp,nodeid,variable,value
2025-03-20 18:00:31.40,61230.000,132.000,4.000,1.767
2025-03-20 18:00:34.10,61233.000,133.000,1.000,4.772
2025-03-20 18:00:34.20,61233.000,133.000,2.000,17.029
2025-03-20 18:00:34.30,61233.000,133.000,3.000,48.031
But it is not exactly clear to me why I need to pipe value.csv in with cat and yet still explicitly pass value.csv in the mlr command. If I don't pipe it, the command just hangs, probably waiting for a stream?
What about using paste?
paste -d ',' \
<(mlr --csv --ifs ";" cut -f unixTime,minValue then label unixTime,timestamp timestamp.csv) \
<(mlr --csv --ifs ";" cut -f minValue then label nodeid nodeid.csv) \
<(mlr --csv --ifs ";" cut -f minValue then label variable variable.csv) \
<(mlr --csv --ifs ";" cut -f minValue then label value value.csv)
It gives you:
unixTime,timestamp,nodeid,variable,value
2025-03-20 18:00:31.40,61230.000,132.000,4.000,1.767
2025-03-20 18:00:34.10,61233.000,133.000,1.000,4.772
2025-03-20 18:00:34.20,61233.000,133.000,2.000,17.029
2025-03-20 18:00:34.30,61233.000,133.000,3.000,48.031
Thanks for your answer. Unfortunately, I'm afraid this assumes that unixTime is always exactly the same across the individual files, and that all files have the same number of rows. In reality I have missing data in some files but not others, due to network issues, so I end up with some data loss that paste ignores (which cascades into mismatched merges):
My command above:
unixTime,timestamp,nodeid,variable,value
2025-03-22 11:00:43.50,232.000,36041.000,15.000,15.980
2025-03-22 11:00:43.60,232.000,36041.000,16.000,75.984
2025-03-22 11:00:43.70,232.000,36041.000,17.000,14.106
2025-03-22 11:00:48.20,332.000,36047.000,1.000,4.882
2025-03-22 11:00:48.30,332.000,36047.000,2.000,14.290
2025-03-22 11:00:48.40,332.000,36047.000,3.000,79.921
2025-03-22 11:00:48.50,332.000,36047.000,4.000,6.828
2025-03-22 11:00:49.70,332.000,36047.000,15.000,17.320
2025-03-22 11:00:49.80,332.000,36047.000,16.000,63.385
2025-03-22 11:00:49.90,332.000,36047.000,17.000,4.766
With paste:
unixTime,timestamp,nodeid,variable,value
2025-03-22 11:00:43.50,36041.000,232.000,15.000,14.106
2025-03-22 11:00:43.60,36041.000,232.000,16.000,4.882
2025-03-22 11:00:43.70,36041.000,232.000,17.000,14.290
2025-03-22 11:00:48.20,36047.000,332.000,1.000,79.921
2025-03-22 11:00:48.30,36047.000,332.000,2.000,6.828
2025-03-22 11:00:48.40,36047.000,332.000,3.000,17.320
2025-03-22 11:00:48.50,36047.000,332.000,4.000,63.385
2025-03-22 11:00:49.70,36047.000,332.000,15.000,4.766
2025-03-22 11:00:49.80,36047.000,332.000,16.000,
2025-03-22 11:00:49.90,36047.000,332.000,17.000,
The issue is not only the missing values at the end: some values are shifted, and the correspondence (in terms of unixTime) between rows is lost. I still believe my command above is not optimal, but the most important question is whether it is robust in this scenario.
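For what it's worth, this is exactly where key-based join differs from positional paste: join pairs records by the unixTime value, so a record missing from one file drops out of the paired output instead of shifting every later row (and mlr join's --ul/--ur flags can emit the unpaired records too). A toy sketch with a deliberately missing row, using made-up demo filenames:

```shell
# Two toy inputs; node_demo.csv is missing the second timestamp row.
cat > ts_demo.csv <<'EOF'
unixTime;timestamp
2025-03-20 18:00:31.40;61230.000
2025-03-20 18:00:34.10;61233.000
EOF
cat > node_demo.csv <<'EOF'
unixTime;nodeid
2025-03-20 18:00:31.40;132.000
EOF

# join pairs records by the unixTime value, so the missing row simply
# drops out of the paired output rather than shifting later rows;
# add --ul / --ur to also emit the unpaired records.
mlr --csv --fs ';' join -f ts_demo.csv -j unixTime node_demo.csv
```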
Hi @Kabouik, if you run:
for file in timestamp nodeid variable value; do
mlr --csv --ifs ";" \
cut -f unixTime,minValue \
then rename minValue,"${file}" \
"${file}.csv" > "${file}_j.csv"
done
you can then run join on the outputs:
mlr --icsv --opprint \
join -f timestamp_j.csv -j unixTime then \
join -f nodeid_j.csv -j unixTime then \
join -f variable_j.csv -j unixTime value_j.csv
and get:
+------------------------+----------+---------+-----------+--------+
| unixTime | variable | nodeid | timestamp | value |
+------------------------+----------+---------+-----------+--------+
| 2025-03-20 18:00:31.40 | 4.000 | 132.000 | 61230.000 | 1.767 |
| 2025-03-20 18:00:34.10 | 1.000 | 133.000 | 61233.000 | 4.772 |
| 2025-03-20 18:00:34.20 | 2.000 | 133.000 | 61233.000 | 17.029 |
| 2025-03-20 18:00:34.30 | 3.000 | 133.000 | 61233.000 | 48.031 |
+------------------------+----------+---------+-----------+--------+
Related to #1821?