
Header not CSV but data is CSV with -o <fmt>

Open mkgvt opened this issue 2 years ago • 4 comments

Specifying the output format using the -o <fmt> option results in the body being CSV, but the header is not. This makes further processing with CSV tools (such as xsv) more difficult than it should be, because the header line is parsed as a single field rather than one field per formatted item.

Example: I would have expected a comma immediately after (raw); its absence leads to an error from xsv:

$ nfdump -o 'fmt:%tsr,%bpp' -r /nfcapd.202306230000
Date first seen (raw)        Bpp
1687492697.088,    40
1687492696.832,    44
1687492697.856,    40
1687387348.992,   216
1687492696.320,    40
1687492699.648,    40
1687492698.368,    40
1687492648.960,   380
1687492799.488,   134
...
$ nfdump -o 'fmt:%tsr,%bpp' -r /nfcapd.202306230000 | xsv table
Date first seen (raw)        Bpp
CSV error: record 1 (line: 2, byte: 33): found record with 2 fields, but the previous record has 1 fields

I believe the issue occurs when the format is parsed (in ParseOutputFormat) and header_string is created. It looks like commas should be inserted between the fields at that point.
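Until the header is fixed, one possible workaround (my own suggestion, not part of nfdump) is to rewrite the header line with sed. It assumes that fmt columns are padded with runs of two or more spaces, which are turned into commas on line 1 only:

```shell
# Fix only line 1 (the header): runs of 2+ spaces become commas, e.g.
# "Date first seen (raw)        Bpp" -> "Date first seen (raw),Bpp"
nfdump -o 'fmt:%tsr,%bpp' -r /nfcapd.202306230000 | sed '1s/ \{2,\}/,/g' | xsv table
```

This breaks if a column name itself contains a double space, so it is a stopgap rather than a robust fix.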

mkgvt avatar Jun 27 '23 13:06 mkgvt

Yes, true, but it was not meant to create CSV output :) I will check whether the change breaks other things.

phaag avatar Jul 01 '23 08:07 phaag

for csv, i just use -o csv

thezoggy avatar Jul 10 '23 18:07 thezoggy

In order to be more flexible, I propose replacing the old csv code with a user-defined format, such as nfdump -o 'csv:%tsr,%bpp'

It needs some work to implement.

phaag avatar Oct 14 '23 15:10 phaag

I agree completely here, because an all-columns CSV export is bloat for everyone's use; only a limited number of columns is needed for practical jobs. Having to dump all columns in every situation produces very large files, which wastes energy (and thus CO2), and all those unneeded bytes in the CSV drive up CPU, RAM, and disk utilization :)

Currently I'm adding the header line with a simple bash script, but that approach is not very robust.

A second point: CSV export should not be dropped (it is marked as obsolete in the current version)! JSON export has very large overhead; it is bloated for this type of data: more overhead, more disk utilization, more I/O, and more RAM- and CPU-intensive operations. So CSV should be made more flexible, as proposed by the topic starter, not dropped.

Example workflow: nfcapd -> nfdump -> CSV -> ClickHouse time-series DB import from the CSV file, and the job is done flawlessly!
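The simple bash workaround mentioned above could look like the following sketch (the column names and output file are my own illustration, not anything nfdump produces):

```shell
#!/bin/sh
# Prepend a hand-written CSV header and drop nfdump's non-CSV header line.
{
  printf 'ts_raw,bpp\n'   # column names chosen to match the fmt spec below
  nfdump -o 'fmt:%tsr,%bpp' -r nfcapd.202306230000 | tail -n +2
} > flows.csv
```

It works, but the header has to be kept in sync with the fmt spec by hand, which is exactly why a built-in 'csv:' format would be preferable.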

MrdUkk avatar Jan 21 '24 10:01 MrdUkk

New csv format implemented. The header now also gets properly comma-separated. You may define your own csv format with nfdump -o 'csv:%tsr,%bpp ...'. Please note the difference now between -o fmt:... and -o csv:...

phaag avatar Jun 23 '24 10:06 phaag