
[Docs] Add caveat about autodetect_column_names

AndyHunt66 opened this issue 5 years ago

https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html#plugins-filters-csv-autodetect_column_names

  • Version: 3.0.10

When using autodetect_column_names, if either

  • logstash is stopped and restarted in the middle of reading a csv file, or
  • logstash finishes reading a file with one column layout and starts reading a different file with a different column layout

then the behaviour is not what might be expected.

In the first case, the column names are re-read from the next line after the point where Logstash left off before being stopped, so the data in that row becomes the column names for the rest of the file.

In the second case, column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.

Additionally, I think in the second case, the header line of the new file will be ingested as data, even though it contains column names.
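As an annotated sketch of both cases (hypothetical file names and data, not taken from the original report):

```
# file: products.csv (hypothetical)
id,name,price       # detected as column names when Logstash first starts
1,widget,9.99       # ingested normally
# --- Logstash is stopped and restarted here ---
2,gadget,4.50       # re-read as the header: columns become "2", "gadget", "4.50"
3,doohickey,1.25    # parsed as { "2" => "3", "gadget" => "doohickey", "4.50" => "1.25" }

# file: users.csv (hypothetical, read after products.csv finishes)
user,email                # NOT re-read as a header: ingested as a data row
alice,alice@example.com   # parsed with the old columns: { "id" => "alice", "name" => "alice@example.com" }
```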

We should add a caveat in the docs to cover these scenarios.

Something along the lines of a note like:

When autodetect_column_names is set to true, the column names are only parsed once, when Logstash starts. Avoid this setting if there is a chance Logstash will restart in the middle of a file, or if you are ingesting multiple CSV files that each have column names as the first line.
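One possible workaround, sketched here rather than taken from the issue (the column names are hypothetical), is to list the columns explicitly and drop the header rows yourself, so neither a restart nor a new file can change the detected header:

```
filter {
  csv {
    # Explicit column list instead of autodetect_column_names
    columns => ["id", "name", "price"]
    skip_empty_columns => true
  }
  # Discard the header line itself, which would otherwise be indexed as data
  if [id] == "id" {
    drop { }
  }
}
```

This only helps when all ingested files share the same layout; files with differing layouts still need separate pipelines.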

AndyHunt66 avatar Feb 19 '20 10:02 AndyHunt66

Note that the new csv codec will help for the cases of changing files with new headers.

colinsurprenant avatar Feb 21 '20 15:02 colinsurprenant

> Note that the new csv codec will help for the cases of changing files with new headers.

Same behaviour with the csv codec:

  • column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.
  • additionally, the header line of the new file is ingested as data, even though it contains column names.

```
input {
  file {
    path => "/mnt/elk-fss/es-datasets/*.csv"
    mode => "read"
    codec => csv {
      autodetect_column_names => true
      include_headers => false
      skip_empty_columns => true
    }
  }
}
```

Here is the exception for a new header line:

```
[WARN ] 2020-07-31 10:11:02.763 [[main]>worker0] elasticsearch - Could not index event to Elasticsearch.
{:status=>400, :action=>["index", {:_id=>nil, :_index=>"mytestindex", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x1887f568>],
:response=>{"index"=>{"_index"=>"mytestindex", "_type"=>"_doc", "_id"=>"x4hapHMBG3hrlXQBhPiG", "status"=>400,
"error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field
```

Data from the new file is parsed using the previous file's column layout, which of course leads to inconsistencies such as the mapper_parsing_exception above, where header text lands in fields that Elasticsearch has already mapped to other types.

vmchiran avatar Jul 31 '20 10:07 vmchiran

@colinsurprenant - could you elaborate on why you think the csv codec might address the case of changing files with new headers?

tomryanx avatar Mar 08 '21 05:03 tomryanx