
[Docs] Add caveat about autodetect_column_names

AndyHunt66 opened this issue 5 years ago

https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html#plugins-filters-csv-autodetect_column_names

  • Version: 3.0.10

When using autodetect_column_names, if either

  • logstash is stopped and restarted in the middle of reading a csv file, or
  • logstash finishes reading a file with one column layout and starts reading a different file with a different column layout

then the behaviour is not what might be expected.

In the first case, the column names are re-read from the next line after the point where Logstash left off before being stopped, so the data in that row becomes the column names for the rest of the file.

In the second case, column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.

Additionally, I think in the second case, the header line of the new file will be ingested as data, even though it contains column names.
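As an annotated sketch of both cases (hypothetical file names and data, not taken from the original report):

```
# file: products.csv (hypothetical)
id,name,price       # detected as column names when Logstash first starts
1,widget,9.99       # ingested normally
# --- Logstash is stopped and restarted here ---
2,gadget,4.50       # re-read as the header: columns become "2", "gadget", "4.50"
3,doohickey,1.25    # parsed as { "2" => "3", "gadget" => "doohickey", "4.50" => "1.25" }

# file: users.csv (hypothetical, read after products.csv finishes)
user,email                # NOT re-read as a header: ingested as a data row
alice,alice@example.com   # parsed with the old columns: { "id" => "alice", "name" => "alice@example.com" }
```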

We should add a caveat in the docs to cover these scenarios.

Something along the lines of a note like:

When autodetect_column_names is set to true, the column names are only parsed once, when Logstash starts. Avoid this setting if there is a chance Logstash will restart in the middle of a file, or if you are ingesting multiple CSV files that each have column names as the first line.
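One possible workaround, sketched here rather than taken from the issue (the column names are hypothetical), is to list the columns explicitly and drop the header rows yourself, so neither a restart nor a new file can change the detected header:

```
filter {
  csv {
    # Explicit column list instead of autodetect_column_names
    columns => ["id", "name", "price"]
    skip_empty_columns => true
  }
  # Discard the header line itself, which would otherwise be indexed as data
  if [id] == "id" {
    drop { }
  }
}
```

This only helps when all ingested files share the same layout; files with differing layouts still need separate pipelines.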

AndyHunt66 avatar Feb 19 '20 10:02 AndyHunt66

Note that the new csv codec will help for the cases of changing files with new headers.

colinsurprenant avatar Feb 21 '20 15:02 colinsurprenant

> Note that the new csv codec will help for the cases of changing files with new headers.

Same behaviour with the csv codec:

  • column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.
  • additionally, the header line of the new file is ingested as data, even though it contains column names.

```
input {
  file {
    path => "/mnt/elk-fss/es-datasets/*.csv"
    mode => "read"
    codec => csv {
      autodetect_column_names => true
      include_headers => false
      skip_empty_columns => true
    }
  }
}
```

Here is the exception for a new header line:

```
[WARN ] 2020-07-31 10:11:02.763 [[main]>worker0] elasticsearch - Could not index event to Elasticsearch.
{:status=>400, :action=>["index", {:_id=>nil, :_index=>"mytestindex", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x1887f568>],
:response=>{"index"=>{"_index"=>"mytestindex", "_type"=>"_doc", "_id"=>"x4hapHMBG3hrlXQBhPiG", "status"=>400,
"error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field
```

Data from the new file is parsed using the previous file's column layout, which of course leads to inconsistencies such as the mapper_parsing_exception above, where header text lands in fields that Elasticsearch has already mapped to other types.

vmchiran avatar Jul 31 '20 10:07 vmchiran

@colinsurprenant - could you elaborate on why you think the csv codec might address the case of changing files with new headers?

tomryanx avatar Mar 08 '21 05:03 tomryanx