logstash-filter-csv
[Docs] Add caveat about autodetect_column_names
https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html#plugins-filters-csv-autodetect_column_names
- Version: 3.0.10
When using autodetect_column_names, if either
- Logstash is stopped and restarted in the middle of reading a CSV file, or
- Logstash finishes reading a file with one column layout and starts reading a different file with a different column layout

then the behaviour is not what might be expected.
In the first case, the column names are re-read from the next event after the point where Logstash left off before being stopped, so the event data in that row becomes the column names for the rest of the file.
In the second case, column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.
Additionally, I think in the second case, the header line will be ingested as data, even if it is column names.
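The mechanism behind both cases can be shown with a minimal Ruby sketch. This is not the plugin's actual code (the class and method names are made up); it only illustrates the assumption that the detected column names live in filter state that is set once and never reset when a new file begins:

```ruby
require "csv"

# Hypothetical stand-in for a filter with autodetect_column_names => true.
class AutodetectingFilter
  def initialize
    @columns = nil # populated from the first line seen, then never reset
  end

  # Receives one line per call, as the filter receives one event per line.
  def filter(line)
    values = CSV.parse_line(line)
    if @columns.nil?
      @columns = values # the first line seen becomes the header...
      return nil        # ...and is consumed rather than emitted
    end
    @columns.zip(values).to_h
  end
end

f = AutodetectingFilter.new
f.filter("name,age")     # header of file 1: consumed, sets the columns
f.filter("alice,30")     # mapped with file 1's columns, as expected
# A second file with a different layout starts; its header is now data,
# keyed by the stale columns from file 1:
f.filter("city,country")
```

The last call returns `{"name" => "city", "age" => "country"}`: the new header is ingested as data under the old column names, which is exactly the reported behaviour.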
We should add a caveat in the docs to cover these scenarios.
Something along the lines of a note like:
When autodetect_column_names is set to true, the column names are only parsed when Logstash starts. Refrain from using this setting if there is a chance Logstash will restart in the middle of a file, or if you are ingesting multiple CSV files which each have the column names as the first line.
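Such a note could also point at the workaround: listing the columns explicitly avoids the autodetection state entirely. A sketch under assumptions (the column names are placeholders, and skip_header may not exist in older versions of the filter):

```
filter {
  csv {
    # Explicit columns sidestep the stale-autodetection problem.
    # "name" and "age" are placeholder column names.
    columns => ["name", "age"]
    # If available in your plugin version, skip_header drops a header
    # row matching the configured columns instead of parsing it as data.
    skip_header => true
    skip_empty_columns => true
  }
}
```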
Note that the new csv codec will help for the cases of changing files with new headers.
Same behaviour with the csv codec:
- column names are not re-read on starting a new file, so the data in the new file is treated as if it were in the format of the previous file.
- additionally, the header line is ingested as data, even if it is column names.
```
input {
  file {
    path => "/mnt/elk-fss/es-datasets/*.csv"
    mode => "read"
    codec => csv {
      autodetect_column_names => true
      include_headers => false
      skip_empty_columns => true
    }
  }
}
```
Here is the exception for a new header line:

```
[WARN ] 2020-07-31 10:11:02.763 [[main]>worker0] elasticsearch - Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"mytestindex", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x1887f568>], :response=>{"index"=>{"_index"=>"mytestindex", "_type"=>"_doc", "_id"=>"x4hapHMBG3hrlXQBhPiG", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field
```
Data from the new file is handled in the previous format, which of course leads to inconsistencies.
@colinsurprenant - could you elaborate on why you think the csv codec might address the case of changing files with new headers?