logstash-filter-csv

`autodetect_column_names` does not work with multiple worker threads

Open danhermann opened this issue 6 years ago • 6 comments

There's a race condition in the `autodetect_column_names` feature when more than one worker thread is running. The filter assumes that the first line it receives is the CSV header containing the column names, but with multiple worker threads it may receive lines in a different order from the one in which they appear in the input. `skip_header` may have a similar problem.

Noticed while debugging the `java.lang.ArrayIndexOutOfBoundsException` mentioned in this blog post:

https://mikehillwig.com/2018/02/23/making-peace-with-logstash-part-2-parsing-a-csv/

danhermann, Mar 07 '18 17:03

This is likely a result of event ordering not being guaranteed on Logstash 6+ when events are inserted into the queue between inputs and filters+outputs. A proper fix requires synchronization across all threads, which is essentially a rearchitecture of a big portion of the filter. The only current workaround is indeed setting workers => 1. Features like `autodetect_column_names` shouldn't rely on event ordering, as we don't guarantee it, especially for workers > 1.

jsvd, Mar 27 '18 16:03

Thanks for pointing this out. It helped me fix my issue, where `autodetect_column_names` always messed up my mapping. I've set workers => 1 to fix it.

However, I currently set "pipeline.workers: 1" in "logstash.yml", which affects every pipeline. Is there a configuration item I could use in a specific pipeline.conf instead? That way, only the CSV pipeline that needs the `autodetect_column_names` feature would be limited to 1 worker.

Another issue: when I have 2 files, each with its own header, the header of the second file is still loaded. Is there any way to deal with that?

siben168, Apr 05 '18 15:04

@siben168, you can set the number of workers on each pipeline in the pipelines.yml file. See more details here: https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html
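As a rough sketch of what that could look like (not taken from this thread), a `pipelines.yml` can pin only the CSV pipeline to a single worker. The pipeline ids and config paths below are hypothetical placeholders; `pipeline.id`, `path.config`, and `pipeline.workers` are the setting names assumed here.

```yaml
# pipelines.yml (sketch; ids and paths are hypothetical placeholders)

# Pipeline that uses the csv filter with autodetect_column_names:
# limited to one worker so events are not processed out of order.
- pipeline.id: csv-autodetect
  path.config: "/etc/logstash/conf.d/csv-autodetect.conf"
  pipeline.workers: 1

# Any other pipeline: pipeline.workers is left unset here, so it
# falls back to the value from logstash.yml or the built-in default.
- pipeline.id: other-pipeline
  path.config: "/etc/logstash/conf.d/other-pipeline.conf"
```

Settings not given in `pipelines.yml` fall back to `logstash.yml`, so a global `pipeline.workers: 1` there would need to be removed for the other pipelines to use more workers.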

Unfortunately, as the filter is currently written, I don't know of a way to handle multiple files where each one has its own header line.

danhermann, Apr 05 '18 19:04

I'm experiencing this in Logstash 5.6.2 as well.

pmb311, May 15 '18 18:05

Note that the new csv codec should be more appropriate for this. In particular, when paired with the file input, it uses a separate codec instance per file, so it can correctly adjust the columns for files whose headers may differ.

colinsurprenant, Feb 21 '20 16:02