logstash-input-file Allow recording the file byte offset into a field value

Open jordansissel opened this issue 9 years ago • 15 comments

Originally from: https://github.com/elasticsearch/logstash/issues/1641

Jan 23 '15 23:01 jordansissel

Similar report with specific multiline concerns in https://logstash.jira.com/browse/LOGSTASH-1044

Feb 25 '15 05:02 wiibaa

Jun 10 '15 16:06 naisanza

This is exactly what I am looking for.

Jul 07 '15 06:07 flicker581

Jul 12 '15 14:07 sergioedo

Oct 30 '15 14:10 leeeena

Oct 30 '15 14:10 freben

@jordansissel For us, it is a matter of wanting to be able to read out lines in the exact order they were ingested. Ordering on @timestamp won't be enough, since some log events are emitted at a pretty high rate and have the same timestamp. Thus, any form of incrementing number (global, local or other) in addition to the timestamp would be valuable.

Oct 30 '15 14:10 freben

Sequence counter plugin, not exactly byte offset but better than nothing: https://github.com/leeeena/logstash-filter-seq

Nov 03 '15 12:11 leeeena

May 18 '16 16:05 fbaligand

Oct 28 '16 14:10 doingitbigdata

I'm very new to Ruby but this patch appears to accomplish the goal of this issue.

diff -r logstash-5.0.0/vendor/bundle/jruby/1.9/gems/filewatch-0.9.0/lib/filewatch/observing_tail.rb logstash-5.0.0.eric/vendor/bundle/jruby/1.9/gems/filewatch-0.9.0/lib/filewatch/observing_tail.rb
10c10
<       def accept(line) end
---
>       def accept(line, offset) end
79c79
<             listener.accept(line)
---
>             listener.accept(line, @sincedb[watched_file.inode])

diff -r logstash-5.0.0/vendor/bundle/jruby/1.9/gems/logstash-input-file-4.0.0/lib/logstash/inputs/file.rb logstash-5.0.0.eric/vendor/bundle/jruby/1.9/gems/logstash-input-file-4.0.0/lib/logstash/inputs/file.rb
177a178
>     @offset = 0
254c255
<     attr_reader :input, :path, :data
---
>     attr_reader :input, :path, :data, :offset
266c267
<     def accept(data)
---
>     def accept(data, offset)
269c270
<       input.codec.accept(dup_adding_state(data))
---
>       input.codec.accept(dup_adding_state(data, offset))
274a276
>       event.set("offset", offset)
278c280
<     def add_state(data)
---
>     def add_state(data, offset)
279a282
>       @offset = offset
286,287c289,290
<     def dup_adding_state(line)
<       self.class.new(path, input).add_state(line)
---
>     def dup_adding_state(line, offset)
>       self.class.new(path, input).add_state(line, offset)

Nov 07 '16 02:11 internetjanitor

This is a feature that is on the radar for any future development of this plugin.

There are things to consider though, both in general and about the patch above.

We cannot keep adding another argument to all the methods in the call chain for each extra piece of information.
This extra information is generalised as context or provenance, i.e. stuff that describes where the data came from. Allowing for the capture of context is a feature that will eventually become available on all Logstash Inputs (or data sources) as well as Beats.
The proposal to get the offset from @sincedb[watched_file.inode] is problematic because, at the time listener.accept is called, @sincedb[watched_file.inode] still holds the offset of the previous line.
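
To make the last point concrete, here is a minimal sketch in plain Ruby of tracking a per-line byte offset while splitting a chunk, rather than reading it back from a progress store. It is independent of filewatch: `lines_with_offsets` and the `:start` / `:next` keys are hypothetical names made up for illustration, and the sketch assumes "\n"-delimited lines and simply ignores a trailing partial line.

```ruby
# Hypothetical helper, not part of filewatch or logstash-input-file.
# Splits a chunk into complete lines and records, for each line:
#   :start - the byte offset in the file where the line begins
#   :next  - the byte offset just past the line (what a progress
#            store would hold once this line has been handled)
def lines_with_offsets(chunk, chunk_start)
  position = chunk_start
  pairs = []
  chunk.each_line do |line|
    # Assumption: a line without a trailing newline is incomplete; stop here.
    break unless line.end_with?("\n")
    line_end = position + line.bytesize
    pairs << { line: line.chomp, start: position, next: line_end }
    position = line_end
  end
  [pairs, position]
end

pairs, resume_at = lines_with_offsets("ab\ncde\nf", 100)
# pairs[0] => { line: "ab",  start: 100, next: 103 }
# pairs[1] => { line: "cde", start: 103, next: 107 }
# resume_at => 107 (the partial "f" is left for the next read)
```

In this sketch the `:next` value of one line equals the `:start` of the following one, so a store that is only updated after a line has been handled would, when read while the next line is being delivered, still describe the previous line.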

It is planned that all input plugins will read and send a chunk of data to the codec. This eliminates the problem of mismatching a codec that expects chunks (or lines) with an input that is providing lines (or chunks). Logstash cannot have multiple codecs associated with an input at the moment but there are clear cases where this is needed.

The recording of progress positional data in the sincedb is done after the event is assumed to be created and put in the queue - in general terms this can be classed as acknowledgement. At the moment acknowledgement is done in an arbitrary way by each input. For example the JDBC input records the ID of the last-read-record so that, on restart, it will not reread the previous records. These "acks" are inferred from the return of the method call that adds the event to the queue.

If we move the "extract lines from chunks" from filewatch to the input then we will need a callback in filewatch to accept the position information to write to the sincedb.
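
A rough sketch of what such a callback arrangement could look like, under stated assumptions: `PositionStore` is a hypothetical in-memory stand-in for the sincedb, `process_chunk` is a hypothetical input-side line extractor, and the plain Array stands in for the Logstash queue. None of these names come from filewatch or the plugin.

```ruby
# Hypothetical in-memory stand-in for the sincedb: last acknowledged
# byte position per file key (for example an inode).
class PositionStore
  def initialize
    @positions = Hash.new(0)
  end

  # Record that everything before `position` has been queued.
  def ack(key, position)
    @positions[key] = position
  end

  def [](key)
    @positions[key]
  end
end

# Hypothetical input-side extractor: turns a chunk into events, pushes
# each one onto the queue, then yields the new position so the caller
# can acknowledge it.
def process_chunk(chunk, chunk_start, queue)
  position = chunk_start
  chunk.each_line do |line|
    # Assumption: stop at an incomplete trailing line.
    break unless line.end_with?("\n")
    queue << { "message" => line.chomp, "offset" => position }
    position += line.bytesize
    yield position # acknowledgement callback, called after the enqueue
  end
end

store = PositionStore.new
queue = []
process_chunk("a\nbb\nc", 0, queue) { |pos| store.ack("inode-1", pos) }
# queue            => [{"message"=>"a", "offset"=>0}, {"message"=>"bb", "offset"=>2}]
# store["inode-1"] => 5
```

The point of the design is ordering: the position is only reported after the event has been added to the queue, which mirrors the "ack inferred from the return of the method call" behaviour described above.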

Nov 07 '16 11:11 guyboertje

Oct 08 '18 13:10 hsluoyz

Has this been done? I do not see it in the latest Logstash release (7.6).

Apr 20 '20 19:04 MikeSaveItiviti

Why is this useful feature not prioritized? It has been open for 5 years. Is it coming out anytime soon?

Jan 18 '22 23:01 Blue-4367
