logstash-input-file

Allow recording the file byte offset into a field value

Open jordansissel opened this issue 9 years ago • 15 comments

Originally from: https://github.com/elasticsearch/logstash/issues/1641

jordansissel avatar Jan 23 '15 23:01 jordansissel

Similar report with specific multiline concerns in https://logstash.jira.com/browse/LOGSTASH-1044

wiibaa avatar Feb 25 '15 05:02 wiibaa

+1

naisanza avatar Jun 10 '15 16:06 naisanza

This is exactly what I am looking for.

flicker581 avatar Jul 07 '15 06:07 flicker581

+1

sergioedo avatar Jul 12 '15 14:07 sergioedo

+1

leeeena avatar Oct 30 '15 14:10 leeeena

+1

freben avatar Oct 30 '15 14:10 freben

@jordansissel For us, it is a matter of wanting to be able to read out lines in the exact order they were ingested. Ordering on @timestamp won't be enough, since some log events are emitted at a pretty high rate and share the same timestamp. Thus, any form of incrementing number (global, local or otherwise) in addition to the timestamp would be valuable.

freben avatar Oct 30 '15 14:10 freben

Sequence counter plugin, not exactly byte offset but better than nothing: https://github.com/leeeena/logstash-filter-seq

leeeena avatar Nov 03 '15 12:11 leeeena

+1

fbaligand avatar May 18 '16 16:05 fbaligand

+1

doingitbigdata avatar Oct 28 '16 14:10 doingitbigdata

I'm very new to Ruby but this patch appears to accomplish the goal of this issue.

diff -r logstash-5.0.0/vendor/bundle/jruby/1.9/gems/filewatch-0.9.0/lib/filewatch/observing_tail.rb logstash-5.0.0.eric/vendor/bundle/jruby/1.9/gems/filewatch-0.9.0/lib/filewatch/observing_tail.rb
10c10
<       def accept(line) end
---
>       def accept(line, offset) end
79c79
<             listener.accept(line)
---
>             listener.accept(line, @sincedb[watched_file.inode])

diff -r logstash-5.0.0/vendor/bundle/jruby/1.9/gems/logstash-input-file-4.0.0/lib/logstash/inputs/file.rb logstash-5.0.0.eric/vendor/bundle/jruby/1.9/gems/logstash-input-file-4.0.0/lib/logstash/inputs/file.rb
177a178
>     @offset = 0
254c255
<     attr_reader :input, :path, :data
---
>     attr_reader :input, :path, :data, :offset
266c267
<     def accept(data)
---
>     def accept(data, offset)
269c270
<       input.codec.accept(dup_adding_state(data))
---
>       input.codec.accept(dup_adding_state(data, offset))
274a276
>       event.set("offset", offset)
278c280
<     def add_state(data)
---
>     def add_state(data, offset)
279a282
>       @offset = offset
286,287c289,290
<     def dup_adding_state(line)
<       self.class.new(path, input).add_state(line)
---
>     def dup_adding_state(line, offset)
>       self.class.new(path, input).add_state(line, offset)

internetjanitor avatar Nov 07 '16 02:11 internetjanitor

This is a feature that is on the radar for any future development of this plugin.

There are things to consider though, both in general and about the patch above.

  • We cannot keep adding another argument to all the methods in the call chain for each extra piece of information.
  • This extra information is generalised as context or provenance, i.e. stuff that describes where the data came from. Allowing for the capture of context is a feature that will eventually become available on all Logstash Inputs (or data sources) as well as Beats.
  • The proposal to get the offset from @sincedb[watched_file.inode] is problematic because at the time of calling @sincedb[watched_file.inode] it has the offset of the previous line.
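To illustrate the last point, here is a minimal sketch of one alternative: carry the offset along with each line at the moment the line is extracted, rather than looking it up afterwards in a shared position store that may still describe the previous line. The method name `each_line_with_offset` and its parameters are hypothetical and are not part of filewatch or logstash-input-file.

```ruby
# Hypothetical helper, not filewatch API: yields each complete line in a chunk
# together with the byte offset at which that line starts.
def each_line_with_offset(chunk, base_offset = 0, delimiter = "\n")
  offset = base_offset
  chunk.each_line(delimiter) do |raw|
    # A trailing partial line is left for the next read.
    break unless raw.end_with?(delimiter)
    yield raw.chomp(delimiter), offset
    offset += raw.bytesize
  end
  # Position just past the last complete line, i.e. what a position store would keep.
  offset
end

seen = []
next_position = each_line_with_offset("alpha\nbeta\ngam") do |line, offset|
  seen << [line, offset]
end
```

Here the offset handed to the block belongs to the same line it is yielded with, so the event and its offset cannot drift apart, whatever the position store holds at that moment.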

It is planned that all input plugins will read and send a chunk of data to the codec. This eliminates the problem of mismatching a codec that expects chunks (or lines) with an input that is providing lines (or chunks). Logstash cannot have multiple codecs associated with an input at the moment but there are clear cases where this is needed.

The recording of progress positional data in the sincedb is done after the event is assumed to be created and put in the queue - in general terms this can be classed as acknowledgement. At the moment acknowledgement is done in an arbitrary way by each input. For example the JDBC input records the ID of the last-read-record so that, on restart, it will not reread the previous records. These "acks" are inferred from the return of the method call that adds the event to the queue.
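The "ack inferred from the return of the method call" idea can be sketched roughly as follows. This is an assumption-laden illustration, not the plugin's code: `enqueue_and_ack` and the `positions` hash (standing in for the sincedb) are hypothetical names.

```ruby
require "thread"

# Hypothetical sketch: the position is recorded only after the push to the
# queue has returned, so the return of the call acts as the acknowledgement.
def enqueue_and_ack(queue, positions, key, line, end_position)
  queue << { "message" => line, "path" => key }
  # Reaching this line means the push returned without raising.
  positions[key] = end_position
end

queue = Queue.new
positions = {} # stands in for the sincedb: key => last acknowledged byte position

enqueue_and_ack(queue, positions, "app.log", "first line", 11)
enqueue_and_ack(queue, positions, "app.log", "second line", 23)
```

If the push raised, the position would stay at its previous value and the line would be read again on restart.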

If we move the "extract lines from chunks" from filewatch to the input then we will need a callback in filewatch to accept the position information to write to the sincedb.
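A rough sketch of what such a callback could look like, assuming the watcher only reads raw chunks and the input does the line extraction. `ChunkWatcher`, `watch` and the `ack` lambda are hypothetical names chosen for illustration. The real filewatch interface may look quite different.

```ruby
require "stringio"

# Stands in for filewatch: reads raw chunks and stores acknowledged positions.
class ChunkWatcher
  attr_reader :sincedb

  def initialize
    @sincedb = {}
  end

  # Yields each chunk with its starting byte offset and a callback that the
  # consumer calls to report how far it has processed.
  def watch(key, io, chunk_size)
    until io.eof?
      start = io.pos
      chunk = io.read(chunk_size)
      yield chunk, start, ->(position) { @sincedb[key] = position }
    end
  end
end

# Stands in for the input: extracts lines from chunks and acknowledges progress.
events = []
buffer = +""
buffer_start = 0
watcher = ChunkWatcher.new

watcher.watch("app.log", StringIO.new("one\ntwo\nthr"), 5) do |chunk, start, ack|
  buffer_start = start if buffer.empty?
  buffer << chunk
  while (idx = buffer.index("\n"))
    line = buffer.slice!(0, idx + 1)
    events << { "message" => line.chomp, "offset" => buffer_start }
    buffer_start += line.bytesize
    ack.call(buffer_start)
  end
end
```

In this arrangement the watcher never needs to know about line boundaries. It only learns, through the callback, the position up to which complete lines have been turned into events.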

guyboertje avatar Nov 07 '16 11:11 guyboertje

+1

hsluoyz avatar Oct 08 '18 13:10 hsluoyz

Has this been done? I do not see it in the latest Logstash 7.6 release.

MikeSaveItiviti avatar Apr 20 '20 19:04 MikeSaveItiviti

Why is this useful feature not prioritized? It has been open for 5 years. Is it coming out anytime soon?

Blue-4367 avatar Jan 18 '22 23:01 Blue-4367