logstash-input-file Expose extra stats from the input file

Currently the file input exposes the path of the file being read, so the path appears in the event.

My proposal is to expose mtime, byte offset, and file size as optional fields in the file input, so that instead of the following event:

{
       "message" => "Copyright 2012-2015 Elasticsearch",
      "@version" => "1",
    "@timestamp" => "2016-02-10T17:27:05.525Z",
          "host" => "Gabriels-MacBook-Pro.local",
          "path" => "/Users/Gabriel/Documents/ElasticSearch/logstash-2.1.1/NOTICE.TXT"
}

We can get:

{
        "message" => "Copyright 2012-2015 Elasticsearch",
       "@version" => "1",
     "@timestamp" => "2016-02-10T17:27:05.525Z",
          "mtime" => "2016-02-10T17:27:05.525Z",
    "byte_offset" => "1231231",
      "file_size" => "12mb",
           "host" => "Gabriels-MacBook-Pro.local",
           "path" => "/Users/Gabriel/Documents/ElasticSearch/logstash-2.1.1/NOTICE.TXT"
}

Or a new object field, extra_stats, that contains all this information.

Feb 10 '16 17:02 gmoskovicz

@jordansissel what do you think?

Feb 10 '16 17:02 gmoskovicz

+1 the idea.

Some thoughts on the fields:

mtime - I feel like this needs to have a more human readable name.
file_size - My preference is to have this value be a numeric in byte units, not "human readable" (otherwise Kibana/Elasticsearch cannot do math on it)
file_size - This also feels similar to the byte_offset field. How are they different?

Other questions:

Should these new fields be put under @metadata?
What would the setting name be for the way to ask for these new fields to be present?

Feb 10 '16 17:02 jordansissel

Maybe we can just add mtime and file_size. The byte_offset is the offset at a given moment, but the actual size of the file may be larger by then. That's why I added both, but maybe the file listener retrieves the size once when the listener is created rather than reading the property from the file again.
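To make the difference between the two values concrete, here is a minimal Ruby sketch. It assumes nothing about filewatch internals: `File.stat` is standard Ruby, while `byte_offset` is only a local variable standing in for the reader's position, and the temp file is made up for illustration.

```ruby
require "tempfile"

# Write a small file; each line ends with a one-byte "\n" delimiter.
file = Tempfile.new("fileinfo-sketch")
file.write("first line\nsecond line\n")
file.flush

stat = File.stat(file.path)
file_size = stat.size   # total bytes in the file right now
mtime     = stat.mtime  # last modification time (a Time)

# Hypothetical reader position: only the first line has been consumed.
byte_offset = "first line".bytesize + "\n".bytesize

# The gap between the two values is how far behind the reader is.
bytes_behind = file_size - byte_offset
```

If the size is read only once at startup, `bytes_behind` goes stale as soon as the file grows, which is the concern raised above.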

Should these new fields be put under @metadata?

+1, all in @metadata

What would the setting name be for the way to ask for these new fields to be present?

fileinfo? In the elasticsearch input we have the same behaviour for the documents, and it's called docinfo, so this will make sense.

Feb 10 '16 17:02 gmoskovicz

In https://github.com/jordansissel/ruby-filewatch/blob/cf60cb421d447581549c5a2f8f736e3c96d74483/lib/filewatch/yielding_tail.rb#L59-L62

I think that we can yield the byte count as well:

          watched_file.buffer_extract(data).each do |line|
            yield(watched_file.path, line)
            @sincedb[watched_file.inode] += (line.bytesize + @delimiter_byte_size)
          end

to something like

          watched_file.buffer_extract(data).each do |line|
            yield(watched_file.path, line, (line.bytesize + @delimiter_byte_size))
            @sincedb[watched_file.inode] += (line.bytesize + @delimiter_byte_size)
          end

The msize should come from when all this is initialized, so we just read the msize only once.
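As a rough illustration of what the consumer could do with the extra yielded value, here is a self-contained sketch. The method `each_line_with_bytes` is a hypothetical stand-in for the tail loop, not the filewatch API, and recording the offset at the start of each line is an assumption; the thread does not settle whether the field should hold the start or the end of the line.

```ruby
# Hypothetical stand-in for the tail loop: yields each line together with
# the number of bytes it occupied in the file, delimiter included.
def each_line_with_bytes(path, data, delimiter = "\n")
  delimiter_byte_size = delimiter.bytesize
  data.split(delimiter).each do |line|
    yield(path, line, line.bytesize + delimiter_byte_size)
  end
end

# Consumer side: accumulate the yielded counts into a running offset.
offset = 0
events = []
each_line_with_bytes("/tmp/example.log", "alpha\nbeta\n") do |path, line, bytes|
  # Assumption: byte_offset is the position where this line starts.
  events << { "path" => path, "message" => line, "byte_offset" => offset }
  offset += bytes
end
```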

Feb 10 '16 17:02 gmoskovicz

@gmoskovicz the way we currently do reads (as an implementation detail, this can change), it feels likely that the file_size and byte_offset are almost always very close, even if the file input is very far behind... hmm..

Is the use case to be able to see how far behind in a given file Logstash is reading? (If not, let me know. Knowing what the desired measurement is for will help us figure out the implementation details.)

Feb 10 '16 18:02 jordansissel

I was talking to Gabriel and prompted this. We were talking about being able to provide an index into the original log file so that users who need to access the original file can rapidly find their way to the log entry. Also we should be able to sort on the field (file_size or byte_offset) in order to programmatically reconstruct the file if necessary.
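A minimal sketch of both uses, assuming each event carries a numeric `byte_offset` that marks where its line starts in the original file. The events and the temp file below are made up for illustration; only `IO#seek`, `IO#gets`, and `sort_by` are standard Ruby.

```ruby
require "tempfile"

lines = ["alpha\n", "beta\n", "gamma\n"]
file = Tempfile.new("offset-sketch")
file.write(lines.join)
file.flush

# Hypothetical events, deliberately out of order, each carrying the byte
# offset at which its line starts.
events = [
  { "message" => "gamma", "byte_offset" => 11 },
  { "message" => "alpha", "byte_offset" => 0 },
  { "message" => "beta",  "byte_offset" => 6 },
]

# Jump straight to one entry in the original file.
found = File.open(file.path, "rb") do |io|
  io.seek(events[0]["byte_offset"])
  io.gets.chomp
end

# Rebuild the file contents by sorting on the offset.
rebuilt = events.sort_by { |e| e["byte_offset"] }
                .map { |e| e["message"] + "\n" }
                .join
```

Sorting only works if the field is numeric, which matches the earlier preference for byte units over human-readable sizes.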

Feb 10 '16 19:02 rferrante1966

@gmoskovicz - this should be no problem. I am rewriting filewatch to, amongst many things, have a richer mechanism to communicate the context of a line to the file input, e.g. size, offset, etc.

Feb 25 '16 18:02 guyboertje

I have been seeing that the offset is going to be added for the last 4 years and I have yet to see that data point in the event fields. Is there truly a plan to do so and any idea of when it may be implemented? Thanks

Apr 21 '20 16:04 MikeSaveItiviti
