logstash-input-file
Expose extra stats from the input file
Nowadays we expose the path of the file that we are reading, so it's presented in the document. My proposal is to expose mtime, byte_offset, and file_size as optional fields in the file input, so that instead of the following JSON:
{
"message" => "Copyright 2012-2015 Elasticsearch",
"@version" => "1",
"@timestamp" => "2016-02-10T17:27:05.525Z",
"host" => "Gabriels-MacBook-Pro.local",
"path" => "/Users/Gabriel/Documents/ElasticSearch/logstash-2.1.1/NOTICE.TXT"
}
We can get:
{
"message" => "Copyright 2012-2015 Elasticsearch",
"@version" => "1",
"@timestamp" => "2016-02-10T17:27:05.525Z",
"mtime" => "2016-02-10T17:27:05.525Z",
"byte_offset" => "1231231",
"file_size" => "12mb",
"host" => "Gabriels-MacBook-Pro.local",
"path" => "/Users/Gabriel/Documents/ElasticSearch/logstash-2.1.1/NOTICE.TXT"
}
Or a new object field extra_stats that contains all this information.
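For example, the nested variant might look like this (same illustrative values as above):

```
"extra_stats" => {
    "mtime" => "2016-02-10T17:27:05.525Z",
    "byte_offset" => "1231231",
    "file_size" => "12mb"
}
```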
@jordansissel what do you think?
+1 on the idea.
Some thoughts on the fields:
- mtime - I feel like this needs to have a more human readable name.
- file_size - My preference is to have this value be a numeric in byte units, not "human readable" (otherwise Kibana/Elasticsearch cannot do math on it)
- file_size - This also feels similar to the byte_offset field. How are they different?
Other questions:
- Should these new fields be put under @metadata?
- What would the setting name be for the way to ask for these new fields to be present?
Maybe we can just add mtime and file_size. The byte_offset is the offset at a certain time, while the actual size of the file may now be higher; that's why I added both. Then again, maybe the file listener retrieves this when the listener is created, rather than reading the property from the file again.
Should these new fields be put under @metadata?
+1, all in @metadata
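Fields under @metadata are not included in the output by default; if a user wants them on the event itself, a mutate filter could promote them. A sketch, assuming the hypothetical field names from this thread:

```
filter {
  mutate {
    # copy the proposed (hypothetical) metadata fields onto the event
    add_field => {
      "byte_offset" => "%{[@metadata][byte_offset]}"
      "file_size" => "%{[@metadata][file_size]}"
    }
  }
}
```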
What would the setting name be for the way to ask for these new fields to be present?
fileinfo? In the elasticsearch input we have the same behaviour for documents, and it's called docinfo, so this would make sense.
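As a sketch, the setting could then be toggled per input (fileinfo is only a proposed name here, nothing implemented):

```
input {
  file {
    path => "/var/log/app.log"
    # hypothetical setting, mirroring the elasticsearch input's docinfo
    fileinfo => true
  }
}
```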
In https://github.com/jordansissel/ruby-filewatch/blob/cf60cb421d447581549c5a2f8f736e3c96d74483/lib/filewatch/yielding_tail.rb#L59-L62
I think that we can yield the byte count as well:
watched_file.buffer_extract(data).each do |line|
  yield(watched_file.path, line)
  @sincedb[watched_file.inode] += (line.bytesize + @delimiter_byte_size)
end
to something like
watched_file.buffer_extract(data).each do |line|
  yield(watched_file.path, line, (line.bytesize + @delimiter_byte_size))
  @sincedb[watched_file.inode] += (line.bytesize + @delimiter_byte_size)
end
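To make the bookkeeping concrete, here is a standalone Ruby sketch (not the plugin's code) of how a per-line byte offset would accumulate under this scheme, assuming a one-byte "\n" delimiter:

```ruby
# Standalone sketch: accumulate a byte offset per line, the way the
# proposed yield change would. Assumes a one-byte "\n" delimiter.
delimiter_byte_size = 1
offset = 0
lines = ["Copyright 2012-2015 Elasticsearch", "All rights reserved"]

events = lines.map do |line|
  event = { "message" => line, "byte_offset" => offset }
  offset += line.bytesize + delimiter_byte_size
  event
end

# The first event starts at offset 0; each subsequent event starts
# after the previous line's bytes plus its delimiter.
events.each { |e| puts "#{e['byte_offset']}: #{e['message']}" }
```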
The msize should come from when all this is initialized, so we only read the msize once.
@gmoskovicz given the way we currently do reads (an implementation detail; this can change), it feels likely that file_size and byte_offset are almost always very close, even if the file input is very far behind... hmm..
Is the use case to be able to see how far behind in a given file Logstash is reading? (If not, let me know; knowing what the desired measurement is for will help us figure out the implementation details.)
I was talking to Gabriel and prompted this. We were talking about being able to provide an index into the original log file, so that users who need to access the original file can rapidly find their way to the log entry. We should also be able to sort on the field (file_size or byte_offset) in order to programmatically reconstruct the file if necessary.
@gmoskovicz - this should be no problem. I am rewriting filewatch to, amongst many other things, have a richer mechanism for communicating the context of a line to the file input, e.g. size, offset, etc.
For the last 4 years I have been seeing that this offset is going to be added, and I have yet to see that data point in the event fields. Is there truly a plan to do so, and any idea of when it may be implemented? Thanks