
Add Event ID

wernerb opened this issue 4 years ago

This changes the API call to FilterLogEventsPages instead of GetLogEvents. The problematic part of this is that NextToken is not present in every response.

To fix this, I reworked the code to use the pagination function from the AWS SDK, which handles each page automatically. We then "move the needle" towards the last event time and use that as the next starting time instead.
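To make that concrete, here is a rough sketch (not the actual PR code) of what I mean by the SDK pagination helper plus moving the needle; the package name is made up and `publish` is a placeholder for the beat's publisher:

```go
package cwsketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

// pullEvents pages through a log group with the SDK's pagination helper and
// returns the timestamp (ms) of the newest event it saw, which the caller can
// persist and use as StartTime on the next run ("moving the needle").
func pullEvents(svc *cloudwatchlogs.CloudWatchLogs, group string, startMillis int64,
	publish func(*cloudwatchlogs.FilteredLogEvent)) (int64, error) {

	input := &cloudwatchlogs.FilterLogEventsInput{
		LogGroupName: aws.String(group),
		StartTime:    aws.Int64(startMillis),
	}
	lastEventTime := startMillis
	err := svc.FilterLogEventsPages(input,
		func(page *cloudwatchlogs.FilterLogEventsOutput, lastPage bool) bool {
			for _, ev := range page.Events {
				publish(ev)
				if ev.Timestamp != nil && *ev.Timestamp > lastEventTime {
					lastEventTime = *ev.Timestamp
				}
			}
			return true // keep paging; the SDK passes NextToken between pages itself
		})
	return lastEventTime, err
}
```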

I want to understand what other purpose the state serves besides deduplication. If deduplication is the concern, then the new Meta._id field will probably fix this: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html I did attempt to change NextToken to LastEventMessage in the state instead.

Could you advise on what to do w.r.t. the multiline buffer? I am unsure whether this works correctly now that LastEventMessage is used as the start time when recovering the registry. But again, I would rather have the beat reprocess logs it might have missed and send them again. I am curious what you think.
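For illustration, a minimal sketch of what carrying the CloudWatch event ID in the event metadata (the Meta._id approach above) could look like. The import paths assume a v7 libbeat; the actual project may vendor an older one, and `toBeatEvent` is a hypothetical helper, not code from this PR:

```go
package cwsketch

import (
	"time"

	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
	"github.com/elastic/beats/v7/libbeat/beat"
	"github.com/elastic/beats/v7/libbeat/common"
)

// toBeatEvent maps a FilteredLogEvent to a libbeat event, carrying the
// CloudWatch EventId in Meta["_id"]. The elasticsearch output uses that value
// as the document _id; other outputs expose it under [@metadata][_id].
func toBeatEvent(ev *cloudwatchlogs.FilteredLogEvent) beat.Event {
	return beat.Event{
		Timestamp: time.Unix(0, *ev.Timestamp*int64(time.Millisecond)),
		Meta:      common.MapStr{"_id": *ev.EventId},
		Fields: common.MapStr{
			"message":    *ev.Message,
			"log_stream": *ev.LogStreamName,
		},
	}
}
```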

Closes #35

wernerb avatar Dec 17 '20 19:12 wernerb

Thanks @wernerb! Some thoughts:

  • first of all, let me say that it'd be great if we're able to get rid of the state saved in the s3 bucket (one dependency less!). It seems to me that this is doable with the approach that you propose (the Meta._id field). As I understand it, when the beat starts, it will process all messages within its horizon and send them to the sink (logstash or ES). The messages that have already been ingested (based on the Meta._id field) will be discarded by ES. Will this work out of the box on the ES side?

  • if the above is true, then I would say that the s3 file is redundant! Indeed, its only purpose is to make sure that we don't process the same message twice. If this deduplication can be performed on the ES side then the only impact would be a little bit more cpu and network traffic when the beat starts up. The magnitude of this impact is dependent on the values of the stream_event_horizon and hot_stream_event_horizon in the config yml (but it shouldn't be that bad for most use cases I suppose).

  • it's not clear to me what we gain by switching to FilterLogEvents from GetLogEvents. You mention that "the problematic part of it is that NextToken is not in every return message". Why is this problematic? I suppose that this is the case where there are currently no more data to be retrieved - wouldn't the same be true with FilterLogEvents (which also uses tokens)? Wouldn't this be the case where lastPage is true in the callback function?

  • as far as multiline buffers are concerned, I am pretty sure that there would be many edge cases where the event would break :) Without thinking that much, I'd dare say that getting rid of the s3 state would be beneficial to multiline events because the re-processing of the messages would result in rescuing multiline events that would not have been properly assembled in other cases.

To sum it up:

  • can we confirm that the Meta._id attribute in the event's payload will work out of the box on all reasonable ES installations (i.e. avoid insertion of duplicate messages)?
  • if yes, then I would propose to kill the s3 registry functionality and reprocess all messages within the configured horizons when the beat starts up (since we have established that we avoid duplicated documents in ES)
  • given the above, would there still be a reason to change our current use of the cloudwatchlogs API (i.e. GetLogEvents)? My argument here is why change something that works fine and face the risk of introducing new bugs in the codebase without getting some benefit back?

Happy to hear feedback on the above!

kkentzo avatar Dec 26 '20 15:12 kkentzo

can we confirm that the Meta._id attribute in the event's payload will work out of the box on all reasonable ES installations (i.e. avoid insertion of duplicate messages)?

For anyone using the elasticsearch output in filebeat this works out of the box. The documentation isn't explicit about it, since the filebeat elasticsearch output doesn't support updating documents anyway, but here an elasticsearch team member confirms what I see in my setup: documents are not overwritten, because the output uses the POST /<target>/_create/<_id> API, which does not allow overwriting existing documents. I have tested this numerous times and it works out of the box when using the elasticsearch output in filebeat!

Now, where things become interesting is with other outputs such as logstash, kafka, and redis: source

When publishing to elasticsearch, the id value will be used for _id. When publishing to logstash/redis/kafka, beats add a @metadata field to the event. This field will contain the id. You can configure the elasticsearch output in Logstash to use [@metadata][id]. This means that there is some extra configuration work involved in logstash (because redis/kafka are always inputs to logstash when you want to get them into elasticsearch), but it is described here: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html#ls-doc-id and my proposal would be that we adopt it in the README.

Elasticsearch can only deduplicate events going to the same index, because obviously a new daily index will not contain the earlier event. This is fine because by default daily-index setups use the original timestamp to decide which index a document goes to, and you can set as large a horizon as you see fit. For setups like mine, where we use ILM, it is more difficult, because indices are rolled over based on index age or index size. See this issue: https://github.com/elastic/elasticsearch/issues/44794. This basically means that when a rollover happens, elasticsearch will just index the documents again because it is a new index. This is not really an issue for our use cases, and even then only a small number of events would end up duplicated with the same timestamp. We really only expect cloudwatchlogsbeat to be down for a few hours at most at any time, so the horizon is rather short, and the probability of a rollover happening in that window is also rather small with large enough indices (50gb) and an index age of a few days.

To summarise the above:

  1. Elasticsearch output in filebeat supports this out of the box
  2. Any other output will carry the _id in metadata and needs to be configured to be used as document_id in logstash as per https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html#ls-doc-id
  3. I would add the logstash caveat to the README, along with a note on setting the horizon in combination with ILM, to explain the tradeoffs.

if yes, then I would propose to kill the s3 registry functionality and reprocess all messages within the configured horizons when the beat starts up (since we have established that we avoid duplicated documents in ES)

It really depends on what you think. The rollover behaviour with ILM can really be an issue, but it is an upstream issue that I expect to be fixed. We could remove the message-based state entirely to make things simpler and keep writing NextToken (if available) to the state.

given the above, would there still be a reason to change our current use of the cloudwatchlogs API (i.e. GetLogEvents)? My argument here is why change something that works fine and face the risk of introducing new bugs in the codebase without getting some benefit back?

This one is easy: it's because GetLogEvents doesn't include event IDs, otherwise I would not have changed it. See http://docs.amazonaws.cn/AmazonCloudWatchLogs/latest/APIReference/API_GetLogEvents.html and https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_FilterLogEvents.html

What is also nice about FilterLogEvents is that a filter pattern can be supplied to filter events server-side (if desirable), and it is possible to query multiple streams in one call instead of issuing lots of GetLogEvents calls, one per stream (see logStreamNamePrefix). It also returns ingestionTime, which could be useful for people.
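As a rough illustration of those extra knobs (group, prefix and pattern are placeholder values, not beat config options):

```go
package cwsketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

// filterInput builds a request that reads every stream whose name starts with
// prefix in a single call and lets CloudWatch drop non-matching messages
// server-side via the filter pattern.
func filterInput(group, prefix, pattern string, startMillis int64) *cloudwatchlogs.FilterLogEventsInput {
	return &cloudwatchlogs.FilterLogEventsInput{
		LogGroupName:        aws.String(group),
		LogStreamNamePrefix: aws.String(prefix),  // query many streams at once
		FilterPattern:       aws.String(pattern), // optional server-side filtering
		StartTime:           aws.Int64(startMillis),
	}
}
```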

I am assuming that when results from GetLogEvents come in, NextToken is only written out after all the events in the response have been handled, because if we have the events in memory then they should be handled, and the beat's buffers will take care of processing everything correctly. The API states that GetLogEvents always returns a NextToken, even when you arrive at the last page, and that on the last page it returns the same NextToken again. How is this case handled in the code? If on the last page the state is written out again, I guess that would cause the last page's events to be repeated in case of failure?
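To make the question concrete, this is the ordering I have in mind, as a sketch rather than the actual beat code (`publish` and `persistToken` are hypothetical stand-ins for the publisher and the registry):

```go
package cwsketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

// drainStream reads one stream with GetLogEvents, publishes every event of a
// page before persisting the token, and stops when the returned
// NextForwardToken repeats, which is how the API signals the last page.
func drainStream(svc *cloudwatchlogs.CloudWatchLogs, group, stream string,
	publish func(*cloudwatchlogs.OutputLogEvent), persistToken func(string)) error {

	input := &cloudwatchlogs.GetLogEventsInput{
		LogGroupName:  aws.String(group),
		LogStreamName: aws.String(stream),
		StartFromHead: aws.Bool(true),
	}
	prevToken := ""
	for {
		out, err := svc.GetLogEvents(input)
		if err != nil {
			return err
		}
		for _, ev := range out.Events {
			publish(ev) // hand everything to the beat's buffers first...
		}
		persistToken(*out.NextForwardToken) // ...and only then record progress
		if *out.NextForwardToken == prevToken {
			return nil // same token again => we were already on the last page
		}
		prevToken = *out.NextForwardToken
		input.NextToken = out.NextForwardToken
	}
}
```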

wernerb avatar Dec 29 '20 12:12 wernerb