graylog2-server

Provide a way to debug indexer failures

Open dploeger opened this issue 7 years ago • 19 comments

Expected Behavior

Graylog reports indexer failures with an index name and a letter ID, but doesn't store the failed message anywhere. This makes it hard to debug and fix those failures.

Current Behavior

Currently there is no way of debugging indexer failures. It's more of a guessing game.

Possible Solution

One way would be to use MongoDB to also store the failed raw message alongside the indexer failure. This would obviously increase the MongoDB load. Another way would be a ring buffer on disk.

Context

Every time you point a new application's logging at Graylog, you're somewhat guessing at what may get logged, but these guesses can be wrong and lead to indexer failures. If you onboard a lot of new applications in one go, you have a hard time finding the actual failed messages.

Your Environment

  • Graylog Version: 2.2.3
  • Elasticsearch Version: 2.3.5
  • MongoDB Version: 3.4.0
  • Operating System: Ubuntu 16.04
  • Browser version: Chrome 64.0.3282.140 on macOS 10.13.3

dploeger avatar Feb 14 '18 08:02 dploeger

@dploeger Indexing failures are stored in MongoDB (in the index_failures collection) and can be viewed in the Graylog web interface on the System/Overview page.
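For example, the raw failure records can be inspected directly with the mongo shell. A minimal sketch, assuming the default database name graylog (the entries shown in the web interface come from this collection):

mongo graylog --quiet --eval 'db.index_failures.find().sort({timestamp: -1}).limit(5).pretty()'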

Storing the complete failed message might lead to problems for Graylog clusters that don't have enough resources, so we removed that feature in Graylog 2.0.0 (see #1498).

As an example, think of a 10 MiB message which cannot be indexed and which is sent to Graylog multiple times (e.g. 1000 times). This would make MongoDB explode pretty quickly.

FWIW, we could probably add some more details about the failed messages, such as the input and the remote IP address of the sender.

joschi avatar Feb 14 '18 08:02 joschi

@joschi Yes, the input and the source would be great as a start. Another possibility would be a UI to look inside the message processor. That way you could perhaps observe failed messages while they are processed. I understand you're using Kafka there?

dploeger avatar Feb 14 '18 08:02 dploeger

@dploeger No, Graylog is using the journal implementation of Apache Kafka internally (for persisting messages between the input and the processing buffers) but not the complete package.
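For reference, that on-disk journal is controlled by a handful of server.conf settings. A minimal sketch with the usual defaults (check the example config shipped with your version; the journal path shown here is the one used by the deb/rpm packages):

# enable the disk journal between the inputs and the processing buffers
message_journal_enabled = true
# where the journal segments are stored on disk
message_journal_dir = /var/lib/graylog-server/journal
# upper bound on journal size before old segments are dropped
message_journal_max_size = 5gb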

joschi avatar Feb 14 '18 08:02 joschi

@joschi Hm, okay. Then the UI will be a bit more complicated. 😄

So:

  • good: source and input added to the index_failures collection
  • best: a view into message processing

Thanks!

dploeger avatar Feb 14 '18 08:02 dploeger

Even saved messages with a tunable length cap would be preferable to no information at all. With more than a trivial number of inputs and messages, it's nearly impossible to address those indexing errors. You would also only need to keep a single copy per error message key (I have nearly 1000 index failures with the exact same error message).

twall avatar Mar 21 '18 15:03 twall

Think I'm coming up against this issue too. Have a large number of these failures:

{"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field="entity" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[76, 105, 115, 116, 40, 79, 102, 102, 101, 114, 101, 100, 65, 112, 112, 111, 105, 110, 116, 109, 101, 110, 116, 40, 50, 48, 49, 56, 45, 49]...', original message: bytes can be at most 32766 in length; got 34590","caused_by":{"type":"max_bytes_length_exceeded_exception","reason":"bytes can be at most 32766 in length; got 34590"}}`

In the last five minutes I have 20 different sources logging messages with an 'entity' field. Would be great to be able to stop these.

apjm avatar Oct 12 '18 12:10 apjm

@apjm I guess you can only strip the field "entity" in your installation. That's what I did with a "stacktrace" field that got too big. Just trim it down to 32766 characters max.

dploeger avatar Oct 12 '18 12:10 dploeger

Is it not possible to store these messages? Only the last 5, and shortened to 32k? Or to write fields to the log as defined in a configuration? My logs come from a web farm, and each "source" has multiple web apps on board, so a source and input field in the log would not be helpful in my case (the apps use different field types).

Jan-Zimmermann avatar Feb 14 '19 06:02 Jan-Zimmermann

I'm also affected by this, and having no means whatsoever to even get an idea of what the offending event is makes this way harder than it needs to be. Either being able to view real-time processing of inputs or at least logging the first n characters of an event that caused an indexing failure would help tremendously.

megamorf avatar Jul 12 '19 10:07 megamorf

Is there anything new about this issue?

juemue avatar Oct 28 '19 15:10 juemue

I have a related question: is it possible to get an indexer failures metric?

I tried org_graylog2_rest_resources_system_indexer_FailuresResource_count from the Prometheus plugin, but it doesn't look like it's the same as the indexer failures count in System -> Overview.

I use this plugin https://github.com/graylog-labs/graylog-plugin-metrics-reporter

P.S. For now, I could only get this metric from MongoDB. This query returns the number of indexing failures in the last 5 minutes:

mongo graylog --quiet --eval 'db.index_failures.count() - db.index_failures.find({timestamp: {$lt: new Date((new Date())-1000*60*5)}}).count()'
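If you need it in Prometheus anyway, one workaround is to wrap that query in a small script for node_exporter's textfile collector. A sketch, assuming the collector directory below (adjust it to however your node_exporter is configured) and a made-up metric name graylog_index_failures_5m:

#!/bin/sh
# write the 5-minute indexing failure count as a Prometheus gauge
DIR=/var/lib/node_exporter/textfile_collector
COUNT=$(mongo graylog --quiet --eval 'db.index_failures.count() - db.index_failures.find({timestamp: {$lt: new Date((new Date())-1000*60*5)}}).count()')
# write to a temp file first so the collector never reads a half-written file
echo "graylog_index_failures_5m ${COUNT}" > "${DIR}/graylog_index_failures.prom.$$"
mv "${DIR}/graylog_index_failures.prom.$$" "${DIR}/graylog_index_failures.prom"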

Nklya avatar Nov 20 '19 14:11 Nklya

Any updates regarding this?

In our scenario we have hundreds of log sources, dozens of which send logs in JSON format where fields are extracted automatically. Currently it is nearly impossible to troubleshoot indexing failures.

Storing the complete failed message might lead to problems for Graylog clusters that don't have enough resources...

This is not an acceptable explanation.

  • The number of failed messages stored could be limited/configurable.
  • As a bare minimum, the metadata of the failed message (source, input) should be stored.

hulkk avatar Apr 29 '20 08:04 hulkk

I believe this is a really important feature that lots of users need in order to find out what causes their logs to be incomplete.

In my case I'm seeing lots of max_bytes_length_exceeded_exception errors on the ES side, and there is absolutely no way to figure out what that message was or where it came from. I'm just losing my logs, which are probably important ones that had lots of data!

As @joschi said two years ago:

FWIW, we could probably add some more details about the failed messages, such as the input and the remote IP address of the sender.

I also agree that these would be a good and probably easy first step to let users debug their setups.

alitoufighi avatar Jun 17 '20 11:06 alitoufighi

Currently it is impossible to troubleshoot log payload related errors.

I'm struggling with lots of the following events:

2020-07-02T16:56:21.210+03:00 WARN [GelfCodec] GELF message <redacted> (received from <loadbalancer-ip:26444>) is missing mandatory "host" field.

2020-07-02T16:56:20.549+03:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=redacted, journalOffset=41919444374, codec=gelf, payloadSize=637, timestamp=2020-07-02T13:56:20.549Z, remoteAddress=/loadbalancer-ip:28843}
java.lang.IllegalArgumentException: GELF message <redacted> (received from <loadbalancer-ip:28843>) has empty mandatory "short_message" field.
	at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:258) ~[graylog.jar:?]
	at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:140) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:150) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:91) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:90) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:47) [graylog.jar:?]
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

I don't understand why the message GUID is logged, because you can't use the GUID anywhere. One should be able to configure what content gets logged when these errors occur.

hulkk avatar Jul 02 '20 14:07 hulkk

@apjm I guess you can only strip the field "entity" in your installation. That's what I did with a "stacktrace" field that got too big. Just trim it down to 32766 characters max.

Do you have an example I could follow? Still having this issue in mid-2022.

jasonrberk avatar Jun 22 '22 18:06 jasonrberk

I figured this out, for anyone still fighting this issue in 2022 on GL4:

[screenshot attached: Screen Shot 2022-06-23 at 11 29 29 AM]

jasonrberk avatar Jun 23 '22 15:06 jasonrberk

Hey @jasonrberk, that's great. Your workaround for truncating the log reminded me of an alternative: you can use pipelines, specifically the substring() function, to keep only the first 32766 characters. However, I had unsatisfactory results with this when my logs contained UTF-8 characters (the ES limit is in bytes, while substring() counts characters). This way you can also debug() the complete message for further inspection, since such a large log may indicate that something is wrong.
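Roughly something like this, as an untested minimal sketch: the field name "message" is just a placeholder, substring() is called with positional arguments because I didn't double-check its parameter names, the regex is only a rough character-count check (the ES limit is in bytes), and for multi-line fields you would have to flatten newlines first since . doesn't match them by default.

rule "truncate-oversized-field"
when
    has_field("message")
    AND regex(pattern: "^.{32766,}$", value: to_string($message.message)).matches == true
then
    // keep only the first 32000 characters and flag the message
    set_field("message", substring(to_string($message.message), 0, 32000));
    set_field("message_truncated", true);
    // log which source produced the oversized message
    debug(concat("truncated oversized message from ", to_string($message.source)));
end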

alitoufighi avatar Jun 23 '22 15:06 alitoufighi

@alitoufighi

here's a pipeline rule for "trimming" enormous stack traces:

rule "truncate-stacks"
when    
    has_field("StackTrace")
    AND regex(
        pattern: "^.{32000,}$", 
        value: replace(
            value: to_string($message.StackTrace), 
            search: "\n",
            replacement: "|"
        )
    ).matches == true
then
    let trimmed = abbreviate(value: to_string($message.StackTrace), width: 32000);
    set_field("StackTrace", trimmed);
    set_field("StackTraceTruncated", true);
    // added debug message to be notified about the trimmed stacktrace
    debug( concat("truncated oversized StackTrace from ", to_string($message.source)));
end
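To use it, attach the rule to a pipeline that is connected to the stream carrying these messages, so it runs during processing, before the messages are written to Elasticsearch.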

jasonrberk avatar Aug 02 '22 21:08 jasonrberk

We added a related feature in Graylog 4.2 (available in Graylog Operations and Graylog Cloud):

https://docs.graylog.org/docs/indexer-and-processing-failures

It allows for storing indexing and processing failure notifications in a dedicated Elasticsearch/OpenSearch failure index. With a dedicated stream for such failure notifications, these messages can be logged and aggregated meaningfully into a dashboard, used to set up alert notifications, and more. It is also possible to keep the full original message that failed.

boosty avatar Aug 04 '22 11:08 boosty