graylog2-server
Processor stops processing after a short time
Expected Behavior
Processor should continue to process messages.
Current Behavior
Shortly after the server is restarted, the processor stops processing messages and the journal fills up. We see this in the log:
2022-06-17 11:21:57,189 ERROR: com.google.common.util.concurrent.ServiceManager - Service LocalKafkaMessageQueueReader [FAILED] has failed in the RUNNING state.
java.lang.IllegalStateException: Invalid message size: 0
at org.graylog.shaded.kafka09.log.FileMessageSet.searchFor(FileMessageSet.scala:141) ~[graylog.jar:?]
at org.graylog.shaded.kafka09.log.LogSegment.translateOffset(LogSegment.scala:105) ~[graylog.jar:?]
at org.graylog.shaded.kafka09.log.LogSegment.read(LogSegment.scala:148) ~[graylog.jar:?]
at org.graylog.shaded.kafka09.log.Log.read(Log.scala:506) ~[graylog.jar:?]
at org.graylog2.shared.journal.LocalKafkaJournal.read(LocalKafkaJournal.java:677) ~[graylog.jar:?]
at org.graylog2.shared.journal.LocalKafkaJournal.readNext(LocalKafkaJournal.java:617) ~[graylog.jar:?]
at org.graylog2.shared.journal.LocalKafkaJournal.read(LocalKafkaJournal.java:599) ~[graylog.jar:?]
at org.graylog2.shared.messageq.localkafka.LocalKafkaMessageQueueReader.run(LocalKafkaMessageQueueReader.java:110) ~[graylog.jar:?]
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:67) [graylog.jar:?]
at com.google.common.util.concurrent.Callables$4.run(Callables.java:121) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332]
Oddly enough, if Graylog logging is set to Trace, the issue does not occur; it only happens when logging is set to Debug or below.
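For context on what the exception means: the journal uses the shaded Kafka 0.9 on-disk log format, where each entry is an 8-byte offset and a 4-byte size followed by the message bytes, and the reader walks the segment file entry by entry. The following is a simplified, illustrative sketch of that walk (not the actual shaded FileMessageSet source; names and the size check are assumptions) to show why a zero in the size field is fatal to the scan:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Simplified sketch of the offset scan done when reading the journal:
// each entry is stored as an 8-byte offset plus a 4-byte size header,
// followed by the message bytes. Illustrative only, not the shaded source.
final class JournalScanSketch {
    private static final int LOG_OVERHEAD = 12; // 8-byte offset + 4-byte size

    static long findPosition(FileChannel channel, long targetOffset) throws Exception {
        ByteBuffer header = ByteBuffer.allocate(LOG_OVERHEAD);
        long position = 0;
        long fileSize = channel.size();
        while (position + LOG_OVERHEAD < fileSize) {
            header.clear();
            channel.read(header, position);
            header.flip();
            long offset = header.getLong();
            int messageSize = header.getInt();
            if (offset >= targetOffset) {
                return position;
            }
            // A zero (or otherwise too small) size field means the segment
            // contains a zeroed or truncated region, so the scan cannot advance.
            if (messageSize <= 0) {
                throw new IllegalStateException("Invalid message size: " + messageSize);
            }
            position += LOG_OVERHEAD + messageSize;
        }
        return -1;
    }
}

If that matches what is happening here, a size field of 0 would suggest the segment file on disk contains a zeroed region, which would point more at the storage/filesystem layer than at the reader itself; that is an assumption, not a confirmed diagnosis.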
Possible Solution
Unknown. I have tried multiple things to resolve it, without success.
Steps to Reproduce (for bugs)
- Enable Syslog UDP input
- Send syslog messages (a minimal sender sketch follows after this list)
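In case it helps with reproduction, here is a minimal sketch of a sender pushing RFC 3164-style messages over UDP; the host (127.0.0.1), port (1514) and rate are assumptions and must be adjusted to match the actual Syslog UDP input:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Minimal syslog-over-UDP sender for reproducing the issue.
// Host, port and rate are assumptions; adjust to your Syslog UDP input.
public class SyslogUdpSender {
    public static void main(String[] args) throws Exception {
        InetAddress graylog = InetAddress.getByName("127.0.0.1"); // assumed Graylog host
        int port = 1514;                                          // assumed Syslog UDP input port
        try (DatagramSocket socket = new DatagramSocket()) {
            for (int i = 0; ; i++) {
                // <13> = facility user, severity notice (RFC 3164-style header)
                String msg = "<13>Jun 17 11:21:57 testhost testapp: test message " + i;
                byte[] payload = msg.getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(payload, payload.length, graylog, port));
                Thread.sleep(100); // roughly 10 messages per second
            }
        }
    }
}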
Context
We use Graylog to consume and index UDP syslog messages, mostly from Wildix cloud PBXes. The issue does not seem to be related to the load on the system; it happens at 7000 logs per second or at 10.
Your Environment
Graylog running as a container. The Docker host is UnRAID 6.10.2. Elasticsearch is also running as a container, version 6.6.2 (I have also tried version 7).
- Graylog Version: 4.3.2
- Java Version: openjdk-8 (included in official Graylog docker)
- Elasticsearch Version: 6.6.2
- MongoDB Version: 4.2.21
- Operating System: UnRAID 6.10.2
- Browser version: Chrome 103.0.5060.53
I am having the exact same issue on unraid 6.10.3, even with a very light load (< 5 messages per second)
I have now tried this with a completely fresh and default install of Graylog, ES and Mongo. All new containers with no previous config. Same result.
I am also getting this error after updating Unraid to 6.10.3 from 6.9.2.
So it seems possibly to be linked to the unRAID OS. Any ideas on ways to capture the right data to work out what the issue is? I'm happy to test things and provide logs, etc.
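One idea for capturing useful data, offered only as a sketch: walk the journal's .log segment files with the same 12-byte entry header layout described above and report the first non-positive size field. The default path (data/journal/messagejournal-0) is an assumption; adjust it to wherever the container's message_journal_dir points.

import java.io.File;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Sketch of a segment-file checker: walks each .log file in a journal
// directory using the assumed 12-byte (8-byte offset + 4-byte size) entry
// header layout and reports the first entry with a non-positive size.
// The directory layout and default path are assumptions; adjust as needed.
public class JournalSegmentChecker {
    public static void main(String[] args) throws Exception {
        File dir = new File(args.length > 0 ? args[0] : "data/journal/messagejournal-0");
        File[] segments = dir.listFiles((d, name) -> name.endsWith(".log"));
        if (segments == null || segments.length == 0) {
            System.err.println("No segment files found in " + dir);
            return;
        }
        Arrays.sort(segments);
        for (File segment : segments) {
            try (RandomAccessFile raf = new RandomAccessFile(segment, "r")) {
                long pos = 0;
                long size = raf.length();
                while (pos + 12 <= size) {
                    raf.seek(pos);
                    long offset = raf.readLong();
                    int messageSize = raf.readInt();
                    if (messageSize <= 0) {
                        System.out.printf("%s: invalid size %d at position %d (offset field %d)%n",
                                segment.getName(), messageSize, pos, offset);
                        break;
                    }
                    pos += 12 + messageSize;
                }
            }
        }
    }
}

Running this against an affected journal right after the error appears, and noting the reported segment name and position, might help narrow down whether the segment file itself is damaged.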
Also having this issue. I recently re-built all 3 containers switching to ES 7.10.1 and thought it was something to do with that.
- unraid 6.10.3
- graylog 4.3.2+313b6bc
- elasticsearch 7.10.1
This problem still persists after updating to Graylog 4.3.3.
How can we get the developers to look at this issue?
Hi, I'm trying to reproduce the issue. Are you all using the ES, Mongo and Graylog Apps / Containers provided by the Community App Marketplace? Did the issue only occur after updating to UnraidOS 6.10.3? Thanks, Matthias
Yes, using the apps from the community app marketplace. I've tried both Elasticsearch containers. MongoDB is the one provided by Taddeusz's Repository
The issue does seem to coincide with the upgrade to 6.10.2 for me (I have not tried .3 yet). Before that, Graylog was ingesting thousands of logs per second with no issues.
I'm using docker compose as described here: https://whitematter.tech/posts/run-graylog-with-docker-compose-on-unraid/
Repositories are:
- mongo:5.0
- docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
- graylog/graylog:4.3.3
Prior to updating to Unraid 6.9.2, there were no issues. Currently I'm running with logging set to Trace, since the problem does not happen when logging is set to Trace. Let me know if you need any more information!
Would it be possible to get us the content of your data/journal directory once you've triggered that issue?
Be aware that this could contain some of your log data, so only share if that's okay with you.
You could also send it by mail
@mpfz0r let me know the best way to share privately and I can get mine sent to you. Looks like it's 56MB.
@Kieffer87 could send me a mail to marco at graylog.org ? I will be in touch
If you need a second example of the data/journal directories, let me know!
@mpfz0r email sent.
@moesterheld thanks for your help with this. Just in case it is relevant, I'm seeing the issue even if I have logging set to trace.
I just want to report that this is still an issue after upgrading to Graylog 4.3.4
I have the same issue, and the same situation as above: Unraid 6.10.3 with docker compose.
To be a bit more specific, I only started using Graylog yesterday, so it is a completely new install with the exact same issue.
Does your journal share use the cache pool of UnraidOS? Could you maybe try to move it to a different share without caching? Does this only occur with Syslog UDP inputs? Could you maybe also try a Random HTTP message generator input?
Thank you, Matthias
My journal does indeed reside on an Unraid "cache pool", but that particular share is set to "Cache:prefer". This means it permanently lives on the cache pool and is never moved (this is typical for Unraid). I will try moving it if you think it will make a difference, but I wanted to make sure you understand that the journal files are not being moved around at any time.
As for my inputs, I currently have the following enabled and receiving messages:
- Beats
- GELF TCP
- GELF UDP
- Syslog TCP
- Syslog UDP
It's not clear that any of these in particular trigger the error, but I can try to take a look at that today.
Are you suggesting that I disable all inputs, and then create a Random HTTP message input to see if the error is still triggered?
I moved the journal to the array directly to see if that makes a difference.
I only use Syslog UDP, however.
I will get back to you if I have an update.
If it is possible for you to move the share to use the filesystem, that would be great. Just want to make sure the cache is not the culprit here. As for the inputs, it would be great if we could make sure it's not connected to a particular input type. The Random HTTP would be useful since there is no networking involved to see if we can reproduce the error with this input type alone.
I have also set up a test UnraidOS system now, which has been happily ingesting 60 syslog messages / minute for the last two hours (also using a cached share). Syslogs are pushed in from the UnraidOS host itself and another Docker container. The problem should probably already have occurred, right? It is a very simple system with a single xfs data disk, no parity disk and only the Graylog environment running. Maybe we can figure out if a configuration change causes the issue.
Respectfully, I think you may be misunderstanding how Unraid cache works. The cache pool IS part of the file system... it's just another drive (usually an SSD for the speed benefits). If you set up a simple system with only a single data drive, then you will not have any caching regardless of your share cache settings.
For the sake of ruling out potential issues, I will do the following:
- Move the journal to a share on my array. I will run with the existing inputs and see if the error occurs.
- Then, I will disable all of the inputs and add a Random HTTP input and see if the error occurs.
Thanks for your help with this... I'll report back with my findings!
@tkohhh Thanks for the input. I was thinking Unraid is using an in-memory cache.
Here are my findings:
- Moving the journal to the array (instead of the cache pool) had no effect. I received the error after about 15 minutes.
- Disabling all of my inputs and enabling a Random HTTP input has been running for 30 minutes now without the error.
I'm starting to wonder if it's the Unraid syslog that is causing the error. I have the Unraid syslog set up to use syslog TCP. Did you set it up with TCP or UDP on your test system?
I'm going to let just the Random HTTP run for a while longer just to make sure, and then I will do the following:
- Disable remote logging in Unraid settings, and then re-enable all of my Graylog inputs.
- If that runs without the error, then I will re-enable remote logging in Unraid, but use UDP instead of TCP.
I have set it up using UDP.
OK... that may be good news. I've been running for the last hour with the Unraid remote logging disabled, and no error. I'm going to turn it on with UDP now and see what happens.
To the other users that have reported this error: is your Unraid remote logging using TCP or UDP?
@moesterheld can you switch to TCP and see if you get the error?
Hi, I have this issue on 2 unRAID servers. One has remote syslogging enabled using UDP to the Graylog docker. The other has syslogging disabled. 🤷♀️
UDP did generate the error after about 90 minutes. I'm going back to disabling the Unraid remote logging. I'll let it run for longer this time to see if I get the error.
I can now confirm that the error does indeed occur even with the Unraid remote logging disabled. I received the error about 55 minutes after restarting Graylog.
So, there goes that theory!
@moesterheld what other information can we provide to help get to the bottom of this?