graylog2-server
Processor stops processing after a short time
Expected Behavior
Processor should continue to process messages.
Current Behavior
Shortly after the server is restarted, the processor stops processing messages and the journal fills up. We see this in the log:
2022-06-17 11:21:57,189 ERROR: com.google.common.util.concurrent.ServiceManager - Service LocalKafkaMessageQueueReader [FAILED] has failed in the RUNNING state.
java.lang.IllegalStateException: Invalid message size: 0
at org.graylog.shaded.kafka09.log.FileMessageSet.searchFor(FileMessageSet.scala:141) ~[graylog.jar:?]
at org.graylog.shaded.kafka09.log.LogSegment.translateOffset(LogSegment.scala:105) ~[graylog.jar:?]
at org.graylog.shaded.kafka09.log.LogSegment.read(LogSegment.scala:148) ~[graylog.jar:?]
at org.graylog.shaded.kafka09.log.Log.read(Log.scala:506) ~[graylog.jar:?]
at org.graylog2.shared.journal.LocalKafkaJournal.read(LocalKafkaJournal.java:677) ~[graylog.jar:?]
at org.graylog2.shared.journal.LocalKafkaJournal.readNext(LocalKafkaJournal.java:617) ~[graylog.jar:?]
at org.graylog2.shared.journal.LocalKafkaJournal.read(LocalKafkaJournal.java:599) ~[graylog.jar:?]
at org.graylog2.shared.messageq.localkafka.LocalKafkaMessageQueueReader.run(LocalKafkaMessageQueueReader.java:110) ~[graylog.jar:?]
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:67) [graylog.jar:?]
at com.google.common.util.concurrent.Callables$4.run(Callables.java:121) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332]
Oddly enough, if Graylog logging is set to Trace, the issue does not occur; it only happens when logging is set to Debug or below.
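For context on what the exception means: the journal uses the shaded Kafka 0.9 on-disk log format, where each entry is an 8-byte offset and a 4-byte size followed by the message bytes, and the reader walks the segment file entry by entry. The following is a simplified, illustrative sketch of that walk (not the actual shaded FileMessageSet source; names and the size check are assumptions) to show why a zero in the size field is fatal to the scan:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Simplified sketch of the offset scan done when reading the journal:
// each entry is stored as an 8-byte offset plus a 4-byte size header,
// followed by the message bytes. Illustrative only, not the shaded source.
final class JournalScanSketch {
    private static final int LOG_OVERHEAD = 12; // 8-byte offset + 4-byte size

    static long findPosition(FileChannel channel, long targetOffset) throws Exception {
        ByteBuffer header = ByteBuffer.allocate(LOG_OVERHEAD);
        long position = 0;
        long fileSize = channel.size();
        while (position + LOG_OVERHEAD < fileSize) {
            header.clear();
            channel.read(header, position);
            header.flip();
            long offset = header.getLong();
            int messageSize = header.getInt();
            if (offset >= targetOffset) {
                return position;
            }
            // A zero (or otherwise too small) size field means the segment
            // contains a zeroed or truncated region, so the scan cannot advance.
            if (messageSize <= 0) {
                throw new IllegalStateException("Invalid message size: " + messageSize);
            }
            position += LOG_OVERHEAD + messageSize;
        }
        return -1;
    }
}

If that matches what is happening here, a size field of 0 would suggest the segment file on disk contains a zeroed region, which would point more at the storage/filesystem layer than at the reader itself; that is an assumption, not a confirmed diagnosis.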
Possible Solution
Unknown. I have tried multiple things to resolve it, without success.
Steps to Reproduce (for bugs)
- Enable Syslog UDP input
- Send syslog messages (a minimal sender sketch follows after this list)
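In case it helps with reproduction, here is a minimal sketch of a sender pushing RFC 3164-style messages over UDP; the host (127.0.0.1), port (1514) and rate are assumptions and must be adjusted to match the actual Syslog UDP input:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Minimal syslog-over-UDP sender for reproducing the issue.
// Host, port and rate are assumptions; adjust to your Syslog UDP input.
public class SyslogUdpSender {
    public static void main(String[] args) throws Exception {
        InetAddress graylog = InetAddress.getByName("127.0.0.1"); // assumed Graylog host
        int port = 1514;                                          // assumed Syslog UDP input port
        try (DatagramSocket socket = new DatagramSocket()) {
            for (int i = 0; ; i++) {
                // <13> = facility user, severity notice (RFC 3164-style header)
                String msg = "<13>Jun 17 11:21:57 testhost testapp: test message " + i;
                byte[] payload = msg.getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(payload, payload.length, graylog, port));
                Thread.sleep(100); // roughly 10 messages per second
            }
        }
    }
}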
Context
We use Graylog to consume and index UDP syslog messages, mostly from Wildix cloud PBXes. The issue does not seem to be related to the load on the system; it happens at 7000 logs per second or at 10.
Your Environment
Graylog running as a container. The Docker host is UnRAID 6.10.2. Elasticsearch is also running as a container, version 6.6.2 (I have also tried version 7).
- Graylog Version: 4.3.2
- Java Version: openjdk-8 (included in official Graylog docker)
- Elasticsearch Version: 6.6.2
- MongoDB Version: 4.2.21
- Operating System: UnRAID 6.10.2
- Browser version: Chrome 103.0.5060.53
I am having the exact same issue on unraid 6.10.3, even with a very light load (< 5 messages per second)
I have now tried this with a completely fresh and default install of Graylog, ES and Mongo. All new containers with no previous config. Same result.
I am also getting this error after updating Unraid to 6.10.3 from 6.9.2.
So it seems possibly to be linked to the unRAID OS. Any ideas on ways to capture the right data to work out what the issue is? I'm happy to test things and provide logs, etc.
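One idea for capturing useful data, offered only as a sketch: walk the journal's .log segment files with the same 12-byte entry header layout described above and report the first non-positive size field. The default path (data/journal/messagejournal-0) is an assumption; adjust it to wherever the container's message_journal_dir points.

import java.io.File;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Sketch of a segment-file checker: walks each .log file in a journal
// directory using the assumed 12-byte (8-byte offset + 4-byte size) entry
// header layout and reports the first entry with a non-positive size.
// The directory layout and default path are assumptions; adjust as needed.
public class JournalSegmentChecker {
    public static void main(String[] args) throws Exception {
        File dir = new File(args.length > 0 ? args[0] : "data/journal/messagejournal-0");
        File[] segments = dir.listFiles((d, name) -> name.endsWith(".log"));
        if (segments == null || segments.length == 0) {
            System.err.println("No segment files found in " + dir);
            return;
        }
        Arrays.sort(segments);
        for (File segment : segments) {
            try (RandomAccessFile raf = new RandomAccessFile(segment, "r")) {
                long pos = 0;
                long size = raf.length();
                while (pos + 12 <= size) {
                    raf.seek(pos);
                    long offset = raf.readLong();
                    int messageSize = raf.readInt();
                    if (messageSize <= 0) {
                        System.out.printf("%s: invalid size %d at position %d (offset field %d)%n",
                                segment.getName(), messageSize, pos, offset);
                        break;
                    }
                    pos += 12 + messageSize;
                }
            }
        }
    }
}

Running this against an affected journal right after the error appears, and noting the reported segment name and position, might help narrow down whether the segment file itself is damaged.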
Also having this issue. I recently re-built all 3 containers switching to ES 7.10.1 and thought it was something to do with that.
- unraid 6.10.3
- graylog 4.3.2+313b6bc
- elasticsearch 7.10.1
This problem still persists after updating to Graylog 4.3.3.
How can we get the developers to look at this issue?
Hi, I'm trying to reproduce the issue. Are you all using the ES, Mongo and Graylog Apps / Containers provided by the Community App Marketplace? Did the issue only occur after updating to UnraidOS 6.10.3? Thanks, Matthias
Yes, using the apps from the community app marketplace. I've tried both Elasticsearch containers. MongoDB is the one provided by Taddeusz's Repository
The issue does seem to coincide with the upgrade to 6.10.2 for me (I have not tried .3 yet). Before that, Graylog was ingesting thousands of logs per second with no issues.
I'm using docker compose as described here: https://whitematter.tech/posts/run-graylog-with-docker-compose-on-unraid/
Repositories are:
- mongo:5.0
- docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
- graylog/graylog:4.3.3
Prior to updating to Unraid 6.9.2, there were no issues. Currently I'm running with logging set to Trace, since the problem does not happen when logging is set to Trace. Let me know if you need any more information!
Would it be possible to get us the content of your data/journal directory once you've triggered that issue?
Be aware that this could contain some of your log data, so only share if that's okay with you.
You could also send it by mail
@mpfz0r let me know the best way to share privately and I can get mine sent to you. Looks like it's 56MB.
@Kieffer87 could send me a mail to marco at graylog.org ? I will be in touch
If you need a second example of the data/journal directories, let me know!
@mpfz0r email sent.
@moesterheld thanks for your help with this. Just in case it is relevant, I'm seeing the issue even if I have logging set to trace.
I just want to report that this is still an issue after upgrading to Graylog 4.3.4
I have the same issue, and the same situation as above: Unraid 6.10.3 with docker compose.
To be a bit more specific, I only started using Graylog yesterday, so it is a completely new install with the exact same issue.
Does your journal share use the cache pool of UnraidOS? Could you maybe try to move it to a different share without caching? Does this only occur with Syslog UDP inputs? Could you maybe also try a Random HTTP message generator input?
Thank you, Matthias
My journal does indeed reside on an Unraid "cache pool", but that particular share is set to "Cache:prefer". This means it permanently lives on the cache pool and is never moved (this is typical for Unraid). I will try moving it if you think it will make a difference, but I wanted to make sure you understand that the journal files are not being moved around at any time.
As for my inputs, I currently have the following enabled and receiving messages:
- Beats
- GELF TCP
- GELF UDP
- Syslog TCP
- Syslog UDP
It's not clear that any of these in particular trigger the error, but I can try to take a look at that today.
Are you suggesting that I disable all inputs, and then create a Random HTTP message input to see if the error is still triggered?
I moved the journal to the array directly to see if that makes a difference.
I only use Syslog UDP, however.
I will get back to you if I have an update.
If it is possible for you to move the share to use the filesystem, that would be great. Just want to make sure the cache is not the culprit here. As for the inputs, it would be great if we could make sure it's not connected to a particular input type. The Random HTTP would be useful since there is no networking involved to see if we can reproduce the error with this input type alone.
I have also set up a test UnraidOS system now, which has been happily ingesting 60 syslog messages / minute for the last two hours (also using a cached share). Syslogs are pushed in from the UnraidOS host itself and another Docker container. The problem should probably already have occurred, right? It is a very simple system with a single xfs data disk, no parity disk and only the Graylog environment running. Maybe we can figure out if a configuration change causes the issue.
Respectfully, I think you may be misunderstanding how Unraid cache works. The cache pool IS part of the file system... it's just another drive (usually an SSD for the speed benefits). If you set up a simple system with only a single data drive, then you will not have any caching regardless of your share cache settings.
For the sake of ruling out potential issues, I will do the following:
- Move the journal to a share on my array. I will run with the existing inputs and see if the error occurs.
- Then, I will disable all of the inputs and add a Random HTTP input and see if the error occurs.
Thanks for your help with this... I'll report back with my findings!
@tkohhh Thanks for the input. I was thinking Unraid is using an in-memory cache.
Here are my findings:
- Moving the journal to the array (instead of the cache pool) had no effect. I received the error after about 15 minutes.
- Disabling all of my inputs and enabling a Random HTTP input has been running for 30 minutes now without the error.
I'm starting to wonder if it's the Unraid syslog that is causing the error. I have the Unraid syslog set up to use syslog TCP. Did you set it up with TCP or UDP on your test system?
I'm going to let just the Random HTTP run for a while longer just to make sure, and then I will do the following:
- Disable remote logging in Unraid settings, and then re-enable all of my Graylog inputs.
- If that runs without the error, then I will re-enable remote logging in Unraid, but use UDP instead of TCP.
I have set it up using UDP.
OK... that may be good news. I've been running for the last hour with the Unraid remote logging disabled, and no error. I'm going to turn it on with UDP now and see what happens.
To the other users that have reported this error: is your Unraid remote logging using TCP or UDP?
@moesterheld can you switch to TCP and see if you get the error?
Hi, I have this issue on 2 unRAID servers. One has remote syslogging enabled using UDP to the Graylog docker. The other has syslogging disabled. 🤷♀️
UDP did generate the error after about 90 minutes. I'm going back to disabling the Unraid remote logging. I'll let it run for longer this time to see if I get the error.
I can now confirm that the error does indeed occur even with the Unraid remote logging disabled. I received the error about 55 minutes after restarting Graylog.
So, there goes that theory!
@moesterheld what other information can we provide to help get to the bottom of this?