Logstash stops processing when using persistent queues
- LS version: 6.2.2
- Operating System: Ubuntu 16.04
- Kafka: 2.11-0.11.0.2
- Config File: https://gist.github.com/ceeeekay/1dfe7ae18cadd0b903c859b106940462
- Sample Data: any
- Steps to Reproduce: Enable persistent queues on previously working pipeline
I've updated logstash.yml to use small persistent queues for a bit of extra redundancy, and have run into a problem where individual nodes will just stop processing messages after a random amount of time.
Kafka also shows the partition they're reading from as stopped.
The nodes themselves are still doing something, as they are still producing metrics, so I suspect it's related to the Kafka input somehow. I can't be sure this is the case though, so I'm raising this here.
Once a node stops, the remaining nodes will pick up the messages from its Kafka partition after a ~5 minute delay; however, if left long enough, all the nodes eventually stop processing.
I really have no idea where to start looking as there is nothing of note in logstash-plain.log.
logstash.yml changes are:
queue.type: persisted
queue.page_capacity: 10mb
queue.max_bytes: 10mb
queue.checkpoint.writes: 256
If I disable persistent queues, Logstash goes back to behaving itself. Enable queues again, and I get more stoppages.
This has just happened again, with 6.3.1 this time, although it's taken about two days for nodes to stop processing messages (since they were last restarted).
Restart and all is well, although the entire pipeline was stopped long enough for me to be paged.
[edit] - To be clear, persistent queues had only just been re-enabled since opening this issue, and processing stopped again within a few days.
4 months later and not even a reply to this issue.
This issue is still occurring with 6.4.3, and makes persistent queues completely unusable with a Kafka input.
Queueing aside, persistent queues are the only way to get delivery guarantees from Logstash, so this is a pretty big problem.
Bump
Anyone? It's been over 6 months of silence on this issue.
Hello, we have a similar issue with Logstash 6.7.1, with a TCP input and Elasticsearch output.
The problem only occurs on our environment that ingests a lot of logs; the others don't have the issue.
After a few days, no more logs are ingested by Elasticsearch. There are no errors in either the ES or LS logs. When we restart Logstash, logs are ingested again.
Just pointing out that this is still broken in 7.4.2.
Everything operates fine for a while, and then it stops processing for no reason, with no visible error. Recently it happened to a node with only a Beats input, so combined with @rmillet-rs's report, I don't think it has much to do with the type of input.
Even on a test system with a low event rate (6 events/s) it won't last a week without stalling.
Any update here? I am seeing the same issue with Logstash 7.5.2.
Hello,
I might have the same issue here, but it might also be something else. ELK stack is at version 7.6.1.
I have Logstash with PQ enabled, and once the persistent queue gets full, NO logs are processed from then on. The CPU is also stuck at 100%, as if it's processing something but can't actually output it.
If I empty the PQ (delete the queue folder), everything goes back to normal for a while, meaning logs are processed as expected. But as soon as the PQ fills up again, ingestion is stuck.
No new events are received by Logstash (as the PQ is full), but nothing is processed from the PQ either, causing the whole stack to stall. That is the key point for me: I don't get why Logstash doesn't process events from the persistent queue. There are no details in Logstash's logs or Elasticsearch's.
I guess this might be due to the logs overwhelming the Logstash instance, but does that make sense?
EDIT: Just an update: I have the same issue even without the persistent queue enabled. I guess mine is a different issue if you don't see any problems without PQ.
Hi all,
I think I might have found a solution to my issue by tweaking the settings and the pipeline (it's been 3 days straight without the issue, which is a milestone for me). If you want to give them a try, here you go:
- Add -Djruby.regexp.interruptible=true to jvm.options
- Add the parameter timeout_millis => 10000 to each grok or kv filter (see the sketch below)
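A minimal sketch of what the filter change looks like (the grok pattern and the kv source field are placeholders, not taken from my actual pipeline):
filter {
  grok {
    # placeholder pattern - replace with your own match
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    # give up on a pathological match after 10 seconds (the default is 30000 ms)
    timeout_millis => 10000
  }
  kv {
    # placeholder source field
    source => "message"
    # same timeout for kv parsing
    timeout_millis => 10000
  }
}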
Hope that works for you too. Regards
Hi everyone.
Problem: same - Logstash stops processing when using persistent queues.
Configuration:
Logstash 7.11.1
input {
  file {
    path => "${LOGSTASH_HOME}/request-tracing.log"
    start_position => "beginning"
    sincedb_path => "${LOGSTASH_HOME}/data/plugins/inputs/file/.request-tracking-file"
  }
}
filter {}
output {
  stdout {}
}
- pipeline.id: request
  pipeline.workers: 1
  pipeline.batch.size: 256
  path.config: "${LOGSTASH_HOME}/config/conf.d/*request.conf"
  queue.type: persisted
  queue.max_bytes: 8mb
Reason: queue.page_capacity > queue.max_bytes (the default queue.page_capacity of 64mb is larger than the 8mb queue.max_bytes above). Solution: set queue.page_capacity to a value less than queue.max_bytes (keep the defaults in mind).
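For example, keeping the 8mb cap from the config above, a sketch of per-pipeline settings that satisfy this constraint (the 2mb page capacity is only an illustration):
  queue.type: persisted
  queue.page_capacity: 2mb  # kept below queue.max_bytes; the default page_capacity is 64mb
  queue.max_bytes: 8mb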
Use the internal monitoring API of LS to check it. For instance http://localhost:9600/_node/stats/pipelines?pretty
This lists all pipelines, and in the response JSON you have the queue details for the persistent queue. So the example in the docs of 10mb will never work if you do not change queue.page_capacity to less than 10mb. In my case I left queue.page_capacity at its default (64mb) and used 192mb as queue.max_bytes.
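A quick way to pull just the queue section for each pipeline (assuming curl and jq are available, and the API is listening on the default localhost:9600):
curl -s 'http://localhost:9600/_node/stats/pipelines' | jq '.pipelines[].queue'  # assumes default API port 9600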
"queue" : {
"data" : {
"free_space_in_bytes" : 53596934144,
"storage_type" : "xfs",
"path" : "/var/lib/logstash/queue/tomcat-access-logs"
},
"events" : 0,
"type" : "persisted",
"capacity" : {
"max_unread_events" : 0,
"page_capacity_in_bytes" : 67108864,
"queue_size_in_bytes" : 10555,
"max_queue_size_in_bytes" : 201326592
},
"events_count" : 0,
"queue_size_in_bytes" : 10555,
"max_queue_size_in_bytes" : 201326592
}