Logstash stops processing when using persistent queues

LS version: 6.2.2
Operating System: Ubuntu 16.04
Kafka: 2.11-0.11.0.2
Config File: https://gist.github.com/ceeeekay/1dfe7ae18cadd0b903c859b106940462
Sample Data: any
Steps to Reproduce: Enable persistent queues on previously working pipeline

I've updated logstash.yml to use small persistent queues for a bit of extra redundancy, and have run into a problem where individual nodes will just stop processing messages after a random amount of time.

Kafka also shows the partition they're reading from as stopped.

The nodes themselves are still doing something, as they are still producing metrics, so I suspect it's related to the Kafka input somehow. I can't be sure this is the case though, so I'm raising this here.

Once a node stops, the remaining nodes pick up the messages from its Kafka partition after a ~5 minute delay. However, if left long enough, all the nodes eventually stop processing.

I really have no idea where to start looking as there is nothing of note in logstash-plain.log.

logstash.yml changes are:

queue.type: persisted
queue.page_capacity: 10mb
queue.max_bytes: 10mb
queue.checkpoint.writes: 256

If I disable persistent queues, Logstash goes back to behaving itself. Enable queues again, and I get more stoppages.

Jun 10 '18 23:06 ceeeekay

This has just happened again, with 6.3.1 this time, although it's taken about two days for nodes to stop processing messages (since they were last restarted).

Restart and all is well, although the entire pipeline was stopped long enough for me to be paged.

[edit] - I should be clear that local queues had only just been re-enabled since opening this issue, and stopped processing again within a few days.

Jul 15 '18 09:07 ceeeekay

4 months later and not even a reply to this issue.

This issue is still occurring with 6.4.3, and makes persistent queues completely unusable with a Kafka input.

Queueing aside, as this is the only way to get delivery guarantees from Logstash, this is a pretty big problem.

Nov 13 '18 03:11 ceeeekay

Bump

Anyone? It's been over 6 months of silence on this issue.

Jan 16 '19 20:01 ceeeekay

Hello, we have a similar issue with Logstash 6.7.1, using a TCP input and an Elasticsearch output.

The problem only occurs in our environment that ingests a lot of logs. The others don't have the issue.

After a few days, no more logs are ingested by Elasticsearch. There are no errors in the ES or LS logs. When we restart Logstash, logs are ingested again.

Aug 23 '19 12:08 rmillet-rs

Just pointing out that this is still broken in 7.4.2.

Everything operates fine for a while, and then it stops processing for no reason, with no visible error. Recently it happened to a node with only a beats input, so combined with @rmillet-rs's report, I don't think it has much to do with the type of input.

Even on a test system with low event rate (6/s) it won't last a week without stalling.

Nov 11 '19 00:11 ceeeekay

Any update here? I am seeing the same issue with Logstash 7.5.2.

Mar 12 '20 14:03 Sawan12345

Hello,

I might have the same issue here, but it might also be something else. My ELK stack is on version 7.6.1.

I have Logstash with PQ enabled, and once the persistent queue gets full, no logs are processed from then on. The CPU is also stuck at 100%, as if it's processing something but can't output it.

If I empty the PQ (delete the queue folder), everything gets back to normal for a while, meaning that logs are processed as expected. But as soon as the PQ fills up, ingestion gets stuck again.

No new events are received by Logstash (as the PQ is full), but nothing is processed from the PQ either, causing the whole stack to stall. That is the key point for me: I don't get why Logstash doesn't process events from the persistent queue. There are no details in the Logstash or Elasticsearch logs.

I guess this might be due to the logs overwhelming the Logstash instance, but does that make sense?

EDIT: Just an update, I have the same issue even without the persistent queue enabled. I guess mine is a different issue, if you don't have any problems without PQ.

Mar 23 '20 17:03 agi0rgi

Hi all,

I think I might have found a solution to my issue by tweaking the settings and the pipeline (it has been 3 days straight without an issue, which is a milestone for me). If you want to give them a try, here you go:

Add -Djruby.regexp.interruptible=true to jvm.options
Add the parameter timeout_millis => 10000 to each grok or kv filter

Hope that works for you too. Regards
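
To find grok or kv filters that still lack the timeout, a rough scan of the pipeline config may help. This is only a sketch: the function name is made up for illustration, and it assumes braces are balanced inside each filter block and does not skip commented-out lines.

```python
import re


def filters_missing_timeout(config_text, plugins=("grok", "kv")):
    """Return (plugin, line_number) for each block without timeout_millis."""
    opener = re.compile(r"\b(" + "|".join(map(re.escape, plugins)) + r")\s*\{")
    missing = []
    for match in opener.finditer(config_text):
        depth = 1
        pos = match.end()
        # Walk forward to the brace that closes this plugin block.
        while pos < len(config_text) and depth > 0:
            if config_text[pos] == "{":
                depth += 1
            elif config_text[pos] == "}":
                depth -= 1
            pos += 1
        block = config_text[match.end():pos]
        if "timeout_millis" not in block:
            # Report the 1-based line on which the plugin block starts.
            line_number = config_text.count("\n", 0, match.start()) + 1
            missing.append((match.group(1), line_number))
    return missing
```

Pass it the text of a pipeline file, and each returned pair names a filter and the line where its block starts.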

Mar 27 '20 09:03 agi0rgi

Hi everyone.

Problem: the same, Logstash stops processing when using persistent queues.
Configuration: Logstash 7.11.1

input {
  file {
    path => "${LOGSTASH_HOME}/request-tracing.log"
    start_position => beginning
    sincedb_path => "${LOGSTASH_HOME}/data/plugins/inputs/file/.request-tracking-file"
  }
}
filter {}
output {
  stdout {}
}

- pipeline.id: request
  pipeline.workers: 1
  pipeline.batch.size: 256
  path.config: "${LOGSTASH_HOME}/config/conf.d/*request.conf"
  queue.type: persisted
  queue.max_bytes: 8mb

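Reason: queue.page_capacity > queue.max_bytes (queue.page_capacity is not set above, so its 64mb default exceeds the 8mb queue.max_bytes).
Solution: set queue.page_capacity to a value less than queue.max_bytes (taking the default values into account).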
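
The rule above can be checked before deploying with a small script. This is a minimal sketch, not part of Logstash: the function names are made up, the defaults of 64mb and 1024mb are taken from the Logstash documentation and assumed to apply to your version, and the strict "less than" comparison simply follows the suggestion in this thread.

```python
import re

# Binary multipliers: the stats output in this thread reports 64mb as 67108864 bytes.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

# Documented defaults (assumption: unchanged in the version you run).
DEFAULT_PAGE_CAPACITY = "64mb"
DEFAULT_MAX_BYTES = "1024mb"


def to_bytes(size):
    """Convert a size string such as '8mb' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmgt]?b)\s*", size.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def page_fits_queue(page_capacity=DEFAULT_PAGE_CAPACITY, max_bytes=DEFAULT_MAX_BYTES):
    """Return True when page_capacity is strictly below max_bytes."""
    return to_bytes(page_capacity) < to_bytes(max_bytes)


if __name__ == "__main__":
    print(page_fits_queue(max_bytes="8mb"))    # the pipelines.yml above -> False
    print(page_fits_queue("10mb", "10mb"))     # the settings from the first post -> False
    print(page_fits_queue(max_bytes="192mb"))  # default page size, 192mb queue -> True
```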

Mar 11 '21 10:03 dasvex

Reason: queue.page_capacity > queue.max_bytes. Solution: set queue.page_capacity to a value less than queue.max_bytes (taking the default values into account).

Use the Logstash monitoring API to check it, for instance http://localhost:9600/_node/stats/pipelines?pretty

This lists all pipelines, and the response JSON contains the queue details for the persistent queue. So the 10mb example in the docs will never work unless you also change queue.page_capacity to less than 10mb. In my case I left queue.page_capacity at its default and used 192mb as queue.max_bytes.

"queue" : {
  "data" : {
    "free_space_in_bytes" : 53596934144,
    "storage_type" : "xfs",
    "path" : "/var/lib/logstash/queue/tomcat-access-logs"
  },
  "events" : 0,
  "type" : "persisted",
  "capacity" : {
    "max_unread_events" : 0,
    "page_capacity_in_bytes" : 67108864,
    "queue_size_in_bytes" : 10555,
    "max_queue_size_in_bytes" : 201326592
  },
  "events_count" : 0,
  "queue_size_in_bytes" : 10555,
  "max_queue_size_in_bytes" : 201326592
}
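
The same check can be scripted against that endpoint. This is a rough sketch rather than an official client: the function names are made up, and it assumes the response keeps each pipeline under a top-level "pipelines" object, with the "queue" fields shown above.

```python
import json
import urllib.request


def undersized_queues(stats):
    """Return {pipeline_id: (page_bytes, max_bytes)} for persisted queues
    whose page capacity is not strictly below the maximum queue size."""
    flagged = {}
    for pipeline_id, pipeline in stats.get("pipelines", {}).items():
        queue = pipeline.get("queue", {})
        if queue.get("type") != "persisted":
            continue
        capacity = queue.get("capacity", {})
        page = capacity.get("page_capacity_in_bytes")
        limit = capacity.get("max_queue_size_in_bytes")
        if page is not None and limit is not None and page >= limit:
            flagged[pipeline_id] = (page, limit)
    return flagged


def fetch_stats(url="http://localhost:9600/_node/stats/pipelines"):
    """Fetch and parse the pipeline stats from a local Logstash node."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)
```

On a node with the API enabled, undersized_queues(fetch_stats()) returns an empty dict when every persisted queue has a page capacity below its maximum size.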

Jul 26 '22 12:07 cinhtau
