
[META] 8.x PQ improvements

Open colinsurprenant opened this issue 5 years ago • 3 comments

These are the PQ issues that should be looked at in 7.x

Phase 1

Permissions & .lock file

  • [x] Logstash fails to acquire lock on PQ lockfile (reentrant?) #10572 PRs: #12023 #12019
  • [x] LockException: The queue failed to obtain exclusive access - likely because of not closing the PQ after a write exception, PRs: #12023 #12019

Page file size too small exception

  • [x] [@kaisecheng] PQ: improve pqrepair or queue open to handle fully acked 0 byte pages #10855
  • [x] [@kaisecheng] Confusing error: Logstash failed to create queue {"exception"=>"Page file size is too small to hold elements" #8480

Docs

  • [x] Better explain behaviour of queue.max_bytes setting #10718
  • [ ] ~~PQ Docs: add note about upgrading #9762~~
  • [x] Doc changes needed to new updatable queue.page_capacity ? #8650

Recovery (pqcheck/pqrepair)

  • [x] ~~[@kaisecheng] PQ data page with corrupt event length #10184~~
  • [x] [@kaisecheng] PQ data checkpoint handle tmp leftovers #11548
  • [x] [@kaisecheng] Revisit pqrepair
    • [x] https://github.com/elastic/logstash/issues/13725 https://github.com/elastic/logstash/pull/13726

Phase 2

Review, triage and/or fix issues in this list https://github.com/elastic/logstash/labels/persistent%20queues

Needs fixing

  • [x] [@kaisecheng] PQ AccessDeniedException on Windows in checkpoint move https://github.com/elastic/logstash/pull/13902
  • [x] [@kaisecheng] queue.max_bytes >= queue.page_capacity check is not done in multipipelines https://github.com/elastic/logstash/pull/13877
  • [x] verify out-of-date firstUnackedPageNum that point to purged page #6592 https://github.com/elastic/logstash/pull/14147
  • [x] PQ signal notfull state when a batch is read but in reality it is still full #6801
  • [x] queue drain should not start shutdown watcher till queue empty https://github.com/elastic/logstash/pull/13934
  • [x] https://github.com/elastic/logstash/pull/13935
  • [x] PQ exception leads to crash upon reloading pipeline by not releasing PQ lock #12005
  • [x] Document that PQ and DLQ should not be set on NFS #12097
  • [x] [Doc] Add missing indicators of supported queue type on Logstash settings #12536
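
The `queue.max_bytes >= queue.page_capacity` item above concerns applying that check to every pipeline rather than only one. A minimal sketch of such a per-pipeline check follows. The function name `find_invalid_pipelines` and the settings dictionary layout are hypothetical, chosen for illustration; this is not Logstash's actual validation code.

```python
from typing import Dict, List

MB = 1024 * 1024


def find_invalid_pipelines(pipelines: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the ids of pipelines whose queue.max_bytes is below queue.page_capacity.

    `pipelines` maps a pipeline id to its queue settings in bytes.
    The layout is an assumption made for this sketch.
    """
    invalid = []
    for pipeline_id, settings in pipelines.items():
        max_bytes = settings["queue.max_bytes"]
        page_capacity = settings["queue.page_capacity"]
        # A queue smaller than one page could never hold a full page file.
        if max_bytes < page_capacity:
            invalid.append(pipeline_id)
    return invalid


# Illustrative values only: the second pipeline is misconfigured.
example = {
    "main": {"queue.max_bytes": 1024 * MB, "queue.page_capacity": 64 * MB},
    "syslog": {"queue.max_bytes": 32 * MB, "queue.page_capacity": 64 * MB},
}
print(find_invalid_pipelines(example))
```

The point of iterating over all pipelines is that a misconfigured secondary pipeline is reported too, instead of only the first one being checked.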

No fix required:

  • [x] Available disk space check for PQ does not count unused space in page files #10047
  • [x] Logstash fails to start when the queue size is bigger than the available memory #11785 Logstash seems to map all PQ pages in open()

Phase 3 & Beyond

Needs discussion

  • [ ] Enhanced Logstash Guaranteed Delivery
  • [ ] https://github.com/elastic/logstash/issues/8458
  • [ ] PQ default 64mb page could hold fewer elements than a configured large batch size #12102
  • [ ] PQ does not use all resources #13906
  • [ ] [performance] look for a more efficient isFull() and getPersistedByteSize() interactions #9038
  • [ ] New queue type (circular queue) to configurably drop events to avoid upstream blocking from downstream back pressure #11601
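
To illustrate the circular queue idea from #11601, here is a rough sketch of a bounded queue that drops the oldest event instead of blocking the writer when full. The class name `DroppingQueue` and the drop-oldest policy are assumptions made for this example; the issue does not specify which events would be dropped, and nothing like this exists in Logstash today as far as this thread shows.

```python
from collections import deque


class DroppingQueue:
    """Bounded in-memory queue that evicts the oldest event when full (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque with maxlen silently discards from the left on overflow.
        self._events = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, event) -> None:
        """Add an event without ever blocking; count an eviction if the queue is full."""
        if len(self._events) == self._events.maxlen:
            self.dropped += 1
        self._events.append(event)

    def pop(self):
        """Return the oldest remaining event, or None if the queue is empty."""
        return self._events.popleft() if self._events else None


queue = DroppingQueue(capacity=2)
for event in ("a", "b", "c"):
    queue.push(event)
print(queue.dropped)
```

The trade-off is explicit: upstream inputs never see back pressure, at the cost of losing events, which is why the issue describes the behaviour as configurable.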

Needs watch

  • [ ] files permissions can lead to startup PQ problems #10715
  • [ ] PQ AccessDeniedException on Windows in checkpoint move #12345

Recovery (pqcheck/pqrepair)

  • [ ] Add a new pqdump utility which can dump in JSON all/any data in a queue dir.
  • [ ] Improve automatic recovery at logstash startup

Timeouts and batching

  • Re-assess the state of queue write timeout handling WRT plugins like the http input, anything else required to move forward?
  • [META] Queue timeouts + Batching #9389

Performance

  • There were a few issues about the PQ design performance with the checkpointing strategy #7162 and (large) page files memory mapping #7317 #8801. I do not believe that substantial performance improvements will be prioritized in the foreseeable future.

  • Decouple write lock and read lock. https://github.com/elastic/logstash/issues/16158

colinsurprenant, Dec 17 '19 19:12

I was reading this https://www.elastic.co/blog/using-parallel-logstash-pipelines-to-improve-persistent-queue-performance

I see that there is an issue (“To put this another way, a single pipeline can only drive the disk with a single thread. This is true even if a pipeline were to have multiple inputs, as additional inputs in a single pipeline do not increase disk I/O threads.”).

I would call the proposed "solution" for improving overall performance a workaround that unfortunately does not work in all cases. We were planning to send syslog -> LS directly, without multiple Filebeat instances.

Does support for additional persistent queue threads running in parallel match any of the issues described in this meta issue?

zez3, Feb 18 '21 22:02

I would also add one more issue.

After the PQ fills up, LS should check whether the ES cluster is in a read-only state and, if so, stop outputting to that cluster. A recheck every x seconds would help.

Similar to: https://github.com/elastic/logstash/issues/10023
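
The recheck idea could look roughly like the sketch below. Everything here is hypothetical: `wait_until_writable` is not a Logstash function, and `is_read_only` stands in for whatever probe would ask the ES cluster about its state. The sleep function is injectable so the loop can be exercised without real delays.

```python
import time
from typing import Callable


def wait_until_writable(
    is_read_only: Callable[[], bool],
    recheck_seconds: float,
    max_checks: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a read-only probe until the cluster accepts writes again.

    Returns True as soon as the probe reports the cluster is writable,
    or False after `max_checks` probes that all reported read-only.
    """
    for attempt in range(max_checks):
        if not is_read_only():
            return True
        # Pause output and wait before probing again, except after the last probe.
        if attempt < max_checks - 1:
            sleep(recheck_seconds)
    return False


# Simulated probe: read-only for two checks, then writable.
states = iter([True, True, False])
waits = []
resumed = wait_until_writable(lambda: next(states), 5.0, 5, sleep=waits.append)
print(resumed, waits)
```

An output using this pattern would stop sending while the probe reports read-only, and resume only once a recheck succeeds.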

zez3, Feb 19 '21 06:02

Greetings, any update on this?

Thank you.

zalseryani, Mar 07 '24 07:03