# [META] 8.x PQ improvements
These are the PQ issues that should be looked at in 7.x
## Phase 1
### Permissions & .lock file
- [x] Logstash fails to acquire lock on PQ lockfile (reentrant?) #10572 PRs: #12023 #12019
- [x] LockException: The queue failed to obtain exclusive access - likely because of not closing the PQ after a write exception, PRs: #12023 #12019
### Page file size too small exception
- [x] [@kaisecheng] PQ: improve pqrepair or queue open to handle fully acked 0 byte pages #10855
- [x] [@kaisecheng] Confusing error: Logstash failed to create queue {"exception"=>"Page file size is too small to hold elements" #8480
### Docs
- [x] Better explain behaviour of queue.max_bytes setting #10718
- [ ] ~~PQ Docs: add note about upgrading #9762~~
- [x] Doc changes needed for the newly updatable `queue.page_capacity`? #8650
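For reference, the settings discussed in the docs items above live in `logstash.yml`; a minimal sketch (the values are illustrative, not recommendations):

```yaml
# logstash.yml — persisted queue settings (illustrative values)
queue.type: persisted
queue.max_bytes: 1gb        # total queue capacity; must be >= queue.page_capacity
queue.page_capacity: 64mb   # size of each page file (default 64mb)
```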
### Recovery (pqcheck/pqrepair)
- [x] ~~[@kaisecheng] PQ data page with corrupt event length #10184~~
- [x] [@kaisecheng] PQ data checkpoint handle tmp leftovers #11548
- [x] [@kaisecheng] Revisit pqrepair
- [x] https://github.com/elastic/logstash/issues/13725 https://github.com/elastic/logstash/pull/13726
## Phase 2
Review, triage, and/or fix issues in this list: https://github.com/elastic/logstash/labels/persistent%20queues
### Needs fixing
- [x] [@kaisecheng] PQ AccessDeniedException on Windows in checkpoint move https://github.com/elastic/logstash/pull/13902
- [x] [@kaisecheng] queue.max_bytes >= queue.page_capacity check is not done in multipipelines https://github.com/elastic/logstash/pull/13877
- [x] Verify out-of-date firstUnackedPageNum that points to a purged page #6592 https://github.com/elastic/logstash/pull/14147
- [x] PQ signal notfull state when a batch is read but in reality it is still full #6801
- [x] queue drain should not start shutdown watcher till queue empty https://github.com/elastic/logstash/pull/13934
- [x] https://github.com/elastic/logstash/pull/13935
- [x] PQ exception leads to crash upon reloading pipeline by not releasing PQ lock #12005
- [x] Document that PQ and DLQ should not be set on NFS #12097
- [x] [Doc] Add missing indicators of supported queue type on Logstash settings #12536
### No fix required
- [x] Available disk space check for PQ does not count unused space in page files #10047
- [x] Logstash fails to start when the queue size is bigger than the available memory #11785 Logstash seems to map all PQ pages in open()
## Phase 3 & Beyond
### Needs discussion
- [ ] Enhanced Logstash Guaranteed Delivery
- [ ] https://github.com/elastic/logstash/issues/8458
- [ ] PQ default 64mb page could hold fewer elements than a configured large batch size #12102
- [ ] PQ does not use all resources #13906
- [ ] [performance] look for a more efficient isFull() and getPersistedByteSize() interactions #9038
- [ ] New queue type (circular queue) to configurably drop events to avoid upstream blocking from downstream back pressure #11601
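The page-size vs. batch-size mismatch in #12102 is simple arithmetic; a sketch (the 512 KB average event size and the batch size of 1000 are assumed for illustration, not taken from the issue):

```python
# Illustrative arithmetic, not Logstash code: with the default 64 MB page,
# the assumed average serialized event size bounds how many events fit in
# one page. A pipeline.batch_size larger than that count cannot be
# satisfied from a single page.
PAGE_CAPACITY = 64 * 1024 * 1024       # default queue.page_capacity (64 MB)
avg_event_bytes = 512 * 1024           # assumption: large 512 KB events
events_per_page = PAGE_CAPACITY // avg_event_bytes
batch_size = 1000                      # assumption: a configured large batch size

print(events_per_page)                 # 128
print(batch_size > events_per_page)    # True: one page holds fewer events than a batch
```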
### Needs watch
- [ ] File permissions can lead to PQ problems at startup #10715
- [ ] PQ AccessDeniedException on Windows in checkpoint move #12345
### Recovery (pqcheck/pqrepair)
- [ ] Add a new `pqdump` utility which can dump all/any data in a queue dir as JSON.
- [ ] Improve automatic recovery at Logstash startup
### Timeouts and batching
- Re-assess the state of queue write timeout handling with respect to plugins like the http input; is anything else required to move forward?
- [META] Queue timeouts + Batching #9389
### Performance
- There were a few issues about PQ design performance concerning the checkpointing strategy #7162 and (large) page file memory mapping #7317 #8801. I do not believe that substantial performance improvements will be prioritized in the foreseeable future.
- Decouple the write lock and read lock. https://github.com/elastic/logstash/issues/16158
I was reading https://www.elastic.co/blog/using-parallel-logstash-pipelines-to-improve-persistent-queue-performance and I see that there is an issue: "To put this another way, a single pipeline can only drive the disk with a single thread. This is true even if a pipeline were to have multiple inputs, as additional inputs in a single pipeline do not increase disk I/O threads."
I would call the proposed "solution" for improving overall performance a workaround that unfortunately does not work for all cases. We were planning to send syslog -> LS directly, not via multiple Filebeat instances.
Does support for additional persistent queue threads running in parallel match any of the issues described in this meta?
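For context, the blog's workaround splits traffic across multiple pipelines so each gets its own PQ and writer thread; a minimal `pipelines.yml` sketch (the pipeline ids and config paths are hypothetical):

```yaml
# pipelines.yml — parallel pipelines, each with its own persistent queue
- pipeline.id: syslog_a
  path.config: "/etc/logstash/conf.d/syslog_a.conf"
  queue.type: persisted
- pipeline.id: syslog_b
  path.config: "/etc/logstash/conf.d/syslog_b.conf"
  queue.type: persisted
```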
I would also add one more issue: after the PQ fills up, LS should check whether the ES cluster is in a read-only state and stop outputting to that cluster. A recheck every x seconds would help.
Similar to: https://github.com/elastic/logstash/issues/10023
Greetings, any update on this?
Thank you.