# [META] 8.x PQ improvements
These are the PQ issues that should be looked at in 7.x
## Phase 1
### Permissions & .lock file
- [x] Logstash fails to acquire lock on PQ lockfile (reentrant?) #10572 PRs: #12023 #12019
- [x] LockException: The queue failed to obtain exclusive access - likely because of not closing the PQ after a write exception, PRs: #12023 #12019
### Page file size too small exception
- [x] [@kaisecheng] PQ: improve pqrepair or queue open to handle fully acked 0 byte pages #10855
- [x] [@kaisecheng] Confusing error: Logstash failed to create queue {"exception"=>"Page file size is too small to hold elements" #8480
### Docs
- [x] Better explain behaviour of queue.max_bytes setting #10718
- [ ] ~~PQ Docs: add note about upgrading #9762~~
- [x] Doc changes needed for the newly updatable `queue.page_capacity`? #8650
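For reference, the settings discussed in the docs items above live in `logstash.yml`; a minimal sketch (the values are illustrative, not recommendations):

```yaml
# logstash.yml — persisted queue settings (illustrative values)
queue.type: persisted
queue.max_bytes: 1gb        # total queue capacity; must be >= queue.page_capacity
queue.page_capacity: 64mb   # size of each page file (default 64mb)
```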
### Recovery (pqcheck/pqrepair)
- [x] ~~[@kaisecheng] PQ data page with corrupt event length #10184~~
- [x] [@kaisecheng] PQ data checkpoint handle tmp leftovers #11548
- [x] [@kaisecheng] Revisit pqrepair
- [x] https://github.com/elastic/logstash/issues/13725 https://github.com/elastic/logstash/pull/13726
## Phase 2
Review, triage, and/or fix issues in this list: https://github.com/elastic/logstash/labels/persistent%20queues
### Needs fixing
- [x] [@kaisecheng] PQ AccessDeniedException on Windows in checkpoint move https://github.com/elastic/logstash/pull/13902
- [x] [@kaisecheng] queue.max_bytes >= queue.page_capacity check is not done in multipipelines https://github.com/elastic/logstash/pull/13877
- [x] Verify out-of-date firstUnackedPageNum that points to a purged page #6592 https://github.com/elastic/logstash/pull/14147
- [x] PQ signal notfull state when a batch is read but in reality it is still full #6801
- [x] queue drain should not start shutdown watcher till queue empty https://github.com/elastic/logstash/pull/13934
- [x] https://github.com/elastic/logstash/pull/13935
- [x] PQ exception leads to crash upon reloading pipeline by not releasing PQ lock #12005
- [x] Document that PQ and DLQ should not be set on NFS #12097
- [x] [Doc] Add missing indicators of supported queue type on Logstash settings #12536
### No fix required
- [x] Available disk space check for PQ does not count unused space in page files #10047
- [x] Logstash fails to start when the queue size is bigger than the available memory #11785 Logstash seems to map all PQ pages in open()
## Phase 3 & Beyond
### Needs discussion
- [ ] Enhanced Logstash Guaranteed Delivery
- [ ] https://github.com/elastic/logstash/issues/8458
- [ ] PQ default 64mb page could hold fewer elements than a configured large batch size #12102
- [ ] PQ does not use all resources #13906
- [ ] [performance] look for a more efficient isFull() and getPersistedByteSize() interactions #9038
- [ ] New queue type (circular queue) to configurably drop events to avoid upstream blocking from downstream back pressure #11601
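The page-size vs. batch-size mismatch in #12102 is simple arithmetic; a sketch (the 512 KB average event size and the batch size of 1000 are assumed for illustration, not taken from the issue):

```python
# Illustrative arithmetic, not Logstash code: with the default 64 MB page,
# the assumed average serialized event size bounds how many events fit in
# one page. A pipeline.batch_size larger than that count cannot be
# satisfied from a single page.
PAGE_CAPACITY = 64 * 1024 * 1024       # default queue.page_capacity (64 MB)
avg_event_bytes = 512 * 1024           # assumption: large 512 KB events
events_per_page = PAGE_CAPACITY // avg_event_bytes
batch_size = 1000                      # assumption: a configured large batch size

print(events_per_page)                 # 128
print(batch_size > events_per_page)    # True: one page holds fewer events than a batch
```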
### Needs watch
- [ ] File permissions can lead to PQ problems at startup #10715
- [ ] PQ AccessDeniedException on Windows in checkpoint move #12345
### Recovery (pqcheck/pqrepair)
- [ ] Add a new `pqdump` utility which can dump all/any data in a queue dir as JSON.
- [ ] Improve automatic recovery at Logstash startup
### Timeouts and batching
- Re-assess the state of queue write timeout handling with respect to plugins like the http input; is anything else required to move forward?
- [META] Queue timeouts + Batching #9389
### Performance
- There were a few issues about PQ design performance concerning the checkpointing strategy #7162 and (large) page file memory mapping #7317 #8801. I do not believe that substantial performance improvements will be prioritized in the foreseeable future.
- Decouple the write lock and read lock. https://github.com/elastic/logstash/issues/16158
I was reading https://www.elastic.co/blog/using-parallel-logstash-pipelines-to-improve-persistent-queue-performance and I see that there is an issue: "To put this another way, a single pipeline can only drive the disk with a single thread. This is true even if a pipeline were to have multiple inputs, as additional inputs in a single pipeline do not increase disk I/O threads."
I would call the proposed "solution" for improving overall performance a workaround that unfortunately does not work for all cases. We were planning to send syslog -> LS directly, not via multiple Filebeat instances.
Does support for additional persistent queue threads running in parallel match any of the issues described in this meta?
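For context, the blog's workaround splits traffic across multiple pipelines so each gets its own PQ and writer thread; a minimal `pipelines.yml` sketch (the pipeline ids and config paths are hypothetical):

```yaml
# pipelines.yml — parallel pipelines, each with its own persistent queue
- pipeline.id: syslog_a
  path.config: "/etc/logstash/conf.d/syslog_a.conf"
  queue.type: persisted
- pipeline.id: syslog_b
  path.config: "/etc/logstash/conf.d/syslog_b.conf"
  queue.type: persisted
```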
I would also add one more issue: after the PQ fills up, LS should check whether the ES cluster is in a read-only state and stop outputting to that cluster. A recheck every x seconds would help.
Similar to: https://github.com/elastic/logstash/issues/10023
Greetings, any update on this?
Thank you.