org-ql
What's the performance?
A brief look through the codebase reveals that a lot of high-level commands are used here to do the matching.
Org mode, on the other hand, constructs regular expressions and does the searching in a very clever order to maximize performance (for example, not even bothering with tags when doing a full-text search, since that has a much higher hit probability; doing the tag match afterwards is a cheap and fast post-process).
Did you do some benchmarks? My agenda spans hundreds of thousands of lines, so even a 100% slowdown will be quite noticeable.
Please see the notes.org file, as there are several benchmarks and performance notes. Performance has been a priority from the very beginning.
Yes, it's generally slower because it checks every heading with predicates. The tradeoff is that the code is more composable and understandable than the 5-screens-long functions in org-agenda.el. Even so, for a lot of use cases, it's fast enough already. There is already a per-buffer, per-query cache that makes performance fast on repeated queries in unchanged buffers, which helps a lot.
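As a hedged illustration of how such a cache can work (hypothetical names, not necessarily org-ql's exact implementation), results can be keyed on the buffer and its modification tick, so any edit to the buffer invalidates the cached results automatically:

(defvar my/query-cache (make-hash-table :test #'equal)
  "Cache mapping (BUFFER TICK QUERY) keys to query results.")

(defun my/cached-query (buffer query run-fn)
  "Return results of QUERY in BUFFER, calling RUN-FN only on a cache miss."
  (let ((key (list buffer (buffer-chars-modified-tick buffer) query)))
    (or (gethash key my/query-cache)
        (puthash key (funcall run-fn) my/query-cache))))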
To improve performance, I have an experimental branch which begins to use "preamble" regexps to skip directly to potential matches rather than checking every entry, similar to the Org agenda code, and to the algorithm in helm-org-rifle. https://github.com/alphapapa/org-ql/tree/preamble-re With some clever optimization, and choosing the right predicate in a query to replace with a preamble regexp (need to find a better word than preamble), I think performance can be much better.
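As a rough sketch of the idea (a hypothetical function, not the branch's actual code): search the whole buffer for the preamble regexp, and only run the full predicate on the headings it lands in, instead of visiting every heading.

(require 'org)

(defun my/collect-matches (preamble-re predicate)
  "Return positions of headings matching PREDICATE.
PREAMBLE-RE is a regexp every match must satisfy; searching for it
first lets us skip non-candidate entries cheaply."
  (let (results)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward preamble-re nil t)
        (unless (org-before-first-heading-p)
          (save-excursion
            (org-back-to-heading t)
            (when (funcall predicate)
              (push (point) results))))
        ;; Continue from the next heading so each entry is collected at most once.
        (outline-next-heading)))
    (nreverse results)))

;; For a query like (todo "WAITING"), the preamble regexp could be
;; "^\\*+ +WAITING ", so only entries with that keyword are ever visited.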
Another idea I'd like to explore is to try using threads to yield results one-at-a-time to preserve UI responsiveness even during a long-running query. Since only one elisp thread can run at a time, and it isn't true multitasking, I don't know if that's currently feasible. Another possibility to maintain UI responsiveness might be to use async to run queries in a separate process, but due to the nature of Org, that requires careful setup and propagation of special variables.
Anyway, if you have any feedback that could help me expedite the improvements, that would be appreciated. I know you've dealt with these kinds of issues before. Thanks.
That's kind of what I was thinking we could do, as an analogy to rx: have the beautiful lisp syntax and try to automatically convert as much of it as possible into efficient regexps.
What I want to use this package for is something like your helm-org-rifle, but without depending on helm and with a more elaborate syntax to cover more options, for example:
- use "*" prefix to search the next token inside headers
- use ":" to represent tags
- use ">" to represent nesting, so that "foo > bar" will only give me entries which contain "bar" and are children of entries containing "foo".
and so on. Most of this can be expressed in org-ql and what can't we can add. But since this will be interactive it must be fast so performance is critical.
The yielding approach is also nice but makes everything massively more complicated. AFAIK we still can't pause threads... but a lot can be achieved even with timers and input handling (see concurrent sallet for example)
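As a rough sketch (hypothetical names, not org-ql or sallet code) of the timer-plus-input-handling approach: match pending headings in small chunks from a timer, and stop a chunk early as soon as the user starts typing, so the UI never blocks for long.

(defvar my/pending-headings nil
  "Headings that still need to be matched.")
(defvar my/results nil
  "Headings matched so far.")

(defun my/match-heading (heading)
  "Placeholder matcher; a real one would run the compiled query predicate."
  (string-match-p "TODO" heading))

(defun my/process-chunk ()
  "Match a small batch of pending headings, then yield back to the command loop."
  (let ((n 200))
    (while (and my/pending-headings
                (> n 0)
                (not (input-pending-p)))  ; stop early if the user is typing
      (let ((heading (pop my/pending-headings)))
        (when (my/match-heading heading)
          (push heading my/results)))
      (setq n (1- n))))
  ;; Re-arm the timer until everything has been processed.
  (when my/pending-headings
    (run-with-timer 0.05 nil #'my/process-chunk)))

;; To start: set `my/pending-headings' to the candidate headings and call
;; (my/process-chunk); results accumulate in `my/results'.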
That's kind of what I was thinking we could do, as an analogy to rx: have the beautiful lisp syntax and try to automatically convert as much of it as possible into efficient regexps.
Yes, that would be great. I imagine someone who had experience writing compilers would be able to do that more easily.
What I want to use this package for is something like your helm-org-rifle, but without depending on helm and with a more elaborate syntax to cover more options
I plan to eventually refactor helm-org-rifle into an org-rifle library and separate UI libraries, one for Helm, one for what's currently the "native" helm-org-rifle-occur that doesn't actually use Helm, and then maybe Ivy/Counsel, Sallet, etc. There's a branch I started on a while ago, but it needs to be rebased.
Ideally I guess it would make sense to have a unified searching backend, rather than one in org-ql and one in org-rifle, and a simpler syntax could be compiled into a sexp-based one.
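For instance, as a rough illustration (the shorthand string is made up, and it assumes org-ql's current predicate names), a shorthand like "*foo :work emacs" might compile down to:

(org-ql-select (org-agenda-files)
  '(and (heading "foo")
        (tags "work")
        (regexp "emacs")))

The nesting operator would still need a parent/child predicate, which, as discussed below, doesn't exist yet.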
use "*" prefix to search the next token inside headers
That would be a good feature to have in helm-org-rifle. I've thought about adding that myself. I think it can be added without much difficulty.
use : to represent tags
In case you didn't know, that's already in helm-org-rifle.
use > to represent nesting, so that foo > bar will only give me entries which contain bar and are children of entries containing foo.
That would be a great feature too. I've thought about adding parent/child selectors to org-ql, and there's a skip-subtree WIP branch, but I'm not sure how to best integrate that feature. I feel like adding it to the current design would be like bolting it on. It probably needs a better design that would more easily accommodate it.
a lot can be achieved even with timers and input handling (see concurrent sallet for example)
Sounds like I need to study that code. :) Thanks.
BTW, an even wackier idea would be to do background parsing of Org files and cache entries in memory so their attributes could be searched more quickly (that is, assuming that it would actually be quicker to search them that way; some searches might be faster with plain regexp searches in a buffer). Nicolas Goaziou mentioned some similar ideas on the Org mailing list a while back when we discussed speeding up or rewriting the agenda.
@alphapapa indeed I think something like org-lsp could exist... that would be sick indeed.
@alphapapa indeed I think something like org-lsp could exist... that would be sick indeed.
Now that's a great idea. It could be written in any language, could parse Org files much more quickly... Hm, much to think about. :)
A persistent index on disk would be nice, keeping track of tags, properties, etc, across many files.
@jtrakk I've done some prototyping of indexing Org files in SQLite, based on John Kitchin's work. I haven't touched the code in a while, but if you're interested, you can see it here: https://github.com/alphapapa/helm-org-rifle/tree/org-rifle/sandbox
The main problem with it is that files have to be completely reindexed when they change, and that requires deleting every entry for the file from the database, which can be slow for large Org files. Even so, I think performance could be acceptable if the indexer ran in a separate process. It might also be worth considering writing the indexer in another language.
Ideally, the database and indexer would be a separate package that org-ql and helm-org-rifle could depend on.
This is where the LSP kicks in: it can listen to edit events and do a "delta reindex" (you don't really need a full rescan if there's just one line added). I suspect indexing even a huge org collection would take seconds or minutes at most... LSP has async indexing already built in, so it's an ideal candidate.
You can see some of the work I've done on Elsa with the typescript LSP client... it's not elisp but it's very easy for prototyping.
@Fuco1 That's a very interesting idea, but does LSP support Org files? :)
Besides, Org files can change outside of Emacs, and the indexer needs to index files that aren't currently open in Emacs. That's the whole point, really; if we only cared about Org files that are already open in Emacs, there would be little to be gained from an external indexer and database, because helm-org-rifle and Org Agenda are pretty fast for Org files that are already open.
you don't really need a full rescan if there's just one line added
Well, I'm sure better programmers than me know good algorithms for that, but when I think about it, it's not as straightforward as it seems. Was that new line in the middle of an existing entry? Okay, we can reindex that one entry easily enough, delete its old entry from the database, insert the new one, done.
But wait, it's a text file, so that new line has shifted the buffer position of all subsequent entries in the file. How do we fix that? Do we run an UPDATE WHERE query and increment the position of all entries whose position is > the one that changed? Well, first we have to calculate the offset properly, and then run a SELECT WHERE to get all the changed entries, then increment the position of each one, and then do the UPDATE WHERE for all of them.
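For illustration, a single UPDATE can do that shift; here is a hedged sketch assuming Emacs 29's built-in SQLite support and a made-up entries table with file and position columns (none of this is from the org-rifle sandbox code):

(defun my/shift-entry-positions (db file change-pos offset)
  "Shift indexed positions in DB for FILE after CHANGE-POS by OFFSET.
OFFSET is the number of characters inserted (negative for deletions).
Uses Emacs 29's built-in `sqlite-execute'; the `entries' table is
hypothetical."
  (sqlite-execute
   db
   "UPDATE entries SET position = position + ? WHERE file = ? AND position > ?"
   (list offset file change-pos)))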
What if it was more than a simple insert? What if new headings were added or deleted? What if some of them were moved within the file? What if their outline level changed? What if their parent headings changed? Is LSP going to handle all that for me? Is it going to just give me a couple of SQL queries to run to make the database accurate again? Am I going to have to have a dedicated LSP process monitoring all of my Org files? Would that mean that I would have to do all modifications of those files through LSP so it knows about the changes?
I quickly arrive at the conclusion that it's best to reindex changed files from scratch. :)
But like I said, I'm sure there are brilliant programmers out there who could write very clever algorithms to do it efficiently. The question then is, could I maintain their code when they move on? :)
If I'm missing something, please do let me know, because I have a lot to learn, and I'd love to solve this problem in an elegant way.
Well, there is no LSP for org; that's what we would write, and that's what I'm proposing. Having an indexer is no more work than having a language server, except for the thin layer to provide the API, and the usability gain is huge (plus we can provide org features to vscode/vim/atom...).
Sure, the first version can be "dumb", and why not; maybe a full rescan is fast enough. Things can be optimized as needed. Language servers usually work in memory and don't keep persistent records; the assumption is that a full re-scan can be done relatively fast, and since it's a long-running process, stuff is cached.
The language server can watch files on disk as well as those open in an editor: editors emit "document changed" events, and a rewrite on disk would emit something like "the whole file changed, reindex please".
I mean, this is a huge undertaking if we go that way, but I think there is quite a nice payoff.
That sounds very interesting. If I understand correctly, it's almost like proposing an "Org daemon" that would keep Org files in memory and mediate changes to them. I wonder if that could enable an alternative implementation of Org in Emacs that worked on structured data rather than plain text, or something like that (of course the files would still be read and written as plain text).
Sort of related: have you ever used Chicken Scheme or Racket? Chicken compiles to C, and IIUC can build static binaries, which is appealing for distribution in some cases. Racket is very interesting, of course, and I keep hearing that its new implementation built on Chez Scheme will be very fast. Of course, I tend to prefer CL-like lisps over Schemes, but Chicken's and Racket's speed appeals to me, as well as Chicken's simplicity of distribution compared to CL-like image-based distribution (although I don't know a lot about that, either). I don't know what packaging and distribution is like for Racket programs.
Anyway, I've been wondering if Chicken or Racket might be good choices for writing some Org tools outside of Emacs.
TXR Lisp is also very interesting, with some really nice features that I would enjoy using in a Lisp. I keep reading about it (e.g. from its author's comments on Hacker News), and its documentation is incredible, but I don't know anything about its performance.
@alphapapa
The main problem with it is that files have to be completely reindexed when they change, and that requires deleting every entry for the file from the database, which can be slow for large Org files. Even so, I think performance could be acceptable if the indexer ran in a separate process. It might also be worth considering writing the indexer in another language.
FWIW, in my own simplistic little setup, I've been keeping a tag-to-filenames mapping in a json file which I update regularly with a Python script run by cron. I haven't had any issues with this simple strategy, but I'm considering switching to SQLite to get more kinds of data into the index. In CPython it takes a few seconds to reindex a large collection of tags across ~1000 files.
BTW, one thing I missed in my first shot was keeping track of the tag hierarchy structure (different from tag inheritance), so I can query for 'vehicle' and get 'car' like the org-tags-view query language does.
Haven't used either of those, but I think even writing this in elisp would be fast enough. The slowness in org comes not from elisp being slow but mostly from not having any kind of caching and just redoing a lot of work over and over. The server can (and probably should) run alongside your "main" emacs session in a separate (non-server) process.
@jtrakk That's an interesting system, sounds like you've optimized it for your needs. And yes, I often forget about tag hierarchies, because I don't think I use them in any of my tags. The need to declare them ahead of time makes them not very useful to me in general--I can imagine them being useful in specific circumstances. I wish they could be built more dynamically, or that there were a better system for defining them across sets of files.
@Fuco1 That would certainly be an interesting idea, having an "Org server" Emacs process. Let me know if you ever start on one and maybe I can help out. :)
FYI, I pushed some new code today that adds what I've called "preamble regexp" optimizations, which lift some selectors out of the per-entry predicate and into a buffer-wide regexp search. Some queries are now about 1.6x faster, and some others, like (todo "WAITING"), are about 15x faster, which is a huge improvement in large Org files with thousands of headings. It should be possible to add more optimizations in this pattern as well.
@alphapapa
The need to declare them ahead of time makes them not very useful to me in general
That's a good point which bothered me initially as well. My plan is to occasionally find all the tags with no parents in my files and define parents for those tags. I think this strategy solves that problem for my needs.
FYI, I pushed some new code today that adds what I've called "preamble regexp" optimizations, which lift some selectors out of the per-entry predicate and into a buffer-wide regexp search.
[XKCD comic]
(but really, that's the way to do it and I think the potential there is quite untapped even in the org core itself)
@Fuco1 Haha, an appropriate use of XKCD. :)
I think the potential there is quite untapped even in the org core itself
Well, Org is a sprawling beast, but AFAIK that's the main method used in Org. For example, most of the Org Agenda functions find entries that way, by building regexps that can search across an entire file. That's why it's been faster, in general, than org-ql. Hopefully this can narrow the gap in some cases, and with more work, maybe most of them.
As a follow up, this "regexp preamble" thing is working very well. When a selector has a corresponding preamble regexp, it can be very fast, even in large Org files.
At this point, a big step forward would be to implement a kind of query optimizer that would intelligently order selectors so that the ones with the fastest preambles are first in the query. It wouldn't always be so straightforward, though. For example, with a query like:
(and (todo "TODO")
     (regexp "Emacs"))
It would probably be faster to put the regexp selector first, because in a large file with a lot of TODOs, most of them probably don't mention Emacs, so searching directly for "Emacs" would save a lot of testing of TODO keywords. But another query might not be so straightforward, and I don't know how smart we could make the heuristic. So it eventually comes down to the user either learning enough about how it works to know the fastest way to order the query, or experimenting to find out.
One cool trick that elfeed implements is JIT-compilation of search queries. If you call a function in a map or a loop or something similar this makes a difference.
As for the heuristic, I think regexp search should always go first as those are usually pretty specific (that's what I implemented in my very basic helm-based org search). For the rest, I'm not sure either and I suspect it wouldn't make much difference except for tags and properties where inheritance can play a significant role.
One cool trick that elfeed implements is JIT-compilation of search queries. If you call a function in a map or a loop or something similar this makes a difference.
Yeah, I learned that trick from Chris's code. Both the "predicate" function and "action" function are byte-compiled before running the "query".
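A toy sketch of the trick (a hypothetical function, not org-ql's or elfeed's actual code): splice the query's values into a lambda form and byte-compile it once, before it gets called on every heading.

(require 'org)

(defun my/compile-regexp-predicate (regexp)
  "Return a byte-compiled predicate matching REGEXP within the entry at point."
  (byte-compile
   `(lambda ()
      (save-excursion
        (re-search-forward ,regexp (org-entry-end-position) t)))))

;; Usage: build it once per query, then call it once per heading:
;;   (setq pred (my/compile-regexp-predicate "Emacs"))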
As for the heuristic, I think regexp search should always go first as those are usually pretty specific (that's what I implemented in my very basic helm-based org search). For the rest, I'm not sure either and I suspect it wouldn't make much difference except for tags and properties where inheritance can play a significant role.
Sorry, I wasn't clear. In the example, both the (todo) selector and the (regexp) selector are optimized by using a regular expression search to go directly to potential matches before calling the predicate function. The issue is, when they're used in an and, which of those two selectors' regular expressions should be used to search the whole buffer directly for potential matches. Which one is fastest can depend on the buffer content.
Just out of curiosity a couple years later: has the regex preamble thing been implemented?
Just out of curiosity a couple years later: has the regex preamble thing been implemented?
Yes, long ago.