BlackLab
Linguistic search for large annotated text corpora, based on Apache Lucene
It can be useful to skip the cache for debugging purposes, but this currently causes problems: code such as `docsCount = searchParam.docsCount().executeAsync().peek();` relies on getting a running docs count...
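A minimal sketch of one way around this (class and method names are hypothetical, not BlackLab's actual API): when caching is disabled for debugging, the cache could still start the search and hand back a live entry so that `peek()` keeps reflecting the running count; the entry just never gets stored for reuse.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch: a cache entry that can be peeked at while the search runs.
interface CacheEntry<T> {
    T peek();                       // current (possibly partial) result, e.g. a running docs count
    CompletableFuture<T> future();  // completes when the search is finished
}

// "Skip the cache" could mean "never store", not "never create an entry":
class NoReuseCache {
    <T> CacheEntry<T> getOrStart(String key, Supplier<CacheEntry<T>> startSearch) {
        return startSearch.get();   // live entry returned, peek() still works; nothing is cached
    }
}
```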
[Jackson Streaming API](https://www.baeldung.com/jackson-streaming-api#writing-to-json) does most of what DataStream does now. Exceptions are contextList (lists of values for annotations that get a different structure in XML and JSON) and all-in-one status/error...
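A hedged sketch of what writing a response with Jackson's streaming `JsonGenerator` looks like (the field names below are illustrative, not BlackLab's actual response format):

```java
import java.io.StringWriter;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

public class JacksonStreamingExample {
    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        JsonGenerator gen = new JsonFactory().createGenerator(out);

        gen.writeStartObject();
        gen.writeStringField("status", "ok");   // illustrative fields only
        gen.writeFieldName("hits");
        gen.writeStartArray();
        gen.writeStartObject();
        gen.writeStringField("docPid", "doc0001");
        gen.writeNumberField("start", 42);
        gen.writeNumberField("end", 43);
        gen.writeEndObject();
        gen.writeEndArray();
        gen.writeEndObject();
        gen.close();

        // {"status":"ok","hits":[{"docPid":"doc0001","start":42,"end":43}]}
        System.out.println(out);
    }
}
```

The special cases mentioned above (contextList, the all-in-one status/error output) would still need custom handling on top of this.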
If a query matches an XML element (e.g. a `<p>` element), it would be nice if the attribute values of that element (such as the paragraph's ID) could be included in the...
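Purely as an illustration of the data this would add (not an existing BlackLab class), each hit that matched an element could carry that element's attributes alongside the usual positions:

```java
import java.util.Map;

// Hypothetical shape of a hit that matched a whole element such as <p id="p_1234">:
class HitWithCapturedElement {
    int docId;
    int start, end;                 // token span of the matched element
    String elementName;             // e.g. "p"
    Map<String, String> attributes; // e.g. {"id": "p_1234"} -- what this issue asks to expose
}
```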
We have `Hits.FETCH_HITS_MIN`, which is sometimes added to the requested number of hits to make sure we don't fetch a single hit every time while we're iterating through a list...
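A sketch of the batching idea this constant supports (names and the exact constant value are illustrative):

```java
// When an iterator needs hit i but fewer hits have been fetched so far, fetch at
// least a minimum-sized batch in one go instead of one hit per call.
class BatchedHitFetcher {
    private static final int FETCH_MIN = 20;  // illustrative value
    private int fetched = 0;

    void ensureHitAvailable(int index) {
        if (index < fetched)
            return;                                   // already fetched
        int target = Math.max(index + 1, fetched + FETCH_MIN);
        fetchUpTo(target);                            // one batched fetch
    }

    private void fetchUpTo(int target) {
        // ... read hits from the index until 'target' hits are available (or the query is exhausted) ...
        fetched = target;
    }
}
```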
If the user wants to know the total number of hits, but doesn't need all the hits, and is not sorting or grouping them, we might not need to instantiate...
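A sketch of what such a count-only pass could look like over a Lucene `Spans` (assuming we already have the `Spans` for the query; in Lucene 9 the class lives in `org.apache.lucene.queries.spans`): only counters are kept, no hit objects are created.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.spans.Spans;

class HitCountOnly {
    /** Count hits and matching docs without storing any hits. */
    static long[] countHitsAndDocs(Spans spans) throws IOException {
        long hits = 0, docs = 0;
        while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            boolean anyMatch = false;
            while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
                hits++;
                anyMatch = true;
            }
            if (anyMatch)
                docs++;
        }
        return new long[] { hits, docs };
    }
}
```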
Operations that use the forward index tend to be I/O-limited, and the forward index takes up a lot of disk space. As our larger corpora grow, and we've added more...
For example, if we group by `hit:word:i` (matched word(s), insensitive), and both `cat` and `Cat` appear in the corpus, the identity values `cws:word:i:cat` and `cws:word:i:Cat` denote the same group (insensitive context...
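One possible fix, sketched with plain JDK calls (the identity format here is simplified): normalize the value before building the group identity, so both spellings map to the same key.

```java
import java.text.Normalizer;
import java.util.Locale;

class GroupIdentity {
    /** Build a case/diacritics-insensitive group identity, e.g. cws:word:i:cat for both "cat" and "Cat". */
    static String insensitiveKey(String annotation, String value) {
        String normalized = Normalizer.normalize(value, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}", "")      // strip diacritics
                .toLowerCase(Locale.ROOT);     // fold case
        return "cws:" + annotation + ":i:" + normalized;
    }
}
```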
The current `ResultsCache` does not allow for monitoring of counts; see https://github.com/INL/BlackLab/pull/276#issuecomment-1061083756.
`ResultsCache` is the name of an alternative BlacklabCache, but the name does not reflect its purpose. Find a better name; see https://github.com/INL/BlackLab/pull/276#issuecomment-1060532223
Right now, in some (hopefully rare) scenarios, the same search could be in memory twice, wasting memory and CPU. This is because the logic that decides to remove a search (`SearchCacheEntry`) from...
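A sketch of the "at most one entry per query" idea using `ConcurrentHashMap` (class names hypothetical): `computeIfAbsent` guarantees that two identical requests can't each create and run their own copy of the search.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class SimpleSearchCache<K, E> {
    private final ConcurrentHashMap<K, E> entries = new ConcurrentHashMap<>();

    /** Return the existing entry for this key, or atomically create and store one. */
    E getOrStart(K key, Function<K, E> startSearch) {
        return entries.computeIfAbsent(key, startSearch);
    }

    /** Removal needs to be coordinated with in-flight requests, or a concurrent
     *  getOrStart may re-create the entry while the old one is still running. */
    void remove(K key) {
        entries.remove(key);
    }
}
```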