Large DocValues field retrieval test
I am working on LUCENE-8374 and I would like to have independent verification of my findings. I have looked through luceneutil and it does not seem to have a test for my case. I could try and add a fitting test and make a pull request, but I fear that I might be doctoring it to fit my view on the issue, so I ask for input on this.
What I have found is a performance regression introduced in Lucene 7+ with the sequential API for DocValues. The regression is (not surprisingly) for random access of DocValues fields and gets progressively worse as segment size (measured in #documents) grows. The simplest real-world case is document retrieval as part of a search, with fields that are not stored but which have DocValues enabled.
Compounding factors are:
- The field is an integer Numeric (int/long/date)
- The field is DENSE (between 6% and 99% of the documents have a value for the field; the closer to 99%, the worse the regression)
- The number of documents in a segment (maxDoc is a usable indicator)
The pathological case is a 100M+ doc index merged down to 1 segment (which we have in abundance), but the effect is measurable even with segments of 2M documents with randomized searches and small result sets.
Does this sound like something that is covered by any of the tests in luceneutil? Maybe the taxi-corpus? I looked, but it seems to use floating point numbers, which aren't as affected.
Maybe the geo benchmarks? They use doc values for computing distance, sorting?
I think it's also possible to turn on SortedSetDVFacets.
But I don't think we have any large (BINARY?) doc values benchmarks -- let's add it!
Maybe the Index ctor (in competition.py) could take a new argument to introduce arbitrary sparsity, so that we could compare the sorting/faceting tasks between a sparse and a dense index, as well as compare the strategies that Toke has been exploring in LUCENE-8374 against the current approach that we use in master?
Aren't geo-coordinates floats? That would work for a partial benchmark, but would not hit the variable bits-per-value packing used for whole numbers in the Lucene70DocValues codec.
I wrote the artificial test TestDocValues.testNumericretrievalSpeed for probing bottlenecks. At its heart it simply requests the value from a numeric DocValues-field for batches of docIDs in increasing order.
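For illustration, the iterator contract behind this probe can be modeled in a few lines of plain Python (a toy sketch, not Lucene code; `ToyNumericDocValues` and `retrieve_batch` are invented names that only loosely mirror `NumericDocValues.advanceExact`):

```python
class ToyNumericDocValues:
    """Toy stand-in for a numeric doc-values iterator: values may only be
    read in non-decreasing docID order, mirroring the Lucene 7+ contract."""

    def __init__(self, values):
        # values: dict of docID -> long; docIDs without an entry have no value
        self.values = values
        self.doc = -1

    def advance_exact(self, target):
        # The iterator may never move backwards.
        if target < self.doc:
            raise ValueError("doc values iterator cannot go backwards")
        self.doc = target
        return target in self.values

    def long_value(self):
        return self.values[self.doc]


def retrieve_batch(dv, doc_ids):
    """Request the value for each docID of a batch, in increasing order --
    the access pattern described for the probe above."""
    out = {}
    for doc in sorted(doc_ids):
        if dv.advance_exact(doc):
            out[doc] = dv.long_value()
    return out


dv = ToyNumericDocValues({0: 10, 5: 50, 9: 90})
print(retrieve_batch(dv, [9, 0, 5]))  # {0: 10, 5: 50, 9: 90}
```

The point of the model is the one-way constraint: once the iterator has passed a docID, an earlier value can only be reached by starting over, which is what makes random-ish access expensive on large segments.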
If luceneutil were for Solr, I'd suggest random searches with the documents being populated from DocValues fields. My knowledge of direct Lucene use is rusty, so I don't know whether the retrieval pattern I describe is realistic for Lucene or if there is a better way?
We already have e.g. sorting tasks that read from doc values. See for instance HighTermDayOfYearSort in tasks/wikimedium.10M.nostopwords.tasks. Maybe we could tweak the indexing logic to add arbitrary ratios of empty documents, in order to make doc-values fields arbitrarily sparse. By configuring one competitor to be sparse and the other to be dense, we could see how much slower queries are with sparse dv fields. And by configuring both competitors to use a sparse index, but only one having your optimizations from LUCENE-8374, we could see how much faster sorting is with these enhancements for sparse fields.
Sorting is a fine test for checking for regressions. For random access impact measurement it is poor as it extracts the values sequentially with skip-ahead: the skip cache only activates when the destination is two jumps ahead of the current position. For the BLOCK cache that is 131K documents, for the DENSE cache it is 1K documents, and for vBPV it is 32K values. So for a 10M-document corpus, there will be no (or even a negative) effect for sorted searches when the hit count exceeds 10K documents (10M/1K). The relative effect will grow as the hit count approaches 0, but it is likely to be unmeasurable due to the corresponding amount of jitter for tiny searches.
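The arithmetic behind that 10K-hit threshold can be made explicit (plain Python back-of-the-envelope; the 1024-document threshold is the two-jump DENSE figure quoted above, the function name is made up):

```python
def dense_skip_helps(max_doc, hit_count, dense_threshold=1024):
    """Average gap between consecutive sorted hits, and whether that gap
    is large enough for the DENSE skip cache (~two jumps of 512 docs)
    to activate during sequential extraction with skip-ahead."""
    avg_gap = max_doc / hit_count
    return avg_gap, avg_gap > dense_threshold

# 10M-doc corpus: above ~10K hits the average gap drops to 1K docs or less,
# so sorted extraction sees no benefit from the DENSE cache.
print(dense_skip_helps(10_000_000, 10_000))  # (1000.0, False)
print(dense_skip_helps(10_000_000, 5_000))   # (2000.0, True)
```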
Retrieval of DocValues-stored fields for delivering top-X documents for a search is quite common for both Elasticsearch & Solr. Is there a test for that in luceneutil or maybe something that can be tweaked to measure it?
For random access impact measurement it is poor as it extracts the values sequentially with skip-ahead
Doc values have been designed to handle sorting and faceting, so these are the use-cases that we should optimize for. I am pretty sure that there are things to improve for these access patterns already, especially for selective queries (hopefully in a way that doesn't introduce regressions with queries that match most documents), like your idea to have a rank index for DENSE blocks.
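For readers unfamiliar with the rank idea: a DENSE block stores a bitset of which documents have a value, and finding a value's ordinal means counting set bits up to the target document. A rank index precomputes those counts at fixed intervals so a lookup no longer counts from the start of the block. A minimal Python sketch under assumed parameters (one rank entry per 512 bits; all names invented for illustration):

```python
def build_rank(bits, span=512):
    """bits: list of 0/1 flags, one per doc in a DENSE block.
    Precompute the number of set bits before each span boundary."""
    rank, total = [], 0
    for i in range(0, len(bits), span):
        rank.append(total)
        total += sum(bits[i:i + span])
    return rank

def index_of(bits, rank, doc, span=512):
    """Ordinal of `doc` among documents with a value: the rank entry for
    its span plus a popcount limited to the doc's own span."""
    assert bits[doc], "doc has no value"
    start = (doc // span) * span
    return rank[doc // span] + sum(bits[start:doc])

# Every third doc has a value; the lookup counts at most 512 bits
# instead of 1536.
bits = [1 if d % 3 == 0 else 0 for d in range(2048)]
rank = build_rank(bits)
print(index_of(bits, rank, 1536))  # 512: set bits before doc 1536
```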
Retrieval of DocValues-stored fields for delivering top-X documents for a search is quite common for both Elasticsearch & Solr.
This is a use-case for stored fields. I know some users abuse doc values and use them to retrieve values for top hits but I disagree with optimizing for that use-case.
Do you consider Solr's export to be abuse? It requires doc values for delivering the result and is highly affected by the switch to iterative DVs.
I hadn't experimented with export before, so I probed a bit with & without the LUCENE-8374 caches. export seems even more susceptible to caching than document retrieval:
> du -sh /mnt/index/cloud/8.0.0/solr1/server/solr/ns80_shard1_replica_n1/data/index/
890G /mnt/index/cloud/8.0.0/solr1/server/solr/ns80_shard1_replica_n1/data/index/
> curl -s "http://localhost:9090/solr/ns80/select?q=*:*&rows=0" | jq .response.numFound
307171504
> curl -s "http://localhost:9090/solr/ns80/select?q=text:hestevogn&rows=0" | jq .response.numFound
52654
# hash: string, mostly ALL docs have values
> time curl -s "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=hash&lucene8374=true" > /dev/null
real 0m0.525s
user 0m0.012s
sys 0m0.006s
> time curl -s "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=hash&lucene8374=false" > /dev/null
real 0m2.225s
user 0m0.012s
sys 0m0.011s
# content_type_served: string, mostly DENSE
> time curl -s "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_served&lucene8374=true" > /dev/null
real 0m0.449s
user 0m0.016s
sys 0m0.010s
> time curl -s "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_served&lucene8374=false" > /dev/null
real 1m41.449s
user 0m0.018s
sys 0m0.006s
# content_length: int, mostly DENSE, variable bits per value
> time curl -s "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_length&lucene8374=true" > /dev/null
real 0m0.313s
user 0m0.012s
sys 0m0.005s
> time curl -s "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_length&lucene8374=false" > /dev/null
real 7m49.100s
user 0m0.024s
sys 0m0.008s
Usual caveats about our index being pretty much worst case apply.
We seem to have a bit of radio silence, @jpountz . Maybe I could persuade you to signal if you have lost interest in this issue or if you just don't have the time right now?
Sorry I was sure that I replied, but I must have forgotten to hit "Comment" before closing the tab.
Do you consider Solr's export to be abuse? It requires doc values for delivering the result and is highly affected by the switch to iterative DVs.
I'm not too familiar with this feature, but it seems to retrieve 30k documents at once in doc ID order, which is an access pattern that looks like what is done for faceting or sorting. So at first sight it doesn't look like an abuse. I'm a bit confused though as retrieving 30k values means that the access is not really random and that most lookups actually stay within the same block, so the block cache that you are suggesting wouldn't help much while the rank data-structure for dense blocks or some form of skipping for sparse blocks might help?
FWIW it looks like this feature is using advance() while it should use advanceExact().
I am not very familiar with the internal workings of export, so I don't know why it uses advance. Could be it is just an oversight? Regarding my example, there are 52,654 hits for hestevogn out of 307,171,504 documents, meaning roughly 1 match every 5,834 documents. So I agree that only DENSE caching should have an effect. But our int field content_length (which is vBPV) gets a severe speed-up. I am clearly missing something, and I find it important to figure out what, in order to make a proper test (I did re-run the tests and compared the outputs for validity). I'll try to find the time to dig deeper next week.
NB: I do not understand the reasoning behind outright refusing to consider changes that make an unintended but fairly widespread use-case work well. But I guess that is a discussion better taken on the mailing list.
It's not clear to me if we need to concern ourselves with the use-case of Solr's /export. I believe the goal is to speed up DocValues access assuming certain access patterns we expect in sorting and/or faceting. Perhaps /export will benefit, perhaps it won't. Am I missing something?
Could be it is just an oversight?
advanceExact was added after we switched to doc-value iterators, I believe we just never changed this call site to use this new API. It looks easy to fix?
I do not understand the reasoning behind outright refusing to consider changes that make an unintended but fairly widespread use-case work well
I'm not refusing to push changes that make the export handler perform better. All I'm saying is that doc values are designed to handle sorting and faceting so this is what we should optimize for. Like David said, odds are that making faceting and sorting perform better will also make the export handler perform better.
@dsmiley Do you find it irrelevant if changes to the Lucene codec lead to performance regressions in Solr export?
Avoiding going further into the debate about what this part of the software is for, let me try to take a step back.
The change to an iterative API for doc values led to a performance regression for cases with large jumps between the document IDs. This is no surprise - @jpountz mentions up front in LUCENE-7589 that skip lists might be a good idea.
The severity of the regression scales with segment size and compounds with DENSE and vBPV (variable bits per value) fields. Logically both faceting and grouping should be affected, with math saying that the relative effect will be larger as the result set gets smaller.
So another way to ask is: Is it interesting to measure the performance for faceted and grouped searches with a focus on relatively small result sets? If so, the next question is which faceting we are talking about: Lucene / Solr / Elasticsearch / something else? Maybe there is a common pattern, e.g. requesting doc values given sorted doc IDs?
Is it interesting to measure the performance for faceted and grouped searches with a focus on relatively small result sets?
Yes. For instance, luceneutil has LowTerm, MedTerm and HighTerm tasks to test performance for various numbers of hits. If we can make doc values access much faster for queries that have low numbers of matches, with reasonable trade-offs, we should totally do it.
which faceting we are talking about: Lucene / Solr / Elasticsearch / something else? Maybe there is a common pattern
Right they should all follow the same pattern: iterate over hits of the query and do something with the value, so it shouldn't matter.
Right they should all follow the same pattern: iterate over hits of the query and do something with the value, so it shouldn't matter.
Okay, we're getting somewhere now. The next question is what the philosophy is for luceneutil tests.
- One angle is to measure as directly against the agreed core usage as possible. In this case that could mean a test that simply retrieves doc values for different sorted doc ID sets. The downside is that all the secondary processing is ignored, which might lead to wrong conclusions about the severity of the impact of codec changes.
- Another angle is that measurements should be against the full stack, making them more real-world like. However, as this is luceneutil, I presume this means Lucene faceting only? The downside is that the secondary processing in the Lucene stack is not necessarily representative of the secondary processing in the Lucene-backed search engines.
So how to test? Maybe do both?
I think sorted queries are interesting for that reason: it is a realistic use-case, yet secondary processing is lightweight as most of the time the new document will not be competitive so the only processing that is done with the value is a long comparison.
FYI: I am currently busy with in-house work. I'll get back to the current issue later.
Regarding the abuse thing, I have posted "DocValues, retrieval performance and policy" on the developer mailing-list to get a better understanding.