openlibrary Tweak solr cache configs

We currently have two caches with very high hit rates: the filterCache and the perSegFilter cache.

filterCache is, I believe, used whenever we use the fq parameter. We use this for a lot of things, like filtered author queries, reading log queries, etc.
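For illustration, here is a minimal sketch of what a request with fq clauses looks like. The endpoint, core name, and the author_key value are hypothetical placeholders, not taken from this thread. As far as I know, Solr caches each fq clause as its own filterCache entry, so a clause that recurs across requests can be reused.

```python
from typing import List
from urllib.parse import urlencode

# Hypothetical Solr endpoint; the real host and core name are not given in this thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/openlibrary/select"


def build_filtered_query(q: str, filters: List[str]) -> str:
    """Return a select URL with one fq parameter per filter clause."""
    params = {"q": q, "fq": filters, "rows": 10}
    # doseq=True expands the list into repeated fq=... parameters.
    return f"{SOLR_SELECT_URL}?{urlencode(params, doseq=True)}"


# "author_key:OL1A" is a made-up filter value used only for this example.
url = build_filtered_query("tolkien", ["type:work", "author_key:OL1A"])
print(url)
```

Each element of the filters list becomes a separate fq parameter, which is the unit that would be looked up in the filterCache.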

perSegFilter is apparently "custom cache currently used by block join". I believe we use block join indirectly for our edition-aware queries, which power huge chunks of the site.

Increasing the sizes of these caches might improve solr performance.

Stakeholders

@scottbarnes @tfmorris

Nov 14 '25 16:11 cdrini

perSegFilter seems like a dead end; it currently has a size of 1, so it wouldn't benefit much from increasing its max size.

Nov 14 '25 17:11 cdrini

Increasing filterCache seems much more promising, since it's sitting at its max size.

Nov 14 '25 17:11 cdrini

Is there a Solr performance issue? Is there a description of it somewhere?

High cache hit ratios are actually a good thing. The tiny perSegFilter cache has a perfect 100% hit rate.

The hit ratio is a percentage of queries served by the cache, shown as a number between 0 and 1. Higher values indicate that the cache is being used often, while lower values would show that the cache isn’t helping queries very much. Ideally, this number should be as close to 1 as possible.

https://solr.apache.org/guide/solr/latest/configuration-guide/caches-warming.html#monitoring-cache-sizes-and-usage

The 93% cumulative hit ratio for the filterCache seems pretty good: almost 400M of 430M lookups were already served from the cache. The cache is already almost 200 MB, but if you have lots of extra memory and want to experiment, you could try increasing the size of the cache and monitoring what that does to the hit rate.
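As a quick check of that arithmetic, here is a small sketch using the rounded counts quoted above. The exact counters are not given in this thread, so these values are approximations.

```python
def hit_ratio(hits: int, lookups: int) -> float:
    """Fraction of lookups answered from the cache (0.0 when there were no lookups)."""
    return hits / lookups if lookups else 0.0


# Rounded counts from the comment above ("almost 400M of 430M"), not exact values.
hits = 400_000_000
lookups = 430_000_000

ratio = hit_ratio(hits, lookups)
misses = lookups - hits
print(f"hit ratio ~ {ratio:.2f}, misses ~ {misses:,}")
```

With these rounded inputs the ratio comes out at about 0.93, leaving roughly 30M lookups that a larger cache could, at best, turn into hits.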

Nov 14 '25 17:11 tfmorris

+1! The high hit ratio made me home in on this as an effective cache, and seeing that it was often sitting entirely full made me think that increasing its size could have a performance impact if the hit ratio remained high.

I experimented with going from 512 to 1024 and then 2048. The hit ratio remained ~91%, and the storage used went up to ~400 MB. Overall there was no visible impact on solr performance though:

(First line is 1024, second 2048)

We're not having explicit solr performance issues, but we have been struggling with performance due to increased network traffic from distributed, non-identifying crawlers, often causing Solr strain. I'm tackling the crawlers in a separate approach, but also always on the lookout for improvements we can make that can increase Solr's throughput.

Our slowest solr queries tend to be facet queries, which are supposedly aided by the filterCache.

I did notice one interesting thing, which was that the cache empties roughly every minute. This is apparently due to the auto soft commits we have every minute for near-real-time search. So there might also be some opportunities to modify our cache warming queries to be more in line with the types of queries we actually use:

https://github.com/internetarchive/openlibrary/blob/fe87a75fcf7fc8e02e524335c42f85da64fb38fe/conf/solr/conf/solrconfig.xml#L517-L588
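To make the warming idea concrete, here is a toy simulation, not a measurement of our Solr. It assumes an unbounded cache that is emptied at a fixed interval and optionally pre-filled with a set of "warm" keys after each flush, standing in for warming queries. The workload mix and filter strings are made up.

```python
import random
from typing import Iterable, Set


def simulate(lookups: Iterable[str], flush_every: int, warm_keys: Set[str]) -> float:
    """Hit ratio for an unbounded cache that is emptied every `flush_every` lookups.

    After each flush the cache is pre-filled with `warm_keys`, standing in for
    warming queries run against the new searcher.
    """
    cache = set(warm_keys)
    hits = total = 0
    for i, key in enumerate(lookups):
        if i > 0 and i % flush_every == 0:
            cache = set(warm_keys)  # flush, then warm
        total += 1
        if key in cache:
            hits += 1
        else:
            cache.add(key)
    return hits / total if total else 0.0


rng = random.Random(42)
# Hypothetical "hot" filters and a made-up long tail of author filters.
hot = ["type:work", "type:edition", "type:author", "type:subject"]
# Made-up workload: about 70% of lookups use a hot filter, the rest hit the tail.
trace = [
    rng.choice(hot) if rng.random() < 0.7 else f"author_key:OL{rng.randrange(5000)}A"
    for _ in range(50_000)
]

cold = simulate(trace, flush_every=500, warm_keys=set())
warmed = simulate(trace, flush_every=500, warm_keys=set(hot))
print(f"no warming: {cold:.3f}  warmed with hot filters: {warmed:.3f}")
```

In this model, warming only helps for keys that are actually requested after a flush, which is the argument for aligning the warming queries with real traffic.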

Nov 17 '25 23:11 cdrini

Here are all the caches, if of interest:

Nov 17 '25 23:11 cdrini

I experimented with going from 512 to 1024 and then 2048. The hit ratio remained ~91%, and the storage used went up to ~400 MB. Overall there was no visible impact on solr performance though:

That means that the extra memory is being wasted, so the next thing to try is to go the opposite direction and see how much you can reduce the cache size without having a significant reduction in cache hit rate. Memory which is not helping cache effectiveness can be better used for system I/O buffers or other things that Solr needs.
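The kind of size sweep described here can be sketched offline before touching production. This is a rough model only: it replays a synthetic, skewed trace through a plain LRU cache, which is a stand-in and may not match the eviction policy of Solr's actual cache implementation. The trace, the key names, and the sizes are assumptions.

```python
import random
from collections import OrderedDict
from typing import List


def lru_hit_ratio(trace: List[str], max_size: int) -> float:
    """Replay `trace` through an LRU cache of `max_size` entries and return the hit ratio."""
    cache: "OrderedDict[str, None]" = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = None
            if len(cache) > max_size:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(trace) if trace else 0.0


rng = random.Random(0)
# Synthetic, skewed workload: key popularity falls off quickly (made-up distribution).
trace = [f"filter-{int(rng.paretovariate(1.2))}" for _ in range(100_000)]

results = {size: lru_hit_ratio(trace, size) for size in (128, 256, 512, 1024, 2048)}
for size, ratio in results.items():
    print(f"size={size:5d}  hit ratio={ratio:.3f}")
```

If the hit ratio flattens out well below the current size, that is the point past which extra entries stop paying for their memory. Replaying real query logs instead of a synthetic trace would give more trustworthy numbers.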

Nov 21 '25 21:11 tfmorris

p.s. You can confirm this with experimentation, but I have a sneaking suspicion that you might only have four important filters (with very large result sets): fq=type:edition, fq=type:work, fq=type:author, and fq=type:subject.

Nov 21 '25 21:11 tfmorris

It's unfortunately a bit tricky to measure whether a change causes a performance improvement; the change is still up and I'm monitoring. It would be nice to have a way to run a stress test to get more reliable numbers. From monitoring, the performance does appear to generally be a bit better; we're seeing fewer instances of solr strain. But we've made a cocktail of changes in an attempt to quell the rather debilitating performance issues we've had over the past few weeks and restore service, so it's unclear which change had the strongest impact.

Dec 08 '25 20:12 cdrini