improved caching
- new caching backend
- new access via transformer.cache()
- cache on multiple columns
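A rough sketch of how the proposed API might read (`.cache()` and the `on=` argument are taken from the bullets above and are not an existing PyTerrier API):

```python
import pyterrier as pt

# assumes an existing Terrier index reference
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")

# cache BM25's results using the default key (e.g. qid)
cached_bm25 = bm25.cache()

# cache keyed on multiple columns, e.g. the qid and the query text
cached_bm25 = bm25.cache(on=["qid", "query"])
```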
If the controls here change (maybe one is added, or a class is renamed in core?), the cache record would become invalid, even if it means the same thing semantically. I'm not sure if there is any way around this, though.
I think this is by design - in the case of BR, if the controls change, then the results are assumed to change. Cache results are not intended to be long-lived; think of them more as being for a series of experiments that use the same baseline.
The person who writes __repr__ for a component may be unaware of this use of the value. Within the main pyterrier library, it can be controlled well enough, but in extension libraries it will be harder to moderate (e.g., onir_pt).
I think that's why it will throw an error if you try to cache something that isn't cachable, i.e. that doesn't define a reasonable __repr__. Isn't this a case of making the first integration easy, then exposing more functionality incrementally?
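As an illustration, an extension transformer could make itself cachable by putting everything that affects its output into __repr__ (a sketch only, assuming the pt.Transformer base class; the error-on-missing-__repr__ behaviour is as described above):

```python
import pandas as pd
import pyterrier as pt

class MyReranker(pt.Transformer):
    """Toy reranker, just to illustrate a cache-friendly __repr__."""

    def __init__(self, model_name: str, batch_size: int = 16):
        self.model_name = model_name
        self.batch_size = batch_size

    def transform(self, res: pd.DataFrame) -> pd.DataFrame:
        # scoring logic elided
        return res

    def __repr__(self):
        # include every setting that changes the output, so the cache key
        # changes whenever the configuration changes
        return f"MyReranker(model_name={self.model_name!r}, batch_size={self.batch_size})"
```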
Relatedly, pipelines using pt.apply cannot be cached, as a lambda cannot be guaranteed to be the same between invocations.
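For example, the lambda below has no stable repr (it prints as a memory address such as `<function <lambda> at 0x...>`), so there is nothing reliable to key the cache on (reusing the bm25 retriever from the sketch above):

```python
import pyterrier as pt

# cannot be cached reliably: the lambda's repr differs between sessions
pipeline = bm25 >> pt.apply.doc_score(lambda row: row["score"] * 2)
```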
In general, there will need to be more documentation on how to build custom transformers, particularly if we integrate type checking.
The current setup cannot really handle caching the results of neural re-rankers because the keys can conflict across datasets. (E.g., qid=1 means different things in different datasets.)
In this case the user should cache on=['qid', 'query'].
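Under the proposed on= argument, that would look something like the following (reranker is a placeholder transformer):

```python
# hypothetical: key the cache on both qid and query text, so qid=1 from one
# dataset does not collide with qid=1 from another
cached_reranker = reranker.cache(on=["qid", "query"])
```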
I feel like I often want to keep a copy of the resulting dataframes, not just get back the performance.
See also #163. I guess here we are discussing whether caching is the same use case as #163 or not.
There are even some edge cases for BatchRetrieve that could cause conflicts: transformer.search() assigns a qid of 1 by default, which conflicts with past search() invocations.
Yes, perhaps .search() should use "search01" or something. I didn't want to have a counter, though. Something like 'search-%d' % time.time()?
bm25 >> cache('my-cache')
I had always thought of a cache transformer as wrapping a transformer rather than being composed with it. Now I'm not so sure that it can't be addressed, at least internally, by composition.
While some form of this bm25 >> cache may work semantically, I think my idea was to make this faster to write using the ~ operator. Inevitably, it would end up being bm25 >> pt.cache().
What is being keyed on may be addressed by the pipeline validation, when pipeline components have to declare what their inputs and outputs are. Would you prefer to park this until pipeline validation is ready, and then revisit it?
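For reference, the unary ~ operator already wraps a transformer in a ChestCacheTransformer; the composed pt.cache() form is the hypothetical alternative being discussed here:

```python
# existing: unary caching operator, wraps bm25 in a ChestCacheTransformer
cached_bm25 = ~bm25
results = cached_bm25.transform(topics)  # topics: a dataframe of queries

# discussed: an explicit composed form (hypothetical API, not implemented)
# cached_pipeline = bm25 >> pt.cache("my-cache")
```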
What would be required to make the ChestCacheTransformer also work on rerankers?
Caching the results of expensive multi-stage rerankers (such as monoT5 & duoT5) is crucial for good computational performance.
Currently, the concatenation operator (^) is insufficient for this task. Suppose I had the pipeline (BM25%1000) ^ (BM25%1000 >> monoT5%500) ^ (BM25%1000 >> monoT5%500 >> duoT5%25), i.e. I want the top 1000 documents, the top 500 of those sorted even better, and the top 25 sorted best. Then BM25 would run three times, and monoT5 twice (very expensive).
For a project of mine I am doing the optimized version manually (see https://github.com/CodingTil/2023_24---IRTM---Group-Project/blob/main/py_css/models/baseline.py#L71 and https://github.com/CodingTil/2023_24---IRTM---Group-Project/blob/main/py_css/models/base.py#L100), but I believe it would help a lot of users if this was a native feature.
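A sketch of how caching could remove some of the repeated work in that pipeline: the shared BM25 stage is wrapped once with the existing ~ operator so the later branches hit the cache, while the duplicated monoT5 stage would additionally need a scorer-level cache keyed on (query, docno), as discussed below. This is an assumption about how the pieces would compose (bm25, monoT5 and duoT5 are placeholder transformers), not tested code:

```python
# cache the shared first stage so it is computed once and reused by the
# second and third branches of the concatenation
first_stage = ~(bm25 % 1000)

pipeline = (
    first_stage
    ^ (first_stage >> monoT5 % 500)
    ^ (first_stage >> monoT5 % 500 >> duoT5 % 25)
)
```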
@CodingTil see https://github.com/seanmacavaney/pyterrier-caching for a reranking cacher
We're going to mature that separate package until it's ready to integrate properly into pyterrier.
^ Note that ScorerCache from the package will not be suitable for caching results from DuoT5, since it only caches results based on query and docno. The "caveats" section hints at this, but I'll make it more direct.
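For completeness, a sketch of how the package's ScorerCache might be used to cache a pointwise scorer like monoT5 (the constructor arguments are from my reading of the README, so check the repository for the exact API; per the note above, it would not help for duoT5):

```python
import pyterrier as pt
from pyterrier_t5 import MonoT5ReRanker
from pyterrier_caching import ScorerCache

bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")  # index_ref: an existing index
monoT5 = MonoT5ReRanker()

# entries are keyed on (query, docno), which suits pointwise scorers like
# monoT5 but not pairwise ones like duoT5
cached_monoT5 = ScorerCache("monot5.cache", monoT5)

pipeline = (bm25 % 1000) >> (cached_monoT5 % 25)
```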