reintegrate data object caching
The new zed lake implementation temporarily removed the data object cache. The old code was removed but can still be found in the package lake/immcache in the commit history. This task is to put it back in a way that fits the new lake design.
This will provide significant performance benefits by utilizing the large memory footprint of beefy server instances: currently, data is re-read from storage on every query, whereas a cache would serve repeated reads from local memory.
In the old approach, a caching object was passed to the lake, which selectively accessed the cache via explicit logic in the lake code. A better design would wrap the cache accesses in the storage layer. This keeps things simple, and when we later need tighter integration with the scan planner and so forth, there is probably a cleaner way to do that than storing a cache pointer in the pool data structure.
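The wrapping idea can be sketched as a storage decorator that transparently caches immutable data objects. This is a minimal sketch, not the zed codebase's actual interfaces: the `Storage` interface, `cachingStorage` type, and FIFO eviction policy here are all hypothetical stand-ins chosen for brevity (a real implementation would likely use LRU and the lake's real storage engine API).

```go
package main

import (
	"fmt"
	"sync"
)

// Storage is a hypothetical stand-in for the lake's storage engine
// interface; the real interface in the zed codebase may differ.
type Storage interface {
	Get(key string) ([]byte, error)
}

// cachingStorage wraps any Storage and serves repeated reads of
// immutable data objects from memory, evicting in FIFO order once
// maxBytes is reached.
type cachingStorage struct {
	inner    Storage
	mu       sync.Mutex
	cache    map[string][]byte
	order    []string // insertion order for simple eviction
	size     int
	maxBytes int
}

func NewCachingStorage(inner Storage, maxBytes int) *cachingStorage {
	return &cachingStorage{
		inner:    inner,
		cache:    make(map[string][]byte),
		maxBytes: maxBytes,
	}
}

func (c *cachingStorage) Get(key string) ([]byte, error) {
	c.mu.Lock()
	if b, ok := c.cache[key]; ok {
		c.mu.Unlock()
		return b, nil // cache hit: no storage access
	}
	c.mu.Unlock()
	b, err := c.inner.Get(key)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	// Evict oldest entries until the new object fits.
	for c.size+len(b) > c.maxBytes && len(c.order) > 0 {
		old := c.order[0]
		c.order = c.order[1:]
		c.size -= len(c.cache[old])
		delete(c.cache, old)
	}
	if len(b) <= c.maxBytes {
		c.cache[key] = b
		c.order = append(c.order, key)
		c.size += len(b)
	}
	return b, nil
}

// countingStore is a fake backing store that counts physical reads.
type countingStore struct{ reads int }

func (s *countingStore) Get(key string) ([]byte, error) {
	s.reads++
	return []byte("data:" + key), nil
}

func main() {
	store := &countingStore{}
	cached := NewCachingStorage(store, 1<<20)
	cached.Get("a")
	cached.Get("a") // second read is served from cache
	fmt.Println("physical reads:", store.reads)
}
```

Because the cache lives entirely behind the storage interface, the lake code needs no cache-specific logic, which is the main attraction of this design over the old explicit-cache approach.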
In this initial approach, caching will simply be a config option to the zed service specifying the amount of memory to allocate to caching. In a subsequent task, we can implement a distributed cache across a cluster of local servers dedicated to caching object data, thus avoiding accesses to cloud storage. The two caching modes can be used together.
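As a sketch of the config option, the service could accept a human-readable memory size for the cache. The flag name `-object.cachesize` and the `parseMemSize` helper below are hypothetical, not existing zed CLI options:

```go
package main

import (
	"flag"
	"fmt"
	"strconv"
	"strings"
)

// parseMemSize parses values like "512M" or "8G" into a byte count.
func parseMemSize(s string) (int64, error) {
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "G"):
		mult, s = 1<<30, strings.TrimSuffix(s, "G")
	case strings.HasSuffix(s, "M"):
		mult, s = 1<<20, strings.TrimSuffix(s, "M")
	case strings.HasSuffix(s, "K"):
		mult, s = 1<<10, strings.TrimSuffix(s, "K")
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * mult, nil
}

func main() {
	// Hypothetical flag: 0 disables the data object cache.
	cacheSize := flag.String("object.cachesize", "0",
		"memory to allocate to the data object cache")
	flag.Parse()
	n, err := parseMemSize(*cacheSize)
	if err != nil {
		fmt.Println("invalid cache size:", err)
		return
	}
	fmt.Println("cache bytes:", n)
}
```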
(We previously considered using Redis as the caching layer; we should weigh that option against the benefit of a zed-lake-aware caching layer.)