Tracking issue for Cache-service
The cache service will be a cluster of data-block cache servers sitting between databend-query and the underlying storage service, providing fast data-block retrieval.
- [x] https://github.com/datafuselabs/databend/pull/6799
- [x] Architecture design: https://github.com/datafuselabs/opencache
- [ ] Freeze architecture design when the RFC task is closed.
Server-side tasks
The following are still tracking-issue-level tasks:
- [ ] Impl cache replacement policy: modified LIRS
- [ ] Impl `Chunkstore`. See arch.
- [ ] Impl `Objectstore`. See arch.
- [ ] Impl `Accessstore`. See arch.
- [ ] Impl `Manifeststore`. See arch.
- [ ] Impl service protocol: http2 PUT+GET (see the sketch after this list).
- [ ] Impl configuration and server initialization.
- [ ] Impl server restart procedure.
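For the protocol task above, here is a rough client-side sketch of what an http2 PUT+GET exchange could look like. It assumes a `reqwest`-style client and a made-up `/v1/block/{key}` path; neither is part of the design yet.

```rust
// Sketch only: the path and status-code handling are assumptions for
// illustration, not the OpenCache wire format.
use reqwest::{Client, StatusCode};

async fn put_block(client: &Client, endpoint: &str, key: &str, data: Vec<u8>) -> reqwest::Result<()> {
    client
        .put(format!("{endpoint}/v1/block/{key}"))
        .body(data)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}

/// Returns None on a cache miss so the caller can fall back to the underlying storage.
async fn get_block(client: &Client, endpoint: &str, key: &str) -> reqwest::Result<Option<Vec<u8>>> {
    let resp = client.get(format!("{endpoint}/v1/block/{key}")).send().await?;
    if resp.status() == StatusCode::NOT_FOUND {
        return Ok(None);
    }
    Ok(Some(resp.error_for_status()?.bytes().await?.to_vec()))
}
```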
Databend side tasks could be found at: https://github.com/datafuselabs/databend/issues/6803
Besides accelerating block retrieval from s3, the cache service may also be a good place to store the spilled data from a join statement's shuffle, which can reduce the OOMs caused by skewed joins.
> Besides accelerating block retrieval from s3, the cache service may also be a good place to store the spilled data from a join statement's shuffle, which can reduce the OOMs caused by skewed joins.
AFAIK, the latest decision about the temp data generated by a join is to store it in a persistent store such as s3. Since the cache server cannot provide a durability guarantee, the temp data has a chance of being evicted while a query execution still needs it.
To support such a requirement, the cache server has to let an application explicitly disable auto eviction for some data. And the application has to purge these caches explicitly when the join job is done.
Maybe objects having a durability configuration can be a future feature?
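To make that contract concrete, here is a hypothetical client flow. The `put_pinned` and `purge_prefix` calls are invented for illustration; no such API exists in OpenCache today.

```rust
use anyhow::Result;

/// Hypothetical interface: nothing like this exists in OpenCache yet.
trait PinnedCache {
    /// Store `data` and exclude it from auto eviction until purged.
    fn put_pinned(&self, key: &str, data: &[u8]) -> Result<()>;
    /// Drop every pinned entry under `prefix`, returning its space to the eviction pool.
    fn purge_prefix(&self, prefix: &str) -> Result<()>;
}

/// Spill join partitions as pinned entries, then purge them once the join is done.
fn run_join_with_spill(cache: &dyn PinnedCache, query_id: &str, parts: &[Vec<u8>]) -> Result<()> {
    for (i, part) in parts.iter().enumerate() {
        cache.put_pinned(&format!("{query_id}/spill/{i}"), part)?;
    }
    // ... run the join, reading spilled partitions back as needed ...
    // The application itself must purge; otherwise the pinned space is never reclaimed.
    cache.purge_prefix(&format!("{query_id}/spill/"))
}
```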
I think `Client side tasks` should be split into two parts:
- OpenDAL's `opencache` service implements `Accessor` based on the `opencache` API.
- Databend's `Temporary Operator` defines the cluster metadata format.
databend-meta supports opencache cluster metadata.
Should databend have special knowledge of caching services? Maybe they should be handled inside the opendal client?
> Besides accelerating block retrieval from s3, the cache service may also be a good place to store the spilled data from a join statement's shuffle, which can reduce the OOMs caused by skewed joins.
Based on the current design, this feature will be resolved by Temporary Services. The query can store temporary data:

OpenCache can be implemented as a temporary service too ~
> I think `Client side tasks` should be split into two parts:
>
> - OpenDAL's `opencache` service implements `Accessor` based on the `opencache` API.
> - Databend's `Temporary Operator` defines the cluster metadata format.
Do you mean that `Accessor` only connects to one `opencache` server and lets `Operator` do the load balancing job, such as with a consistent hash?
It would be nice if I understood you correctly.
> databend-meta supports opencache cluster metadata.
>
> Should databend have special knowledge of caching services? Maybe they should be handled inside the opendal client?

I agree :D
> Do you mean that `Accessor` only connects to one `opencache` server and lets `Operator` do the load balancing job, such as with a consistent hash?
OpenDAL's opencache client can handle the load balancing job and accept some options:
```toml
[cache]
type = "opencache"

[cache.opencache]
endpoints = ["192.168.0.2", "192.168.0.3"]
hash_methods = "ConsistentHash"
```
OpenDAL's `Operator` is just an `Arc` wrapper for `Accessor`; we should wrap the logic inside.
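To make the intent concrete, here is a minimal sketch of such client-side endpoint selection with a plain hash ring and virtual nodes. It illustrates the general technique, not the actual OpenDAL implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Minimal consistent-hash ring for picking an opencache endpoint per key.
/// A real client would add replication, health checks, and failover.
struct Ring {
    points: BTreeMap<u64, String>, // hash point on the ring -> endpoint
}

impl Ring {
    fn new(endpoints: &[&str], virtual_nodes: usize) -> Self {
        let mut points = BTreeMap::new();
        for ep in endpoints {
            // Virtual nodes spread each endpoint around the ring for balance.
            for i in 0..virtual_nodes {
                points.insert(hash(&format!("{ep}#{i}")), ep.to_string());
            }
        }
        Ring { points }
    }

    /// The first endpoint clockwise from the key's hash point serves the key.
    fn endpoint(&self, key: &str) -> &str {
        let h = hash(key);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next()) // wrap around the ring
            .map(|(_, ep)| ep.as_str())
            .expect("ring has no endpoints")
    }
}

fn hash(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Endpoints taken from the example config above.
    let ring = Ring::new(&["192.168.0.2", "192.168.0.3"], 128);
    println!("bucket/block/0001 -> {}", ring.endpoint("bucket/block/0001"));
}
```

With consistent hashing, adding or removing an endpoint only remaps the keys that hashed to that endpoint's ring segments, instead of reshuffling every key.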
Adding a namespace-like configuration may make operations on a multi-tenant deployment easier:
- cache keys can be isolated in different namespaces
- we can collect some info or restrict usage at the namespace level, like qps, memory limit, rate limit, etc.
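As an illustration of the qps restriction meant here, a minimal token-bucket limiter that a namespace could carry might look like the sketch below; none of this exists in OpenCache.

```rust
use std::time::Instant;

/// Minimal token-bucket limiter, sketching per-namespace qps accounting.
struct QpsLimiter {
    capacity: f64, // burst size, in requests
    tokens: f64,   // currently available tokens
    rate: f64,     // refill rate per second, i.e. the qps limit
    last: Instant, // time of the last refill
}

impl QpsLimiter {
    fn new(qps: f64) -> Self {
        QpsLimiter { capacity: qps, tokens: qps, rate: qps, last: Instant::now() }
    }

    /// Admit one request if the namespace is still under its qps limit.
    fn allow(&mut self) -> bool {
        let now = Instant::now();
        // Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = (self.tokens + self.rate * now.duration_since(self.last).as_secs_f64())
            .min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```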
This tracking issue combines two things:
- Cache support in Databend
- Distributed Cache services: OpenCache
I split them into https://github.com/datafuselabs/databend/issues/6803
> Adding a namespace-like configuration may make operations on a multi-tenant deployment easier:
I created a feature request for opencache: https://github.com/datafuselabs/opencache/issues/3
> Adding a namespace-like configuration may make operations on a multi-tenant deployment easier:
>
> - cache keys can be isolated in different namespaces
> - we can collect some info or restrict usage at the namespace level, like qps, memory limit, rate limit, etc.
I'd prefer not to introduce namespaces on the server side:
- Splitting the cache server disk space into several sub-spaces reduces resource efficiency: the reserved space in one sub-space cannot be used by another, heavily loaded sub-space.
- Every time a node is added or removed, or a tenant (sub-space) is added or removed, the quota of every sub-space needs to be adjusted.
Maybe it's better done on the client side.
> Splitting the cache server disk space into several sub-spaces reduces resource efficiency.

Oh, we do not really need to split the storage physically; it can be just a built-in key prefix in the internal storage, like the bucket in boltdb: https://github.com/boltdb/bolt#using-buckets
Operational functionality like quotas does not need to be built in the 1st phase, but a design with namespaces built in would make life easier when multi-tenant workloads get into place.
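A minimal sketch of that bucket-style idea, assuming a namespace is only an internal key prefix over one shared physical store (illustrative only; OpenCache has no namespaces yet):

```rust
use std::collections::HashMap;

/// One physical store shared by all tenants; a namespace is only a key
/// prefix, like a boltdb bucket.
struct Store {
    data: HashMap<String, Vec<u8>>, // stand-in for the real cache storage
}

impl Store {
    fn put(&mut self, ns: &str, key: &str, value: Vec<u8>) {
        // "tenant_a/block/1" and "tenant_b/block/1" never collide,
        // but all tenants still share the same space and eviction pressure.
        self.data.insert(format!("{ns}/{key}"), value);
    }

    fn get(&self, ns: &str, key: &str) -> Option<&[u8]> {
        self.data.get(&format!("{ns}/{key}")).map(|v| v.as_slice())
    }
}
```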
> Splitting the cache server disk space into several sub-spaces reduces resource efficiency.
>
> Oh, we do not really need to split the storage physically; it can be just a built-in key prefix in the internal storage, like the bucket in boltdb: https://github.com/boltdb/bolt#using-buckets

It's different from a db such as bolt: the access load pattern affects the internal data layout. A tenant who reads a lot from the cache server will aggressively evict every piece of other users' data.
It's not difficult to have a namespace inside the cache server, but it cannot be used for throttling without physical isolation.
Is our cache service planned to have a proxy layer like twemproxy?
If it is, I guess we can do something in the proxy layer to take some control; it'd be easy to rate-limit or restrict memory size there.
If not, it's hard to maintain the namespace concept on the server side IMHO; there'd be no central "server side", just many peers in a distributed manner q.q
IMHO a proxy layer does have its complexities (especially in HA) and need not be top-priority work.
Another way we may consider is a sidecar proxy to coordinate multi-tenant usage. It works more like a client: it encapsulates the complex parts like distributed coordination & billing logic, offering just a simple GET/SET rpc interface.
Maybe like this: https://github.com/facebook/mcrouter . I remember it's deployed on each host machine to forward mc requests to a big mc pool, and it has some additional features like failover, local cache, stats, etc.
@zhihanz @hantmac what's your opinion about this? q.q
> Is our cache service planned to have a proxy layer like twemproxy?

Hmm... no such plan yet. For now, the proxy job is done by opendal.
I agree that namespaces, proxy, quota, and audit are all useful features. But considering our cache service only has markdown docs now, I prefer to start as simple as possible.
Let's implement the cache service and databend's cache mechanism first. After that, we can have a more profound and solid code base for improvement.
What do you think? @flaneur2020 @drmingdrmer
Not used so far. Feel free to re-open it when needed.