
Add support for high level cluster commands

Open marcosnils opened this issue 9 years ago • 16 comments

Beyond the current redis-cli-like feel in Java, it would be great to have a higher-level way of treating the cluster as one distributed Redis instance, allowing:

  • The possibility to send commands to all cluster nodes in parallel, collecting and returning results (or collecting and throwing errors) on a per-node basis.
  • The possibility to send commands to a subset of cluster nodes in parallel, with the same per-node handling of results and errors.
  • High-level data structures as the return type of cluster operations, e.g. the same data structure for both cluster nodes (currently one big string) and cluster slaves (a List of Strings).

marcosnils avatar May 17 '15 18:05 marcosnils

I wholeheartedly agree. We need to come up with a uniform way to issue commands to clusters that currently only work against a single instance. For example, let's take PING.

What does a PING to a cluster mean?

  • return a single true response if all nodes of the cluster are OK, and false otherwise.
  • return a map of responses from each node, e.g. Map<HostAndPort, Boolean>, for masters only.
  • return the above map response but include the slaves as well.
  • it doesn't make any sense; leave it alone.

Another command for discussion to get things going is keys, specifically keys *. My thoughts are that this should return a Map<HostAndPort, List<String>> of the keys on each of the masters.

Given the above philosophy so to speak, we also need to consider directed commands. E.g., I only want the keys on this master (or this slave).

All this will require interface changes, not something to address lightly.

Much more discussion needed.

allanwax avatar May 18 '15 17:05 allanwax

Just one concern: should we aggregate results, given that we're sending a request to each Redis server? Ideally Jedis shouldn't know the meaning of the return values, so we shouldn't aggregate (some kind of reduce) them; just adding everything to some kind of collection, as @allanwax stated, is fine. Map<ClusterNode or HostAndPort, T> could be sufficient.

HeartSaVioR avatar May 18 '15 21:05 HeartSaVioR

Also, by not aggregating, the requests to the individual servers can be done in parallel more easily (no intermediates). The output map will probably have to be a ConcurrentHashMap since multiple threads would be writing multiple keys to it at the same time.
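As an illustration, a minimal sketch of this per-node fan-out. Plain "host:port" strings stand in for Jedis' HostAndPort, and the command is modeled as a Function so the sketch runs without a live cluster; a real implementation would call into a Jedis connection instead.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class ClusterFanOut {
    // Run the same command against every node in parallel; no aggregation,
    // just one entry per node in the result map.
    static <T> Map<String, T> onAllNodes(Collection<String> nodes,
                                         Function<String, T> command) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodes.size()));
        // ConcurrentHashMap because each worker thread writes its own key.
        Map<String, T> results = new ConcurrentHashMap<>();
        for (String node : nodes) {
            pool.execute(() -> results.put(node, command.apply(node)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}
```

In real use the lambda would wrap a per-node connection, e.g. node -> jedisFor(node).ping(), where jedisFor is a hypothetical connection lookup.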

allanwax avatar May 18 '15 22:05 allanwax

What happens with eval?

allanwax avatar May 18 '15 22:05 allanwax

I wonder which commands we would actually want to run on the whole cluster. We need a few real-world use cases. Running KEYS on the entire cluster doesn't seem realistic to me, and I'm not sure about PING. Is there any other use case in mind?


xetorthio avatar May 18 '15 22:05 xetorthio

ping is used as a simple example that still involves a lot of thinking about how things should work in the future. If we can agree on ping (what I referred to as the philosophy), the rest will probably be easy.

Our real-world usage of keys does not actually enumerate all keys in the cluster, but enumerates several hundred or thousand keys with something like keys advertiser123:*.

Another command worth looking into is scan, which is the right way to do keys in a real-world environment.

What about other commands, for example blpop, a very useful one? Its arguments are a list of keys (for individual lists), and those keys can live on multiple masters in a cluster. I think this is a case where we still return a single value, but how do we coordinate 'unblocking' once we have retrieved a result? Something to think about.

bgrewriteaof: does this mean issuing a bgrewriteaof on the masters only, or on the masters and slaves? This is a command we use at least daily on each of our single Redis instances. What does it mean to issue a single bgrewriteaof to a cluster?

If you look at the list of Redis commands, some of which are used frequently, and look for those that take multiple keys, you'll find enough examples to think about.

allanwax avatar May 18 '15 23:05 allanwax

Hello, these questions don't only apply to Jedis, but more broadly to the Redis Cluster client ecosystem. I had the chance to think about these issues in the past, even if I didn't implement any of the advanced features in the reference redis-rb-cluster client. The reason is that this client had to be the simplest possible, in order to provide a reference implementation for the cluster redirection semantics. However, I don't believe clients should be that simple if they are to serve Redis Cluster well.

So I think clients basically need five levels of APIs.

  1. The Redis Cluster API, where commands are executed via redirections. This is the part that Redis Cluster and the client itself abstract away from the user in a natural way: GET foo will (eventually) be sent automatically to the right instance, and the reply will be returned to the client. This is possible because only a single node is authoritative for the key foo. This is the part that every Redis Cluster client must implement.
  2. An API that lets the user access each cluster node in a non-abstract way, providing just a plain old Redis connection to each node in order to run commands. Something like: redis.nodes.masters[0].ping. With this API it should be easy to retrieve the list of node objects, possibly filtering for masters and slaves, and to operate on those nodes independently.
  3. However, the same node-level API should also let the user run a command against all the instances at once and return an array of replies or failure codes, so that the user is not forced to iterate every time: redis.masters.ping() or something like that. In this context each reply should probably also carry a reference to the node object, so that the user can easily match a given node with a given reply.
  4. An API that runs commands which usually have a specific meaning only in the context of a single instance, in a way that makes more sense for the whole cluster. PING is a trivial example but is useful for the discussion: what should redis.cluster.ping() do in this context? For example, return a Hash populated with a status code (true if every node replied positively to PING, false otherwise), the number of reachable nodes, and the min and max RTT, or something like that. This is just an example; every command here will have to be designed. A clearer example is PUBLISH, which would be delivered to the first master able to reply correctly. There is also a class of commands at this API level that don't exist at all in the real Redis API but are useful, like redis.cluster.used_memory(), which would return the sum of the memory used by each master, and so forth.
  5. Finally, and probably the most important after 1, is the ability to run multiple unrelated commands in parallel, cluster-wide. For example, I may want to run multiple GET and SET commands at once, and since they will likely not all target the same node, I want the client to send all the queries in parallel and wait for the replies. When all the replies are received (or the timeout is reached), the result is returned to the user. In this setup I likely pay a single RTT to run multiple commands in parallel. This is where we use Redis Cluster to improve on the theoretical performance of a single instance under high load.
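To make point 4 concrete, here is a hedged sketch of the aggregate that a cluster-wide ping could return. The PingSummary shape and the convention that a negative RTT marks an unreachable node are illustrative assumptions, not an existing API; a real client would collect the per-node RTTs itself.

```java
import java.util.Map;

final class ClusterPing {
    // Aggregate view of a cluster-wide ping: overall status, how many nodes
    // answered, and the spread of round-trip times.
    record PingSummary(boolean ok, int reachable, long minRttMs, long maxRttMs) {}

    static PingSummary summarize(Map<String, Long> rttMsByNode) {
        int reachable = 0;
        long min = Long.MAX_VALUE, max = 0;
        for (long rtt : rttMsByNode.values()) {
            if (rtt < 0) continue;          // assumed convention: negative RTT = unreachable
            reachable++;
            min = Math.min(min, rtt);
            max = Math.max(max, rtt);
        }
        if (reachable == 0) min = 0;
        // ok only when every node in the map answered.
        boolean ok = !rttMsByNode.isEmpty() && reachable == rttMsByNode.size();
        return new PingSummary(ok, reachable, min, max);
    }
}
```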

A few remarks:

  • There are features that are not clearly about point 4 or 5, like redis.cluster.multi_get(), which may implement, in a simpler way, a command that does not exist (its semantics differ from MGET) by fetching multiple keys from different nodes if needed. So it's feature 5 exposed as a single cluster command as in 4.
  • The API at 5 can be implemented without resorting to threads or multiplexing! By internally splitting the API for sending commands from the API for receiving replies, and putting the sockets in non-blocking mode when this feature is needed, we send the query to all the nodes and later enter a loop to read the replies. Since the RTT will likely be similar for all the nodes, we get similar results to multiplexing or threading, without the complexity.
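A toy illustration of that send-then-read split. The Connection class below is a stand-in for a per-node non-blocking socket (it just echoes queued commands), so only the two-phase shape, fire all queries first, then drain replies, is the real point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class PipelinedFanOut {
    // Stand-in for a non-blocking connection to one node: send() queues a
    // request without waiting, receive() reads one reply (here, an echo).
    static final class Connection {
        private final Deque<String> pending = new ArrayDeque<>();
        void send(String command) { pending.add(command); }
        String receive() { return "OK:" + pending.remove(); }
    }

    static List<String> runOnAll(List<Connection> nodes, String command) {
        // Phase 1: fire the query at every node without blocking on replies.
        for (Connection c : nodes) c.send(command);
        // Phase 2: a single loop drains the replies, paying roughly one RTT.
        List<String> replies = new ArrayList<>();
        for (Connection c : nodes) replies.add(c.receive());
        return replies;
    }
}
```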

I hope this helps even if I understand I'm not able to provide true solutions but mostly just more questions :-) Cheers.

antirez avatar May 19 '15 08:05 antirez

Hi *, I agree on the topic of multi-level APIs. A brief look at other clustered systems reveals different approaches: some clients hide the clustering aspect entirely, while others let the user decide how to deal with certain concerns like read preference or talking to a particular node.

I'd like to qualify 1 and 2 a bit.

  1. You're right about the authoritative node. However, there are certain scenarios, like geographical distribution or redundancy, in which one wants to query the nearest node or prefers to read from slaves. This obviously touches command routing, but sits somewhere between 2 and 3: the client selects the node for routing not just based on key authority but also on other parameters.
  2. I would talk about a set of nodes, not limited to slaves or masters. One should be able to use selectors to obtain the set of nodes that should be included in the invocation.

There is a bunch of commands that operate on multiple keys. While this belongs to basic routing, it can also mean pipelining within the client to abstract routing away from the user. MSET and MGET clearly belong to this category. Cluster clients should be able to issue these commands to different nodes and present the results to the user in a reasonable way. This plays nicely with 5 but can also be seen as part of 3.
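For example, routing a multi-key command starts by grouping the keys by hash slot (CRC16-XMODEM mod 16384, honoring {hash tags}, as defined in the Redis Cluster specification), so each group can be sent to the node owning that slot. The grouping helper below is an illustrative sketch, not Jedis API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SlotRouter {
    // CRC16-XMODEM (poly 0x1021, init 0), the variant the cluster spec mandates.
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Hash only the {tag} when a non-empty hash tag is present, per the spec.
    static int keySlot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    // Group an MGET-style key list so each slot's keys can go to its owner node.
    static Map<Integer, List<String>> groupBySlot(List<String> keys) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (String key : keys) {
            groups.computeIfAbsent(keySlot(key), slot -> new ArrayList<>()).add(key);
        }
        return groups;
    }
}
```

Keys sharing a hash tag, e.g. {user}:a and {user}:b, land in the same slot, which is exactly what makes a partial MSET/MGET per node possible.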

mp911de avatar May 21 '15 19:05 mp911de

@HeartSaVioR
How should fuzzy (pattern-based) key search be handled in jedis-cluster?
Or will Jedis support this feature in the next version?

jeromeheng avatar May 27 '15 04:05 jeromeheng

I created a draft of an advanced cluster Java API to support your discussion. This draft does not contain all mentioned client levels but can be a start towards a high-level cluster API. You can find the docs at http://redis.paluch.biz/docs/api/snapshots/4.0-SNAPSHOT/com/lambdaworks/redis/cluster/RedisAdvancedClusterConnection.html

mp911de avatar May 29 '15 18:05 mp911de

Thanks for sharing this @mp911de! I think it is very interesting. We've been discussing similar things for Jedis. @antirez, implementing 5 using sockets in non-blocking mode and looping to read responses would mean we need to keep track of the number of responses we expect from every node (to know when to exit the loop). If that is true, then when sending a command to a node (even with a "smart client") we might receive a -MOVED. We'll need to do the standard updates to the slot tables and resend the command to the new node. So we need to store all the commands we sent in order to be able to resend them.

xetorthio avatar Jun 01 '15 11:06 xetorthio

FWIW, I've tried to implement a small set of high-level functions within Spring Data Redis (see JedisClusterConnection.java) based on a Jedis 3.0 SNAPSHOT.

Commands to multiple nodes are executed in parallel and return a Map<Node, ?> result (JedisClusterConnection.java#L2163), or aggregates of the results where that makes sense. E.g. by default keys() will issue KEYS * on all master nodes and aggregate the result (JedisClusterConnection.java#L168), while keys(node) will just return the keys for a given node (JedisClusterConnection.java#L188).

Would it make sense, as a starting point, to have parallel execution with a Map-based per-node result, support for the most obvious commands like keys, dbsize, etc., and then move on to more specific features like some kind of ReadPreference, e.g. SLAVE_PREFERRED, indicating that read commands should be issued against slave nodes? This would also allow switching later to a more sophisticated mechanism such as NEAREST, as @mp911de suggested.

christophstrobl avatar Jun 02 '15 09:06 christophstrobl

Is there an ETA for this ticket? It has been open since May 2015.

JonyD avatar Sep 15 '17 14:09 JonyD