Using SET for a large number of keys gets slower as we move from 100,000 to 1 million keys
I'm trying to run a simple test by putting 1 million key-value pairs into Redis. For 100,000 keys it is really fast. However, the performance degrades a lot when I bump the number of operations to 1 million. The maximum heap space is 12 GB and I'm running this on a MacBook Pro. As you can see, the network write rate drops significantly after some time. Not sure what's going on here. Any help would be really appreciated.
I'm using the following versions:
"com.etaty.rediscala" %% "rediscala" % "1.4.0"
scalaVersion := "2.11.4"
package redisbenchmark

import java.util.UUID

import akka.util.ByteString
import redis.RedisClient

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object RedisLocalPerf {

  def main(args: Array[String]): Unit = {
    implicit val akkaSystem = akka.actor.ActorSystem()

    var numberRuns = 1000 // default number of SET operations
    if (args.length == 1)
      numberRuns = Integer.parseInt(args(0))

    val s = """How to explain ZeroMQ? Some of us start by saying all the wonderful things it does. It's sockets on steroids. It's like mailboxes with routing. It's fast! Others try to share their moment of enlightenment, that zap-pow-kaboom satori paradigm-shift moment when it all became obvious. Things just become simpler. Complexity goes away. It opens the mind. Others try to explain by comparison. It's smaller, simpler, but still looks familiar. Personally, I like to remember why we made ZeroMQ at all, because that's most likely where you, the reader, still are today.How to explain ZeroMQ? Some of us start by saying all the wonderful things it does. It's sockets on steroids. It's like mailboxes with routing. It's fast! Others try to share their moment of enlightenment, that zap-pow-kaboom satori paradigm-shift moment when it all became obvious. Things just become simpler. Complexity goes away. It opens the mind. Others try to explain by comparison. It's smaller, simpler, but still looks familiar. Personally, I like to remember why we made ZeroMQ at all, because that's most likely where"""
    val msgSize = s.getBytes.length

    val redis = RedisClient()
    implicit val ec = redis.executionContext

    val futurePong = redis.ping()
    println("Ping sent!")
    futurePong.map(pong => println(s"Redis replied with a $pong"))

    val random = UUID.randomUUID().toString
    val start = System.currentTimeMillis()
    val result: Seq[Future[Boolean]] = for (i <- 1 to numberRuns) yield {
      redis.set(random + i.toString, ByteString(s))
    }
    // Wait for every SET to complete before taking the end timestamp;
    // without this, only the creation of the futures is measured and the
    // actor system may shut down before the writes finish.
    Await.result(Future.sequence(result.toList), Duration.Inf)
    val end = System.currentTimeMillis()
    val diff = end - start
    println(s"for msgSize $msgSize and numOfRuns [$numberRuns] time is $diff ms")

    akkaSystem.shutdown()
  }
}

Yes, I noticed it during my tests; at some point the scaling is exponential (bad). I suspected the thread scheduler to be the limitation, or the way Future.sequence works.
If you can isolate a test that scales linearly up to 1M futures, I would be interested to see it.
By replacing akka-io with another java.nio library (xnio) I was able to get past 1M requests (at a speed of around 500k req/s).
@etaty thanks for your response. I'm trying to isolate this now. Can you tell me how you replaced akka-io with xnio? I've also asked for help from the Akka folks. I'll keep you posted.
Well, it was an experiment, not something ready for production and integrated with rediscala; it was also a year ago.
@etaty - the Akka team thinks it's the lack of backpressure in the Akka client. That is, the client is being flooded with so many requests that it cannot handle them. How many RedisClients can one really create? I was thinking about creating 10 clients and giving 100K operations to each of them. What do you think of this setup?
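For what it's worth, the split I have in mind is just a round-robin partition of the key indices over a client pool; here is a plain-Scala sketch (ClientPartitioning is a made-up name, and in the real run each partition would be handed to its own RedisClient):

```scala
// Sketch: split numberRuns operations across numClients partitions,
// round-robin by key index. In the real benchmark each partition would
// be driven by its own rediscala RedisClient; the partitioning itself
// is plain Scala with no dependencies.
object ClientPartitioning {
  def partition(numberRuns: Int, numClients: Int): Map[Int, Seq[Int]] =
    (1 to numberRuns).groupBy(i => i % numClients)

  def main(args: Array[String]): Unit = {
    val parts = partition(1000000, 10)
    println(parts.size)                     // 10 partitions
    println(parts.values.map(_.size).toSet) // each holds 100000 operations
  }
}
```

Each partition would then be fed to its client independently, so no single connection actor sees all 1M writes at once.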
Yes, that is much fairer. Do you have a link to the conversation with the Akka team? (Is it public?)
Sure. https://groups.google.com/forum/#!topic/akka-user/NrSkEwMrS3s
I think this use case, even though it's not that common, needs to be addressed one way or another. Hopefully we can find a good resolution to this issue.
Thanks for responding quickly.
@etaty - I've put an update to this on the Akka thread in case you want to have a look at it. https://groups.google.com/forum/#!topic/akka-user/NrSkEwMrS3s
You could use a "DummyRedisClient": just send a Promise to an actor, and have the actor complete the Promise. You could use that as a baseline to compare runs with and without an akka-stream in front of the actor. Also look at the GC.
Another piece to look at is the akka-io buffers and the OS X buffers, because once they are all full they might start to be slow.
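A dependency-free sketch of what I mean (no actor here, the Promise is completed inline, but it gives you a baseline with no Redis and no TCP in the loop; DummyRedisClient and DummyBench are made-up names):

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration.Duration

// set() never touches the network; it creates a Promise and a stand-in
// completer fulfils it immediately. This isolates the pure Future and
// scheduler overhead from Redis, akka-io, and the kernel buffers.
class DummyRedisClient {
  def set(key: String, value: String): Future[Boolean] = {
    val p = Promise[Boolean]()
    p.success(true) // the "actor" completing the promise
    p.future
  }
}

object DummyBench {
  def main(args: Array[String]): Unit = {
    implicit val ec = scala.concurrent.ExecutionContext.global
    val dummy = new DummyRedisClient
    val start = System.currentTimeMillis()
    val futures = (1 to 100000).map(i => dummy.set("key" + i, "value")).toList
    Await.result(Future.sequence(futures), Duration.Inf)
    println(s"100000 dummy sets in ${System.currentTimeMillis() - start} ms")
  }
}
```

If this scales linearly to 1M while the real client does not, the bottleneck is below the Future layer.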
@etaty thanks for getting back to me.
How is sending to the DummyRedisClient different from sending to the RedisClient? The GC pressure is high (at least for the young generation (Eden space)). I gave the process a maximum of 12 GB of heap space but it didn't make a big difference.
As I've mentioned in the Akka thread, when I bump the number of RedisClients to 100 and give each of these actors 10K values to put in Redis, it is significantly and noticeably faster. It is not clear to me exactly why.
Looks like the akka-io (TCP) actor doesn't have an internal buffer. Please see http://doc.akka.io/docs/akka/snapshot/scala/io-tcp.html "The basic model of the TCP connection actor is that it has no internal buffering (i.e. it can only process one write at a time, meaning it can buffer one write until it has been passed on to the O/S kernel in full). Congestion needs to be handled at the user level, for both writes and reads."
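So the buffering has to live on our side. A hypothetical sketch of that user-level model (AckingWriter and send are made-up names; send stands in for issuing a Tcp.Write to the connection actor, and onAck for receiving its ack message):

```scala
import scala.collection.mutable

// One write in flight at a time, exactly as the akka-io docs describe:
// further writes are queued at the user level until the previous write
// has been acknowledged.
class AckingWriter(send: String => Unit) {
  private val pending = mutable.Queue.empty[String]
  private var awaitingAck = false

  def write(data: String): Unit =
    if (awaitingAck) pending.enqueue(data)  // buffer at the user level
    else { awaitingAck = true; send(data) } // hand one write to the "kernel"

  def onAck(): Unit =
    if (pending.nonEmpty) send(pending.dequeue()) // next write, still awaiting ack
    else awaitingAck = false
}
```

With 1M unacked writes, that pending queue (or its equivalent) is where all the memory and GC pressure would pile up.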
I'm going to look at OSX buffers now.
If you look at the network I/O graph, the write rate to the TCP socket goes up until the first read on the socket; as soon as the first read happens, the write speed drops drastically.

I did post an update on the Akka thread. It looks like I was not applying the back pressure correctly; now I get much better performance compared to my initial version. I need to tweak it a little bit more, and I'll measure it and report back once I have a better version.
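Roughly what I mean by applying back pressure: cap the number of in-flight futures by issuing the operations in fixed-size batches and waiting for each batch before starting the next. A sketch under that assumption (BoundedInFlight is a made-up name; op stands in for redis.set):

```scala
import scala.concurrent.{ExecutionContext, Future}

// Instead of creating 1M futures at once, run the operations in batches:
// at most batchSize operations are in flight at any moment, and the next
// batch only starts when the previous one has fully completed.
object BoundedInFlight {
  def runBatched[A](items: Seq[A], batchSize: Int)(op: A => Future[Boolean])
                   (implicit ec: ExecutionContext): Future[Int] =
    items.grouped(batchSize).foldLeft(Future.successful(0)) { (acc, batch) =>
      acc.flatMap { done =>
        Future.sequence(batch.map(op)).map(done + _.size) // count completed ops
      }
    }
}
```

It trades some concurrency for a bounded queue, which is exactly the congestion handling the akka-io docs push to the user level.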
Here is how the read/writes look like after applying back pressure.

Hey, thanks for your work. Could you publish the code in a gist (or a repo if you prefer), so others can help you? In the first graph, the write rate of 24 MB/s is "normal" because rediscala is designed to batch writes between two socket writes. But after that it is strange. Maybe you can chart CPU, memory, and GC in parallel. Are you using VisualVM?
@etaty - thank you for responding. I like Rediscala and would like to figure out a way around this limitation.
I've created a gist here: https://gist.github.com/soumyasd/ac5b1d5f2ec3af21ace8
I've also updated some screenshots and analysis on the Akka thread in case you want to take a look at that.
I use YourKit for profiling my JVM.



@etaty - is the TCP NO_DELAY setting for Rediscala configurable? I think it's waiting for each Future to complete before sending the next one.
@etaty - is there a way to batch multiple Redis operations using the RedisClient?
Here is another chart for 1 million messages of 10K each. Support for a batch mode would greatly increase the throughput and also reduce the latency. Let me know what you think.

You can do a batch with a Transaction message: https://github.com/etaty/rediscala/blob/ab29ee24b024b8dced58f618d6a32bb91ba91bf2/src/main/scala/redis/actors/RedisClientActor.scala#L28 A transaction contains a sequence of Operations: https://github.com/etaty/rediscala/blob/ab29ee24b024b8dced58f618d6a32bb91ba91bf2/src/main/scala/redis/Operation.scala#L31 It is used to emulate Redis transactions.
@etaty Thanks. I tried transactions (batching) and it looks like I can get better throughput, of course at the cost of latency.
Here is the code https://gist.github.com/soumyasd/ac5b1d5f2ec3af21ace8#file-redisstreamclient-scala-L52-L60

I think a flag in Rediscala to control TCP NO_DELAY would be useful.
If you can think of any other optimization please let me know.
Also, is there support for the Redis protocol over UDP?
I am using the Nagle algorithm in rediscala (so I wait for the ack of a socket write before sending the next one). It limits the number of messages flowing into the akka-io worker (and so the number of system calls); that's why rediscala is much faster. http://en.wikipedia.org/wiki/Nagle's_algorithm It might also be the cause of the "halting" at 1M, because the worker receives one message with a huge number of operations.
No, Redis works only with TCP.
Okay. So there is no more optimization you can think of?
What message sizes did you use in the following: "By replacing akka-io with another java.nio library (xnio) I was able to pass the 1M req (at the speed of around 500k req/s)"?
Just GET or PING (small messages).