Reading Redis sets as a key-value pair RDD
Hi,
I was trying to read multiple sets from Redis. According to the documentation, this is done with val setRDD = sc.fromRedisSet("keyPattern*"), but this flattens out all the sets and returns RDD[String]. Is there a way to keep each set's data separate, i.e. to get RDD[(String, Array[String])], where the first element of the tuple is the key and the second is an array containing the set's members?
For example, if I have the following two sets in Redis:
key1 -> Set(v1,v2,v3)
key2 -> Set(v4,v5)
calling sc.fromRedisSet("key*") returns RDD[String] = [v1,v2,v3,v4,v5]. Is there a way to get RDD[(String,Array[String])] = [ (key1, Array(v1,v2,v3)) , (key2, Array(v4,v5)) ]?
Hi @yougaindra, there is no such API, but you can implement it yourself. Here is a similar example: https://github.com/RedisLabs/spark-redis/issues/214
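For illustration, a minimal sketch of that idea: distribute the matching keys across partitions with fromRedisKeyPattern, then fetch each set's members with one Jedis connection per partition. The localhost:6379 address is a placeholder, and this is a sketch of the pattern, not the exact code from #214.

import com.redislabs.provider.redis._
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

val pairsRDD = sc.fromRedisKeyPattern("key*").mapPartitions { keys =>
  // One connection per partition instead of one Spark job per key.
  val jedis = new Jedis("localhost", 6379) // placeholder host/port
  try {
    // Materialize the results before the connection is closed.
    keys.map(k => (k, jedis.smembers(k).asScala.toArray)).toList.iterator
  } finally {
    jedis.close()
  }
}
// pairsRDD: RDD[(String, Array[String])]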
Hi @fe2s, I tried the following:
val keysRDD = sc.fromRedisKeyPattern(keyPattern)
val valsRDDArray = keysRDD.collect.map(k => sc.fromRedisSet(k).map(v => (k, v)))
val valsRDD = sc.union(valsRDDArray)
This returns RDD[(String,String)], where the first element of each tuple is the key and the second is a set member.
For example, if Redis contains key1 -> Set(v1,v2,v3), I get valsRDD = [(key1,v1), (key1,v2), (key1,v3)].
I can then use collect_list/groupByKey to get RDD[(String,Array[String])] as required.
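For reference, a sketch of that grouping step on the valsRDD built above (using plain RDD operations rather than the DataFrame collect_list):

// Group the (key, value) pairs by key and materialize each group.
val setsRDD = valsRDD.groupByKey().mapValues(_.toArray)
// setsRDD: RDD[(String, Array[String])]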
Which one do you think is more efficient, this approach or the one described in #214? And how big is the difference?
Hi @yougaindra, the approach described in #214 should be more efficient. The approach you mentioned loads all the keys onto the driver and then triggers a separate Spark job for each key, so it doesn't scale well.