
Reading redis sets as Key Value Pair RDD

yougaindra opened this issue on Aug 10, 2020 · 3 comments

Hi,

I was trying to read multiple sets from Redis. According to the documentation this is done with val setRDD = sc.fromRedisSet("keyPattern*"), but this flattens all sets together and returns an RDD[String]. Is there a way to keep each set's data separate, i.e. get an RDD[(String, Array[String])] where the first value of the tuple is the key and the second is the array containing the set's elements?

For example, if I have the following two sets in Redis:

   key1 -> Set(v1,v2,v3)
   key2 -> Set(v4,v5)

Calling sc.fromRedisSet("key*") returns RDD[String] = [v1,v2,v3,v4,v5]. Is there a way to get RDD[(String, Array[String])] = [ (key1, Array(v1,v2,v3)), (key2, Array(v4,v5)) ]?
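
For reference, a minimal sketch of the flattening behaviour described above (assuming a SparkContext sc and the spark-redis implicits in scope; the key names are illustrative):

   import com.redislabs.provider.redis._
   import org.apache.spark.rdd.RDD

   // Matches both key1 and key2 and unions their members into a single RDD;
   // the association between a member and its originating set key is lost.
   val flat: RDD[String] = sc.fromRedisSet("key*")
   flat.collect()   // e.g. Array("v1", "v2", "v3", "v4", "v5")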

— yougaindra, Aug 10 '20 11:08

Hi @yougaindra, there is no such API, but you can implement it yourself. Here is a similar example: https://github.com/RedisLabs/spark-redis/issues/214

— fe2s, Aug 12 '20 09:08

Hi @fe2s, I tried the following:

   val keysRDD = sc.fromRedisKeyPattern(keyPattern)
   val valsRDDArray = keysRDD.collect.map(k => sc.fromRedisSet(k).map(v => (k, v)))
   val valsRDD = sc.union(valsRDDArray)

This returns RDD[(String, String)], where the first element of the tuple is the key and the second is the value. For example, if Redis contains key1 -> v1,v2,v3 I get valsRDD = [(key1,v1), (key1,v2), (key1,v3)].

I can then use collect_list / groupBy to get the RDD[(String, Array[String])] as required.
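
For completeness, a minimal sketch of that grouping step on the RDD API (assuming valsRDD from the snippet above; groupByKey is used here in place of the SQL collect_list):

   // Group the (key, member) pairs by key and materialise each group as an array.
   val setsRDD: RDD[(String, Array[String])] =
     valsRDD.groupByKey().mapValues(_.toArray)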

Which one do you think is more efficient, this approach or the one described in #214? And how big is the difference?

— yougaindra, Aug 12 '20 11:08

Hi @yougaindra, the approach described in #214 should be more efficient. The approach you mentioned loads all the keys on the driver and then triggers a separate Spark job for each key, so it doesn't scale well.
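
To illustrate the difference, here is a minimal sketch of a distributed variant that keeps the key-to-members association without collecting keys on the driver. It is not necessarily what #214 does exactly; it assumes a non-clustered Redis reachable at redisHost:redisPort and uses a plain Jedis connection per partition:

   import org.apache.spark.rdd.RDD
   import redis.clients.jedis.Jedis
   import scala.collection.JavaConverters._

   // Hypothetical connection settings; take these from your actual Redis config.
   val redisHost = "localhost"
   val redisPort = 6379

   val setsRDD: RDD[(String, Array[String])] =
     sc.fromRedisKeyPattern("key*")        // RDD[String] of matching key names
       .mapPartitions { keys =>
         // One connection per partition; SMEMBERS runs on the executors,
         // so the driver never collects keys or launches a job per key.
         val jedis = new Jedis(redisHost, redisPort)
         val pairs = keys.map(k => (k, jedis.smembers(k).asScala.toArray)).toList
         jedis.close()
         pairs.iterator
       }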

— fe2s, Aug 12 '20 13:08