[Java] Cache RDDs/DataFrames using NamedRDD/NamedObject
I am writing a Spark job in Java. The runJob() method returns the expected output. Now I want to cache that output using NamedObjects. The same approach works fine in Scala, but in Java nothing is stored in the cache. This is the code I used in Java:
// Convert the JavaRDD to a Scala RDD, since NamedRDD wraps an RDD
RDD<Tuple2<String, Integer>> rddR = JavaRDD.toRDD(rdd);

// Generator that update() evaluates to produce the object to cache
Function0<NamedObject> func = new Function0<NamedObject>() {
    public NamedObject apply() {
        return new NamedRDD<Tuple2<String, Integer>>(
                rddR, true, StorageLevel.MEMORY_AND_DISK());
    }
};

// (this local NamedRDD is unused; update() only evaluates func above)
NamedRDD<Tuple2<String, Integer>> nRDD =
        new NamedRDD<Tuple2<String, Integer>>(rddR, true, StorageLevel.MEMORY_AND_DISK());

obj.namedObjects().update("rdd:wordcounts-result", func,
        obj.namedObjects().defaultTimeout(), rddPersister);
where obj is the job class instance and the persister is:

static NamedObjectPersister<NamedRDD<Tuple2<String, Integer>>> rddPersister =
        new RDDPersister<Tuple2<String, Integer>>();
It runs without any error, but when I then call get on the namedObjects, the result is empty:

obj.namedObjects().get("rdd:wordcounts-result", obj.namedObjects().defaultTimeout())
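For context, the behaviour I expect from the update()/get() pair can be sketched with a minimal, self-contained plain-Java cache. The SimpleNamedCache class below is hypothetical, not part of spark-jobserver; it only mirrors the generator-based shape of the API, using java.util.function.Supplier in place of Scala's Function0:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for spark-jobserver's NamedObjects: update()
// evaluates a generator function and stores the result under a name.
final class SimpleNamedCache {
    private final Map<String, Object> store = new ConcurrentHashMap<>();

    // Evaluate the generator and cache the produced object under the name.
    <T> T update(String name, Supplier<T> objGen) {
        T obj = objGen.get();
        store.put(name, obj);
        return obj;
    }

    // Look up a previously cached object; empty if update() never stored one.
    @SuppressWarnings("unchecked")
    <T> Optional<T> get(String name) {
        return Optional.ofNullable((T) store.get(name));
    }
}

public class CacheDemo {
    public static void main(String[] args) {
        SimpleNamedCache cache = new SimpleNamedCache();
        // The object is present afterwards only if the generator was evaluated.
        cache.update("rdd:wordcounts-result", () -> "fake-rdd-handle");
        System.out.println(
                cache.<String>get("rdd:wordcounts-result").orElse("MISSING"));
        // prints "fake-rdd-handle"
    }
}
```

If the real update() call succeeded, a subsequent get() under the same name should behave like this sketch and return the cached object rather than an empty result.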
What could be the issue, and what is the correct way to cache an RDD from Java?
Any help is appreciated. Thanks!
Hi Nishu,
I am also trying to write Java applications where RDDs can be shared across the same context. I didn't find any suitable documentation or API for using NamedObjects. Were you able to implement this?
Thanks, Vishal
@hntd187 is this fixed in master by any chance?
@nishutayal @gavishal @hntd187 Has the problem been solved?