iis
Align all caching modules implemented in Spark to rely on DataFrames
Some of the caching solutions currently implemented in Spark, namely CachedWebCrawlerJob and PatentMetadataRetrieverJob, still rely on RDDs. They could instead take full advantage of Spark 2 DataFrames, as was done for TARA caching (CachedTaraReferenceExtractionJob).
It would also be nice to run some benchmarks comparing the RDD-based solution with the DataFrame-based one.