Align all caching modules implemented in Spark to rely on DataFrames

Open marekhorst opened this issue 5 years ago • 1 comment

Some of the caching solutions currently implemented in Spark, namely CachedWebCrawlerJob and PatentMetadataRetrieverJob, still rely on RDDs. They could instead take advantage of the full potential of Spark 2 DataFrames, as was done for TARA caching (CachedTaraReferenceExtractionJob).

Aug 11 '20 10:08 marekhorst

It would be nice to run some benchmarks comparing the RDD-based solution with the DataFrame-based one.
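
A rough sketch of how such a comparison could be set up is given below. It is only a generic wall-clock timing harness, not something taken from the project: the names time_runs and compare are made up for illustration, and rdd_based_job and dataframe_based_job are hypothetical placeholders that would have to be replaced with code that actually launches the two variants of a caching job on the same input and cache.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(job: Callable[[], object], repeats: int = 5, warmup: int = 1) -> List[float]:
    """Run `job` `warmup` times unmeasured, then return the wall-clock
    duration in seconds of each of `repeats` measured runs."""
    for _ in range(warmup):
        job()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return durations


def compare(jobs: Dict[str, Callable[[], object]], repeats: int = 5,
            warmup: int = 1) -> Dict[str, Dict[str, float]]:
    """Time every named job and summarise its durations."""
    summary = {}
    for name, job in jobs.items():
        durations = time_runs(job, repeats=repeats, warmup=warmup)
        summary[name] = {
            "min": min(durations),
            "median": statistics.median(durations),
            "max": max(durations),
        }
    return summary


# Hypothetical placeholders: replace the bodies with code that runs the
# RDD-based and the DataFrame-based variant of the same caching job.
def rdd_based_job() -> int:
    return sum(i * i for i in range(10_000))


def dataframe_based_job() -> int:
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    results = compare({"rdd": rdd_based_job, "dataframe": dataframe_based_job})
    for name, stats in results.items():
        print(f"{name}: median={stats['median']:.6f}s "
              f"min={stats['min']:.6f}s max={stats['max']:.6f}s")
```

For the numbers to be comparable, both variants should probably be run against identical input and an identical cache state, and the median is likely more informative than a single run since cluster timings tend to be noisy.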

Aug 11 '20 10:08 marekhorst