predictionio
[PIO-138] Fix batchpredict for custom PersistentModel
Fixes PIO-138
Switches batch query processing from a Spark RDD to a Scala parallel collection. As a result, the `pio batchpredict` command changes in the following ways:
- The `--query-partitions` option is no longer available; parallelism is now managed by Scala's parallel collections.
- The `--input` option is now read as a plain, local file.
- The `--output` option is now written as a plain, local file.
- Because the input & output files are no longer parallelized through Spark, memory limits may require that large batch jobs be split into multiple command runs.
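The shape of the change can be sketched roughly as follows. This is an illustrative, minimal sketch, not the actual predictionio internals: `predict` and `runBatch` are hypothetical stand-ins for the engine's query-serving call and the batch loop.

```scala
// Sketch of batch prediction over a Scala parallel collection, assuming
// Scala 2.11/2.12 where `.par` is part of the standard library (on 2.13+
// it requires the separate scala-parallel-collections module).

// Hypothetical stand-in for the engine's query -> prediction call.
def predict(query: String): String = s"prediction for $query"

// Queries come from a plain, local file (one per line) and results go
// back to a plain, local file. Parallelism is managed by the parallel
// collection itself, fanning the work out across local cores, so there
// is no equivalent of the old --query-partitions flag.
def runBatch(queries: Vector[String]): Vector[String] =
  queries.par.map(predict).seq.toVector
```

Because a parallel collection is an ordinary in-process data structure rather than an RDD, it can safely invoke a `PersistentModel` that itself holds RDDs; the trade-off is that the whole batch must fit in one machine's memory.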
This solves the root problem: certain custom `PersistentModel`s, such as the ALS Recommendation template's, may themselves contain RDDs, which cannot be nested inside the batch-query RDD (see SPARK-5063).
I'm currently testing this change with various engines and large batches.
Tested this new `pio batchpredict` with all three model types:
- ✅ custom PersistentModel (ALS Recommendation)
- ✅ built-in, default model serialization (Classification)
- ✅ null model (Universal Recommender)
This PR is ready to go!
BTW, I found that performance for a large, 250K-query batch running on a single multi-core machine is equivalent to the previous Spark RDD-based implementation.
This PR stalled due to @dszeto's concerns about removing the distributed processing capability from `pio batchpredict`. I agree that distributed batch processing is optimal, but I do not have a solution for the nested-RDD problem encountered with RDD-based persistent models.