predictionio
[PIO-138] Fix batchpredict for custom PersistentModel
Fixes PIO-138
Switches batch query processing from a Spark RDD to a Scala parallel collection. As a result, the `pio batchpredict` command changes in the following ways:
- The `--query-partitions` option is no longer available; parallelism is now managed by Scala's parallel collections.
- The `--input` option is now read as a plain, local file.
- The `--output` option is now written as a plain, local file.
- Because the input & output files are no longer parallelized through Spark, memory limits may require that large batch jobs be split into multiple command runs.
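The shape of the change can be sketched roughly as follows. This is an illustrative, minimal sketch, not the actual predictionio internals: `predict` and `runBatch` are hypothetical stand-ins for the engine's query-serving call and the batch loop.

```scala
// Sketch of batch prediction over a Scala parallel collection, assuming
// Scala 2.11/2.12 where `.par` is part of the standard library (on 2.13+
// it requires the separate scala-parallel-collections module).

// Hypothetical stand-in for the engine's query -> prediction call.
def predict(query: String): String = s"prediction for $query"

// Queries come from a plain, local file (one per line) and results go
// back to a plain, local file. Parallelism is managed by the parallel
// collection itself, fanning the work out across local cores, so there
// is no equivalent of the old --query-partitions flag.
def runBatch(queries: Vector[String]): Vector[String] =
  queries.par.map(predict).seq.toVector
```

Because a parallel collection is an ordinary in-process data structure rather than an RDD, it can safely invoke a `PersistentModel` that itself holds RDDs; the trade-off is that the whole batch must fit in one machine's memory.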
This solves the root problem: certain custom `PersistentModel`s, such as the ALS Recommendation template's, may themselves contain RDDs, which cannot be nested inside the batch-query RDD (see SPARK-5063).
I'm currently testing this change with various engines and large batches.
Tested this new `pio batchpredict` with all three model types:
- ✅ custom PersistentModel (ALS Recommendation)
- ✅ built-in, default model serialization (Classification)
- ✅ null model (Universal Recommender)
This PR is ready to go!
BTW, I found that performance for a large, 250K-query batch running on a single multi-core machine is equivalent to the previous Spark RDD-based implementation.
This PR stalled due to @dszeto's concerns about removing the distributed processing capability from `pio batchpredict`. I agree that distributed batch processing is optimal, but I do not have a solution for the nested-RDD problem encountered with RDD-based persistent models.