
[PIO-138] Fix batchpredict for custom PersistentModel

Open · mars opened this pull request · 4 comments

Fixes PIO-138

Switches batch query processing from a Spark RDD to a Scala parallel collection (see the sketch after the list below). As a result, the pio batchpredict command changes in the following ways:

  • --query-partitions option is no longer available; parallelism is now managed by Scala's parallel collections
  • --input option is now read as a plain, local file
  • --output option is now written as a plain, local file
  • because the input and output files are no longer distributed through Spark, memory limits may require splitting large batch jobs into multiple command runs.
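To illustrate the shape of the change, here is a minimal sketch (not the PR's actual code) of reading queries from a plain local file, fanning them out with a Scala parallel collection, and writing predictions back to a plain local file. The BatchPredictSketch object and its predict function are hypothetical stand-ins for dispatching to the engine's algorithm; the sketch assumes one JSON query per line and Scala 2.11/2.12, where .par is part of the standard library.

```scala
import scala.io.Source
import java.io.PrintWriter

// Minimal sketch of the new processing shape (not the PR's actual code).
object BatchPredictSketch {

  // Placeholder: the real command dispatches the query to the engine's
  // algorithm and serializes the resulting prediction back to JSON.
  def predict(queryJson: String): String = queryJson

  def main(args: Array[String]): Unit = {
    val inputPath  = args(0) // plain local file, one query per line
    val outputPath = args(1) // plain local file, one prediction per line

    // Read the whole input locally; with no Spark RDD backing the input,
    // very large batches may need to be split across multiple runs.
    val queries = Source.fromFile(inputPath).getLines().toVector

    // .par hands parallelism to Scala's parallel collections (a shared
    // fork-join pool) instead of the removed --query-partitions option.
    val results = queries.par.map(predict).seq

    val writer = new PrintWriter(outputPath)
    try results.foreach(writer.println) finally writer.close()
  }
}
```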

This solves the root problem: certain custom PersistentModels, such as the one in the ALS Recommendation template, may themselves contain RDDs, and those RDDs cannot be nested inside the batch-query RDD (see SPARK-5063).
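For context, here is a hedged illustration of the SPARK-5063 limitation; RddBackedModel and the query types are hypothetical, not PredictionIO classes. The nested variant fails at runtime with a SparkException, while the driverSide variant roughly mirrors what the reworked batchpredict does with a parallel collection.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical model type, standing in for a PersistentModel that keeps its
// data in RDDs (as the ALS Recommendation template's model does).
case class RddBackedModel(userFactors: RDD[(Int, Array[Double])]) {
  def predict(userId: Int): Array[Double] =
    // lookup() launches a Spark job, which is only legal on the driver.
    userFactors.lookup(userId).headOption.getOrElse(Array.empty[Double])
}

object NestedRddIllustration {

  // Broken: the closure passed to queries.map runs on executors, where using
  // the model's inner RDD triggers the SPARK-5063 guard ("RDD transformations
  // and actions can only be invoked by the driver").
  def nested(model: RddBackedModel, queries: RDD[Int]): RDD[Array[Double]] =
    queries.map(q => model.predict(q))

  // Works: queries stay in a driver-side (parallel) Scala collection, so each
  // prediction is an ordinary Spark job submitted from the driver.
  def driverSide(model: RddBackedModel, queries: Seq[Int]): Seq[Array[Double]] =
    queries.par.map(q => model.predict(q)).seq
}
```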

mars · Nov 17 '17

I'm currently testing this change with various engines and large batches.

mars · Nov 17 '17

Tested this new pio batchpredict with all three model types:

  • ✅ custom PersistentModel (ALS Recommendation)
  • ✅ built-in, default model serialization (Classification)
  • ✅ null model (Universal Recommender)

This PR is ready to go!

mars · Nov 18 '17

BTW, I found that performance for a large batch of 250K queries running on a single multi-core machine is equivalent to the previous Spark RDD-based implementation.

mars · Nov 18 '17

This PR stalled due to @dszeto's concerns about removing the distributed processing capability from pio batchpredict. I agree that distributed batch processing is optimal, but I do not have a solution for the nested-RDD problem encountered with RDD-backed persistent models.

mars · Dec 14 '17