[SPARK-40476][ML][SQL] Reduce the shuffle size of ALS

Open zhengruifeng opened this issue 3 years ago • 4 comments

What changes were proposed in this pull request?

Implement a new expression, CollectTopK, which uses an Array instead of a BoundedPriorityQueue for ser/deser.
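For readers unfamiliar with the idea, here is a minimal, self-contained sketch of keeping a bounded heap while aggregating but shipping only a flat array of primitives across the wire. It is not the actual CollectTopK code: `TopKBuffer`, its methods, and the byte layout are hypothetical names and choices made for illustration, and the (Int, Float) element type is an assumption.

```scala
import java.nio.ByteBuffer
import scala.collection.mutable

// Hypothetical illustration only, not Spark's CollectTopK.
// Keeps the k highest-scoring (id, score) pairs.
final class TopKBuffer(k: Int) {
  // Ordering under which the *lowest* score has the highest priority,
  // so `heap.head` is always the smallest retained score.
  private val lowestFirst: Ordering[(Int, Float)] =
    Ordering.fromLessThan[(Int, Float)]((a, b) => a._2 > b._2)

  private val heap = mutable.PriorityQueue.empty[(Int, Float)](lowestFirst)

  def add(id: Int, score: Float): Unit = {
    if (heap.size < k) {
      heap.enqueue((id, score))
    } else if (score > heap.head._2) {
      heap.dequeue() // evict the current minimum
      heap.enqueue((id, score))
    }
  }

  // Retained pairs, highest score first.
  def toSortedArray: Array[(Int, Float)] =
    heap.toArray.sortWith(_._2 > _._2)

  // Assumed layout: element count (4 bytes), then (id, score) pairs, 8 bytes each.
  // Only the elements are written; no queue or comparator objects are serialized.
  def serialize: Array[Byte] = {
    val items = toSortedArray
    val buf = ByteBuffer.allocate(4 + items.length * 8)
    buf.putInt(items.length)
    items.foreach { case (id, score) =>
      buf.putInt(id)
      buf.putFloat(score)
    }
    buf.array()
  }
}

object TopKBuffer {
  // Rebuild the buffer on the receiving side from the flat byte array.
  def deserialize(k: Int, bytes: Array[Byte]): TopKBuffer = {
    val buf = ByteBuffer.wrap(bytes)
    val n = buf.getInt()
    val out = new TopKBuffer(k)
    var i = 0
    while (i < n) {
      out.add(buf.getInt(), buf.getFloat())
      i += 1
    }
    out
  }
}
```

With this layout the serialized size depends only on the number of retained elements (4 + 8 * n bytes), which is the kind of saving the array-based ser/deser is aiming for.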

Why are the changes needed?

To reduce the shuffle size of ALS during prediction.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

zhengruifeng avatar Sep 17 '22 00:09 zhengruifeng

Take ALSExample as an example:

// Run in spark-shell, where `spark` and `spark.implicits._` are already in scope.
import org.apache.spark.ml.recommendation._

case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

// Parse one "userId::movieId::rating::timestamp" line.
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}

val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt").map(parseRating).toDF()

val als = new ALS().setMaxIter(1).setRegParam(0.01).setUserCol("userId").setItemCol("movieId").setRatingCol("rating")

val model = als.fit(ratings)

// Top 10 users per item; this is the step whose shuffle size is compared below.
model.recommendForAllItems(10).collect()

before: image

after: image

The shuffle size in this case was reduced from 298.4 KiB to 130.3 KiB (about 56% smaller).

zhengruifeng avatar Sep 17 '22 00:09 zhengruifeng

@dongjoon-hyun

> could you make an independent PR moving TopByKeyAggregator to CollectTopK because that is orthogonal to Reduce the shuffle size of ALS?

It is precisely the move from TopByKeyAggregator to CollectTopK that reduces the shuffle size, since the ser/deser is optimized in CollectTopK. Let me update the PR description.

> In addition, we need test coverage for CollectTopK because we remove TopByKeyAggregatorSuite.

Sure, will update soon.

zhengruifeng avatar Sep 18 '22 23:09 zhengruifeng

Thanks. If the PR title is clear, +1 for that.

dongjoon-hyun avatar Sep 18 '22 23:09 dongjoon-hyun

cc @srowen @WeichenXu123

zhengruifeng avatar Sep 19 '22 09:09 zhengruifeng

Merged to master

srowen avatar Sep 22 '22 13:09 srowen

Thanks for the reviews!

zhengruifeng avatar Sep 23 '22 00:09 zhengruifeng

Thanks! :)

WeichenXu123 avatar Sep 23 '22 14:09 WeichenXu123