listenbrainz-server

Speed up stats processing in Spark cluster

Open amCap1712 opened this issue 9 months ago • 0 comments

  1. Write a copy of the listens to HDFS when a full dump is imported. This speeds up filtering of listens and, in many cases, the processing that follows.
  2. Remove Pydantic validation in places where it seemed redundant or of little use.
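
To illustrate the idea behind step 2, here is a minimal sketch using only the standard library as a stand-in for Pydantic. The `ListenRecord` class, its fields, and both helper functions are hypothetical and are not taken from the listenbrainz-server code; the point is only that rows already produced by a trusted pipeline can skip per-row checks.

```python
from dataclasses import dataclass


@dataclass
class ListenRecord:
    """Hypothetical record; the field names are illustrative assumptions."""
    user_id: int
    listened_at: int
    recording: str


def build_validated(row: dict) -> ListenRecord:
    """Check every field before building the record (the slower path)."""
    if not isinstance(row["user_id"], int) or row["user_id"] < 0:
        raise ValueError("user_id must be a non-negative integer")
    if not isinstance(row["listened_at"], int) or row["listened_at"] < 0:
        raise ValueError("listened_at must be a non-negative integer")
    if not isinstance(row["recording"], str) or not row["recording"]:
        raise ValueError("recording must be a non-empty string")
    return ListenRecord(**row)


def build_trusted(row: dict) -> ListenRecord:
    """Build the record without checks, for rows our own pipeline produced."""
    return ListenRecord(**row)


good_row = {"user_id": 1, "listened_at": 1715339100, "recording": "abc"}
bad_row = {"user_id": -5, "listened_at": 1715339100, "recording": "abc"}
```

Both paths produce the same object for well-formed input, so skipping the checks only changes behavior for malformed rows. That is why it is reasonable only where the input is already known to be clean.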

Before this PR, an entire stats run took about 9 hours. With step 2 alone, it went down to 6.25 hours, and with step 1 on top of that, it went down to 5.75 hours.
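
For reference, the relative gains implied by these timings can be worked out as follows. This is only arithmetic on the three figures quoted above; the function and variable names are made up for the example.

```python
def reduction(before_hours: float, after_hours: float) -> tuple[float, float]:
    """Return (hours saved, percent reduction) between two run times."""
    saved = before_hours - after_hours
    return saved, 100.0 * saved / before_hours


# Run times quoted in the description, in hours.
baseline, after_step2, after_both = 9.0, 6.25, 5.75

step2_saved, step2_pct = reduction(baseline, after_step2)    # validation removed
step1_saved, step1_pct = reduction(after_step2, after_both)  # HDFS copy added
total_saved, total_pct = reduction(baseline, after_both)     # both changes
speedup = baseline / after_both

print(f"step 2: {step2_saved:.2f} h saved ({step2_pct:.1f}%)")
print(f"step 1: {step1_saved:.2f} h saved ({step1_pct:.1f}%)")
print(f"total:  {total_saved:.2f} h saved ({total_pct:.1f}%), {speedup:.2f}x faster")
```

So removing validation accounts for most of the gain (2.75 of the 3.25 hours saved), while the HDFS copy contributes the remaining half hour.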

amCap1712, May 10 '24 11:05