listenbrainz-server
Speed up stats processing in Spark cluster
- Write a copy of the listens to HDFS when a full dump is imported. This speeds up filtering of listens and, in many cases, overall processing.
- Remove Pydantic validation in places where it seemed redundant or of little use.
Before this PR, an entire stats run took about 9 hours. Removing the Pydantic validation (the second item above) brought it down to 6.25 hours, and adding the HDFS copy of the listens (the first item above) on top of that brought it down to 5.75 hours.
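To put the timings in relative terms, here is a small sketch of the arithmetic. It uses only the three run times quoted above; the function name `reduction` and the labels are made up for illustration, and since the 9 hour baseline is approximate, the percentages are approximate too.

```python
def reduction(before_hours, after_hours):
    """Return (hours saved, percent reduction) relative to before_hours."""
    saved = before_hours - after_hours
    return saved, 100.0 * saved / before_hours


# Run times quoted in this PR description (the baseline is "about" 9 hours).
baseline = 9.0
runs = [
    ("Pydantic validation removed", 6.25),
    ("plus HDFS copy of listens", 5.75),
]

for label, after in runs:
    saved, pct = reduction(baseline, after)
    print(f"{label}: {after} h, saved {saved:.2f} h ({pct:.1f}%)")
```

By this calculation most of the gain, roughly 2.75 of the 3.25 hours saved, comes from dropping the validation, with the HDFS copy contributing about another half hour.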