chronon
Group by upload: use repartition to increase parallelism
Summary
The group-by upload input RDD has few partitions, each compact in size. This can lead to executor OOM while converting the data to Chronon rows.
Repartition the RDD to the default parallelism to improve scalability.
Tested with the Relevance team's upload job: the running time dropped from 40+ minutes to under 15 minutes.
The downside is that repartition will trigger a shuffle.
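For reviewers who want a concrete picture of the idea, here is a minimal sketch, not the code in this PR. It only uses standard Spark calls (`RDD.repartition`, `RDD.getNumPartitions`, `SparkContext.defaultParallelism`). The object name `RepartitionSketch`, the helper `withDefaultParallelism`, and the "only repartition when the input is narrower than the target" guard are assumptions made for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RepartitionSketch {
  // Hypothetical helper (not from the PR): widen an RDD to the context's
  // default parallelism. The guard is an assumption; it skips the shuffle
  // when the input already has at least that many partitions.
  def withDefaultParallelism[T](rdd: RDD[T]): RDD[T] = {
    val target = rdd.sparkContext.defaultParallelism
    if (rdd.getNumPartitions < target) rdd.repartition(target) else rdd
  }

  def main(args: Array[String]): Unit = {
    // Local session with 4 worker threads, purely for demonstration.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("repartition-sketch")
      .getOrCreate()

    // Simulate an input that arrives in a single partition.
    val input = spark.sparkContext.parallelize(1 to 1000, numSlices = 1)
    val widened = withDefaultParallelism(input)

    println(s"before=${input.getNumPartitions}, after=${widened.getNumPartitions}")
    spark.stop()
  }
}
```

`repartition` always performs a full shuffle, which is the cost noted above. The trade is that the per-task memory needed for the row conversion shrinks because each task handles a smaller slice of the input.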
Why / Goal
Improve performance.
Test Plan
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
Checklist
- [ ] Documentation update
Reviewers
@nikhilsimha @hzding621