Group by upload: use repartition to increase parallelism

Open pengyu-hou opened this issue 1 year ago • 0 comments

Summary

The group-by upload input RDD has few partitions, each compact in size. This can lead to executor OOM while converting to Chronon rows.

Repartition the input to the default parallelism to improve scalability.

Tested with the Relevance team's upload job. The running time dropped from 40+ minutes to less than 15 minutes.

The downside is that repartition will trigger a shuffle.
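
The idea can be sketched roughly as follows. This is a minimal illustration, not the actual Chronon change: the object `RepartitionSketch`, the helper `withDefaultParallelism`, and the choice to skip the repartition when the input already has enough partitions are all assumptions made for the example. Only `RDD.repartition`, `RDD.getNumPartitions`, and `SparkContext.defaultParallelism` are real Spark APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RepartitionSketch {
  // Hypothetical helper (not Chronon's code): widen an RDD to the
  // SparkContext's default parallelism when it has fewer partitions.
  def withDefaultParallelism[T](rdd: RDD[T]): RDD[T] = {
    val target = rdd.sparkContext.defaultParallelism
    // repartition always performs a full shuffle, so skip it when the
    // input is already at least as wide as the target (an assumption
    // of this sketch, not something stated in the PR).
    if (rdd.getNumPartitions < target) rdd.repartition(target) else rdd
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]") // local mode, for illustration only
      .appName("repartition-sketch")
      .getOrCreate()

    // Stand-in for the upload input: many rows packed into 2 partitions.
    val input = spark.sparkContext.parallelize(1 to 1000000, numSlices = 2)
    val widened = withDefaultParallelism(input)

    println(s"before=${input.getNumPartitions}, after=${widened.getNumPartitions}")
    spark.stop()
  }
}
```

Spreading the rows across more partitions means each task holds fewer rows in memory during the row conversion, which is where the OOM pressure comes from. The cost is the shuffle noted above.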

Why / Goal

Improve performance.

Test Plan

  • [ ] Added Unit Tests
  • [x] Covered by existing CI
  • [ ] Integration tested

Checklist

  • [ ] Documentation update

Reviewers

@nikhilsimha @hzding621

pengyu-hou, Oct 31 '23 03:10