orca: implement intra-segment parallel table scan support
Add parallel table scan support to the GPORCA optimizer, enabling worker-level parallelism within each segment to speed up large table scans.
Key components:
- New CPhysicalParallelTableScan operator and CDistributionSpecWorkerRandom distribution specification for worker-level data distribution
- CXformGet2ParallelTableScan transformation with parallel safety checks (excludes CTEs, dynamic scans, foreign tables, replicated tables, etc.)
- Cost model integration: adds parallel_setup_cost and scales parallel efficiency down logarithmically with worker count
- DXL serialization/deserialization for CDXLPhysicalParallelTableScan
- Plan translation to PostgreSQL SeqScan nodes with parallel_aware=true
- Rewindability constraints (parallel scans are non-rewindable)
- GUC integration: max_parallel_workers_per_gather controls worker count
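
The cost model item above can be illustrated with a small sketch. This is not the formula used in the PR, which this description does not spell out. It assumes an efficiency of `1 / (1 + k * log2(workers))`, and the function names, the `degradation` constant, and the default of 1000 for `parallel_setup_cost` (PostgreSQL's default for that GUC) are illustrative choices only.

```python
import math


def parallel_scan_cost(serial_cost, workers, parallel_setup_cost=1000.0, degradation=0.1):
    """Hypothetical parallel scan cost; NOT GPORCA's actual formula.

    Assumes per-worker efficiency decays logarithmically with worker count.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Efficiency is 1.0 for a single worker and shrinks as workers are added.
    efficiency = 1.0 / (1.0 + degradation * math.log2(workers))
    effective_workers = workers * efficiency
    # Fixed startup charge plus the scan work divided among effective workers.
    return parallel_setup_cost + serial_cost / effective_workers


def should_parallelize(serial_cost, workers, **kwargs):
    """Pick the parallel plan only when its estimated cost beats the serial scan."""
    return workers > 1 and parallel_scan_cost(serial_cost, workers, **kwargs) < serial_cost


if __name__ == "__main__":
    for cost in (500.0, 10000.0):
        print(cost, parallel_scan_cost(cost, 4), should_parallelize(cost, 4))
```

Under these assumptions the setup cost dominates small scans, so only large tables choose the parallel plan, which matches the stated goal of the change.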
Implements https://github.com/apache/cloudberry/discussions/1316
Benchmark: TPC-H, 10 GB
| Query ID | Parallel Duration (s) | Non-parallel Duration (s) | Time Saved (s) | Improvement Rate |
|---|---|---|---|---|
| 01 | 16 | 30 | 14 | 46.67% |
| 02 | 3 | 4 | 1 | 25.00% |
| 03 | 11 | 20 | 9 | 45.00% |
| 04 | 7 | 15 | 8 | 53.33% |
| 05 | 8 | 8 | 0 | 0.00% |
| 06 | 3 | 5 | 2 | 40.00% |
| 07 | 5 | 8 | 3 | 37.50% |
| 08 | 6 | 8 | 2 | 25.00% |
| 09 | 10 | 15 | 5 | 33.33% |
| 10 | 6 | 7 | 1 | 14.29% |
| 11 | 1 | 2 | 1 | 50.00% |
| 12 | 5 | 7 | 2 | 28.57% |
| 13 | 4 | 6 | 2 | 33.33% |
| 14 | 3 | 5 | 2 | 40.00% |
| 15 | 5 | 5 | 0 | 0.00% |
| 16 | 2 | 1 | -1 | -100.00% |
| 17 | 34 | 62 | 28 | 45.16% |
| 18 | 22 | 28 | 6 | 21.43% |
| 19 | 3 | 5 | 2 | 40.00% |
| 20 | 6 | 11 | 5 | 45.45% |
| 21 | 22 | 25 | 3 | 12.00% |
| 22 | 4 | 6 | 2 | 33.33% |
Conclusion: with parallel scans enabled, total TPC-H execution time dropped from 284 seconds to 186 seconds, a saving of 98 seconds (34.51%).
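
The totals can be recomputed from the per-query rows in the table above. The sketch below copies the whole-second durations from the table; the function names are illustrative. Note that the rows as listed sum to 283 s non-parallel rather than the 284 s quoted, possibly because the per-query durations were rounded to whole seconds.

```python
# (query id, parallel seconds, non-parallel seconds), copied from the table above.
ROWS = [
    (1, 16, 30), (2, 3, 4), (3, 11, 20), (4, 7, 15), (5, 8, 8), (6, 3, 5),
    (7, 5, 8), (8, 6, 8), (9, 10, 15), (10, 6, 7), (11, 1, 2), (12, 5, 7),
    (13, 4, 6), (14, 3, 5), (15, 5, 5), (16, 2, 1), (17, 34, 62), (18, 22, 28),
    (19, 3, 5), (20, 6, 11), (21, 22, 25), (22, 4, 6),
]


def improvement_rate(parallel, serial):
    """Time saved as a percentage of the non-parallel duration."""
    return round(100.0 * (serial - parallel) / serial, 2)


def summarize(rows):
    """Return (parallel total, non-parallel total, seconds saved, percent saved)."""
    parallel = sum(p for _, p, _ in rows)
    serial = sum(s for _, _, s in rows)
    return parallel, serial, serial - parallel, improvement_rate(parallel, serial)


if __name__ == "__main__":
    print(summarize(ROWS))          # (186, 283, 97, 34.28)
    print(improvement_rate(2, 1))   # -100.0, the rate for query 16
```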
Please add more test cases
See src/test/regress: installcheck-orca-parallel.
Add some cases to test the plan?
Maybe after parallel hash join is implemented.