
Lower the row count threshold of Parallel

Open • pan3793 opened this issue 2 years ago • 1 comment

Parallel has a hard-coded threshold of 1 million rows: tables with fewer rows are not split for parallel generation. This is unfriendly to distributed computing engines that want to generate data in parallel.

https://github.com/trinodb/tpcds/blob/8a02abbba864feedc2afd078c8153d66a95bb2d4/src/main/java/io/trino/tpcds/Parallel.java#L26-L36

For example, in Spark, generating tpcds.sf1.web_sales in a single thread takes 23s.

WDYT about setting the threshold to 1k or 10k instead of 1M?
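
To illustrate the effect of the threshold, here is a minimal, self-contained sketch of threshold-gated chunking. It is not the code in Parallel.java: the class `ChunkSketch`, the method `splitWork`, its parameters, and the 700,000-row table size are all hypothetical, and the choice to give every row to chunk 1 when the table is below the threshold is an assumption made for the example.

```java
// Hypothetical sketch of threshold-gated chunking; NOT the tpcds library's API.
final class ChunkSketch {
    /**
     * Returns {firstRow, rowCount} for a 1-based chunkNumber.
     * Assumption: below the threshold, chunk 1 produces the whole table
     * and every other chunk produces nothing.
     */
    static long[] splitWork(long totalRows, int parallelism, int chunkNumber, long threshold) {
        if (totalRows < threshold || parallelism <= 1) {
            return chunkNumber == 1 ? new long[] {1, totalRows} : new long[] {1, 0};
        }
        long base = totalRows / parallelism;   // rows every chunk gets
        long extra = totalRows % parallelism;  // leftover rows, one each to the first chunks
        long firstRow = 1 + (chunkNumber - 1) * base + Math.min(chunkNumber - 1, extra);
        long rowCount = base + (chunkNumber <= extra ? 1 : 0);
        return new long[] {firstRow, rowCount};
    }

    public static void main(String[] args) {
        long rows = 700_000L; // illustrative table size below 1M rows
        for (long threshold : new long[] {1_000_000L, 10_000L}) {
            System.out.println("threshold=" + threshold);
            for (int chunk = 1; chunk <= 4; chunk++) {
                long[] b = splitWork(rows, 4, chunk, threshold);
                System.out.println("  chunk " + chunk + ": firstRow=" + b[0] + ", rowCount=" + b[1]);
            }
        }
    }
}
```

In this sketch, a 1M threshold leaves all 700,000 rows to a single chunk, while a 10k threshold spreads them evenly over the four chunks, which is the behavior a distributed engine would need to parallelize a table of that size.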

pan3793 May 16 '22 07:05

cc @ebyhr

pan3793 May 16 '22 07:05