tpcds
Lower the row count threshold for parallel generation
There is a hard-coded threshold of 1 million rows in `Parallel`, which is unfriendly to distributed computing engines that want to generate data in parallel.
https://github.com/trinodb/tpcds/blob/8a02abbba864feedc2afd078c8153d66a95bb2d4/src/main/java/io/trino/tpcds/Parallel.java#L26-L36
For example, in Spark, generating `tpcds.sf1.web_sales` in a single thread takes 23s.
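
To illustrate the effect, here is a minimal sketch of how a row count threshold can gate chunking. It is not the code from `Parallel.java`: the class `ChunkSplitSketch`, the method `splitWork`, its parameters, and the 700,000 row count are all hypothetical and only model the behavior described above, assuming that a table below the threshold is handed entirely to the first chunk.

```java
// Hypothetical sketch, not the actual io.trino.tpcds.Parallel implementation.
final class ChunkSplitSketch
{
    private ChunkSplitSketch() {}

    // Returns {firstRow, lastRow} (1-based, inclusive) for the given chunk.
    // An empty range is encoded as {1, 0}.
    static long[] splitWork(long rowCount, int parallelism, int chunk, long threshold)
    {
        if (chunk < 1 || chunk > parallelism) {
            throw new IllegalArgumentException("chunk must be in [1, parallelism]");
        }
        // Assumed behavior: small tables are not split, chunk 1 does all the work.
        if (rowCount < threshold) {
            return chunk == 1 ? new long[] {1, rowCount} : new long[] {1, 0};
        }
        long rowsPerChunk = rowCount / parallelism;
        long first = (chunk - 1) * rowsPerChunk + 1;
        // The last chunk also takes the remainder rows.
        long last = (chunk == parallelism) ? rowCount : chunk * rowsPerChunk;
        return new long[] {first, last};
    }

    public static void main(String[] args)
    {
        long rowCount = 700_000; // illustrative table size below 1M rows
        for (long threshold : new long[] {1_000_000, 10_000}) {
            System.out.println("threshold=" + threshold);
            for (int chunk = 1; chunk <= 4; chunk++) {
                long[] range = splitWork(rowCount, 4, chunk, threshold);
                System.out.println("  chunk " + chunk + ": [" + range[0] + ", " + range[1] + "]");
            }
        }
    }
}
```

With the 1M threshold, chunks 2 to 4 in this sketch get an empty range, so an engine that schedules one task per chunk still does all the work in a single task. With a 10k threshold the same table is split evenly across the four chunks.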
WDYT about lowering the threshold to 1k or 10k instead of 1M?
cc @ebyhr