Spark-sql-perf data generation got very slow

Open jameszhouyi opened this issue 8 years ago • 4 comments

Hi experts @davies, I am using spark-sql-perf to generate 1 TB of TPC-DS data with partitionTables enabled, e.g. tables.genData("hdfs://ip:8020/tpctest", "parquet", true, true, false, false, false). Some of the big tables (e.g., store_sales) are slow to complete (about 3 hours on 4 slave nodes). I observed that all data is first written to /tpcds_1t/store_sales/_temporary/0 and then moved to /tpcds_1t/store_sales on HDFS, and these moves take a long time to complete. Has anyone else come across the same issue? How can it be resolved? BTW, we use the TPC-DS kit from https://github.com/davies/tpcds-kit

Thanks in advance!

jameszhouyi avatar Oct 21 '16 06:10 jameszhouyi

Moving files in HDFS should be very fast. Could you take a thread dump while it's moving files?

davies avatar Oct 21 '16 06:10 davies

Hi @davies @jameszhouyi, I am unable to load 1 TB of data. I am getting "Container marked as failed: container_1494249330558_0002_01_000005". What parameters did you give Spark to run with (executor-memory, driver-memory)?

Thanks!

dark-spark2 avatar May 09 '17 07:05 dark-spark2

Hello folks. How do I make the data gen multi-threaded? It appears I can only run one instance of dsdgen to generate the data. If I spin up 4 on 4 compute servers, one takes ownership of the files and the others can't.

v-olmedo avatar May 18 '18 16:05 v-olmedo

@v-olmedo The README describes how to use parallel dsdgens. In particular, see the line: numPartitions = 100) // how many dsdgen partitions to run - number of input tasks. Note that you need to use dsdgen from https://github.com/databricks/tpcds-kit
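
For context, a rough sketch of what such a call can look like when run from spark-shell with the spark-sql-perf jar on the classpath. The parameter names follow the README as best I recall and may differ between versions of the library, so check them against the version you build. The dsdgenDir path and the scale factor are placeholder assumptions; the output location is the one from the original question.

```scala
// Sketch only: assumes spark-shell (so `spark` is defined) and that
// spark-sql-perf is on the classpath. Names may differ in your version.
import com.databricks.spark.sql.perf.tpcds.TPCDSTables

// Placeholder values (assumptions, adjust for your cluster).
val rootDir = "hdfs://ip:8020/tpctest"  // output location from the question above
val dsdgenDir = "/tmp/tpcds-kit/tools"   // hypothetical path; dsdgen must exist here on every worker
val scaleFactor = "1000"                 // dataset size in GB, i.e. 1 TB

val tables = new TPCDSTables(
  spark.sqlContext,
  dsdgenDir = dsdgenDir,
  scaleFactor = scaleFactor,
  useDoubleForDecimal = false,  // keep DecimalType columns
  useStringForDate = false)     // keep DateType columns

tables.genData(
  location = rootDir,
  format = "parquet",
  overwrite = true,                      // replace data that is already there
  partitionTables = true,                // write partitioned fact tables
  clusterByPartitionColumns = true,      // shuffle so each partition lands in fewer files
  filterOutNullPartitionValues = false,
  tableFilter = "",                      // "" generates all tables
  numPartitions = 100)                   // number of parallel dsdgen input tasks
```

Here numPartitions controls how many dsdgen invocations Spark schedules as input tasks, so the parallelism comes from Spark itself and there is no need to start dsdgen by hand on each server.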

juliuszsompolski avatar May 18 '18 19:05 juliuszsompolski