
[300G TPCH Benchmark] Analysis of 1.3.1 on single node

Open Manoj-red-hat opened this issue 2 years ago • 4 comments

Describe the bug I tried to benchmark the latest Gazelle on TPCH 300G data and found several anomalies.

Key points:

  1. Average runtime gain is 20-25%, while the advertised speedup is 1.5x
  2. Q1 is highly anomalous: it degraded by around 2x, while a 2x speedup is advertised
  3. Q7, Q8, and Q9 do not gain as much as advertised

[Vanilla Conf]

 spark.master                     spark://node-3:7077
 spark.eventLog.enabled           true
 spark.eventLog.dir               /hadoop/tmp/eventlog
 spark.history.fs.logDirectory    /hadoop/tmp/eventlog
 spark.serializer                 org.apache.spark.serializer.KryoSerializer
 spark.driver.cores   1
 spark.driver.memory  10g
 spark.driver.maxResultSize 6g
 spark.driver.memoryOverhead 2g
 spark.executor.memory 33g
 spark.executor.memoryOverhead 6g
 spark.local.dir /hadoop/tmp
 spark.executor.cores 5
 spark.ui.showConsoleProgress true
 spark.executor.instances 3
 spark.sql.warehouse.dir /hadoop/warehouse
 spark.kryoserializer.buffer.max 1g
 spark.sql.shuffle.partitions 90
 spark.sql.join.preferSortMergeJoin false
 spark.sql.dynamicPartitionPruning.enabled true
 spark.sql.optimizer.dynamicPartitionPruning.enforceBroadcastReuse true
 spark.sql.adaptive.advisoryPartitionSizeInBytes 100M
 spark.sql.autoBroadcastJoinThreshold 100M

[Gazelle Conf]

 spark.master                     spark://node-3:7077
 spark.eventLog.enabled           true
 spark.eventLog.dir               /hadoop/tmp/eventlog
 spark.history.fs.logDirectory    /hadoop/tmp/eventlog
 spark.serializer                 org.apache.spark.serializer.KryoSerializer
 spark.memory.offHeap.size 100G
 spark.memory.offHeap.enabled true
 spark.driver.cores   1
 spark.driver.memory  6g
 spark.driver.maxResultSize 1g
 spark.driver.memoryOverhead 5g
 spark.executor.memory 5g
 spark.executor.memoryOverhead 384m
 spark.local.dir /hadoop/tmp
 spark.executor.cores 5
 spark.ui.showConsoleProgress true
 spark.executor.instances 3
 spark.sql.warehouse.dir /hadoop/warehouse
 spark.kryoserializer.buffer.max 1g
 spark.sql.shuffle.partitions 90
 spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
 spark.oap.sql.columnar.preferColumnar true
 spark.sql.execution.arrow.maxRecordsPerBatch 20480
 spark.sql.parquet.columnarReaderBatchSize 20480
 spark.sql.inMemoryColumnarStorage.batchSize 20480
 spark.oap.sql.columnar.tmp_dir /hadoop/tmp
 spark.sql.join.preferSortMergeJoin false
 spark.sql.dynamicPartitionPruning.enabled true
 spark.sql.optimizer.dynamicPartitionPruning.enforceBroadcastReuse true
 spark.sql.adaptive.advisoryPartitionSizeInBytes 100M
 spark.sql.autoBroadcastJoinThreshold 100M
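
A quick sanity check, not part of the original setup, is to read these columnar settings back from the live SparkConf once the session is up. A minimal sketch for spark-shell (where spark is the active SparkSession), using only the keys listed above:

 // Read back the Gazelle-related settings from the live SparkConf (spark-shell).
 // A key that was never set prints "<not set>".
 Seq(
   "spark.shuffle.manager",
   "spark.memory.offHeap.enabled",
   "spark.memory.offHeap.size",
   "spark.oap.sql.columnar.preferColumnar"
 ).foreach { key =>
   println(s"$key = ${spark.sparkContext.getConf.getOption(key).getOrElse("<not set>")}")
 }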

[Runtime] (image attachment)

Attaching all my analysis

CPU configuration

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
Stepping:            4
CPU MHz:             2599.996
BogoMIPS:            5199.99
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear arch_capabilities

Disk : Samsung SSD 970 EVO Plus 2TB

Data_type : local filesystem (Parquet)

[Given in github] (image attachment)

[Populated one] (image attachment)

[Discrepancy] (image attachment)

Q1 is very anomalous, and in Q7, Q8, and Q9 the improvement is not as large as advertised.

To Reproduce Run TPCH 300G on a single node with the configuration above.
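
For context, replaying a single query such as Q1 from spark-shell looks roughly like the sketch below; the warehouse path and the use of temp views are assumptions, not the exact harness behind the numbers above.

 // Rough sketch of replaying TPC-H Q1 from spark-shell.
 // The base path is an assumption; adjust it to where the 300G Parquet tables live.
 val base = "/hadoop/warehouse/tpch300"
 Seq("lineitem", "orders", "customer", "supplier",
     "part", "partsupp", "nation", "region").foreach { t =>
   spark.read.parquet(s"$base/$t").createOrReplaceTempView(t)
 }

 // Standard TPC-H Q1 (delta = 90 days).
 val q1 = """
   select l_returnflag, l_linestatus,
          sum(l_quantity) as sum_qty,
          sum(l_extendedprice) as sum_base_price,
          sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
          sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
          avg(l_quantity) as avg_qty,
          avg(l_extendedprice) as avg_price,
          avg(l_discount) as avg_disc,
          count(*) as count_order
   from lineitem
   where l_shipdate <= date '1998-12-01' - interval '90' day
   group by l_returnflag, l_linestatus
   order by l_returnflag, l_linestatus
 """
 spark.time(spark.sql(q1).collect())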

Expected behavior A speedup of around 1.5x, as advertised.

Manoj-red-hat, May 23 '22 11:05

FYI @weiting-chen @PHILO-HE

Manoj-red-hat, May 23 '22 11:05

Hi @Manoj-red-hat, are you testing with a partitioned TPCH dataset?

Using a partitioned table introduces more Spark scheduling overhead, since it generates too many tasks, especially on Q1 at a small scale factor; in that case the aggregation part (CPU) is not the main pressure.

Please also note that the report was tested with the Decimal datatype. I've seen that most of the benefits come from the decimal-related aggregations.
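
Both points can be checked quickly from spark-shell; a minimal sketch, with the dataset path assumed (adjust it to the actual location):

 // 1. How many scan partitions (and hence first-stage tasks) the current layout produces.
 val lineitem = spark.read.parquet("/hadoop/warehouse/tpch300/lineitem")
 println(s"scan partitions: ${lineitem.rdd.getNumPartitions}")

 // 2. Whether the amount columns are decimal (as in the published report) or double.
 lineitem.schema
   .filter(f => Seq("l_quantity", "l_extendedprice", "l_discount", "l_tax").contains(f.name))
   .foreach(f => println(s"${f.name}: ${f.dataType}"))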

-yuan

zhouyuan, Jun 01 '22 03:06

@zhouyuan I am not using a partitioned TPCH dataset.

After more debugging, it looks like the Parquet --> Arrow conversion is pretty slow.

Attaching plans for your reference; see the scan times.

vanilla.pdf arrow_q1.pdf
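
To isolate the scan cost itself, one option is to time a bare read with no joins or wide aggregations, so that the Parquet-to-Arrow path (or the row-based reader, on vanilla Spark) dominates the runtime. A minimal sketch, with the path assumed, run once per build:

 // Time a bare scan so the read path dominates; compare the vanilla and Gazelle builds.
 // Counting named columns (rather than count(*)) forces the reader to materialize column data.
 val df = spark.read.parquet("/hadoop/warehouse/tpch300/lineitem")
 spark.time {
   df.selectExpr("count(l_extendedprice)", "count(l_shipdate)").collect()
 }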

Manoj-red-hat, Jun 01 '22 04:06

@Manoj-red-hat it looks like there are many small tasks in the first stage. Could you please try with a larger partition size? e.g. spark.sql.files.maxPartitionBytes=1073741824
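
For a quick trial without editing spark-defaults, the same setting can also be applied on the live session (it only affects queries planned afterwards); a small sketch with an assumed table path:

 // Raise the file-scan split size at runtime, then re-check how many scan tasks result.
 spark.conf.set("spark.sql.files.maxPartitionBytes", "1073741824")  // 1 GiB per split
 val parts = spark.read.parquet("/hadoop/warehouse/tpch300/lineitem").rdd.getNumPartitions
 println(s"scan partitions after the change: $parts")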

zhouyuan, Jun 01 '22 05:06