[300G TPCH Benchmark] Analysis of 1.3.1 on single node
Describe the bug
I tried to benchmark the latest Gazelle on TPCH 300G data and found several anomalies.
Key points:
- Average runtime gain is 20-25% (speed-up computed as in the sketch below), while the advertised speed-up is 1.5x
- Q1 is very anomalous: it degraded by around 2x, while a 2x speed-up is advertised
- Q7, Q8, Q9 gains are not as large as advertised
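For reference, the gains above are simple ratios of wall-clock runtimes; a minimal Scala sketch of how they could be computed (no measured numbers are hard-coded here):
// speedup = vanilla runtime / gazelle runtime; 1.5 means Gazelle is 1.5x faster
def speedup(vanillaSeconds: Double, gazelleSeconds: Double): Double =
  vanillaSeconds / gazelleSeconds
// percentage gain relative to vanilla, e.g. 25.0 means 25% faster
def gainPercent(vanillaSeconds: Double, gazelleSeconds: Double): Double =
  (vanillaSeconds - gazelleSeconds) / vanillaSeconds * 100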
[Vanilla Conf]
spark.master spark://node-3:7077
spark.eventLog.enabled true
spark.eventLog.dir /hadoop/tmp/eventlog
spark.history.fs.logDirectory /hadoop/tmp/eventlog
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.cores 1
spark.driver.memory 10g
spark.driver.maxResultSize 6g
spark.driver.memoryOverhead 2g
spark.executor.memory 33g
spark.executor.memoryOverhead 6g
spark.local.dir /hadoop/tmp
spark.executor.cores 5
spark.ui.showConsoleProgress true
spark.executor.instances 3
spark.sql.warehouse.dir /hadoop/warehouse
spark.kryoserializer.buffer.max 1g
spark.sql.shuffle.partitions 90
spark.sql.join.preferSortMergeJoin false
spark.sql.dynamicPartitionPruning.enabled true
spark.sql.optimizer.dynamicPartitionPruning.enforceBroadcastReuse true
spark.sql.adaptive.advisoryPartitionSizeInBytes 100M
spark.sql.autoBroadcastJoinThreshold 100M
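For context, per-query runtimes under this config can be measured from spark-shell with spark.time; a minimal sketch, assuming the TPCH tables are already registered and that the query file path (a placeholder here) holds the Q1 text:
// hypothetical path; adjust to the actual location of the TPCH query files
val q1 = scala.io.Source.fromFile("/path/to/tpch/queries/q1.sql").mkString
spark.time {
  spark.sql(q1).collect()  // force full execution so the timing covers the whole query
}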
[Gazelle Conf]
spark.master spark://node-3:7077
spark.eventLog.enabled true
spark.eventLog.dir /hadoop/tmp/eventlog
spark.history.fs.logDirectory /hadoop/tmp/eventlog
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.memory.offHeap.size 100G
spark.memory.offHeap.enabled true
spark.driver.cores 1
spark.driver.memory 6g
spark.driver.maxResultSize 1g
spark.driver.memoryOverhead 5g
spark.executor.memory 5g
spark.executor.memoryOverhead 384m
spark.local.dir /hadoop/tmp
spark.executor.cores 5
spark.ui.showConsoleProgress true
spark.executor.instances 3
spark.sql.warehouse.dir /hadoop/warehouse
spark.kryoserializer.buffer.max 1g
spark.sql.shuffle.partitions 90
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.oap.sql.columnar.preferColumnar true
spark.sql.execution.arrow.maxRecordsPerBatch 20480
spark.sql.parquet.columnarReaderBatchSize 20480
spark.sql.inMemoryColumnarStorage.batchSize 20480
spark.oap.sql.columnar.tmp_dir /hadoop/tmp
spark.sql.join.preferSortMergeJoin false
spark.sql.dynamicPartitionPruning.enabled true
spark.sql.optimizer.dynamicPartitionPruning.enforceBroadcastReuse true
spark.sql.adaptive.advisoryPartitionSizeInBytes 100M
spark.sql.autoBroadcastJoinThreshold 100M
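One way to sanity-check that the Gazelle settings above actually take effect is to inspect the physical plan of a simple aggregation and confirm the scan/aggregate operators appear as columnar variants rather than the vanilla row-based ones; a minimal sketch (the lineitem table name is an assumption):
// If Gazelle is active, the formatted plan should show Columnar* operators
// for the scan, project, and aggregate instead of the row-based defaults.
spark.sql("SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag")
  .explain("formatted")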
[Runtime]
Attaching all my analysis.
CPU configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 16
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
Stepping: 4
CPU MHz: 2599.996
BogoMIPS: 5199.99
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear arch_capabilities
Disk: Samsung SSD 970 EVO Plus 2TB
Data type: local filesystem (Parquet)
[Given in GitHub]
[Populated one]
[Discrepancy]
Q1 is very anomalous, and in Q7, Q8, Q9 the improvement is not as large as advertised.
To Reproduce
Run TPCH 300G on a single node with the configuration above.
Expected behavior
Getting around a 1.5x speed-up, as advertised.
FYI .. @weiting-chen @PHILO-HE
Hi @Manoj-red-hat, are you testing with a partitioned TPCH dataset?
Using a partitioned table will introduce more Spark scheduling overhead, as it generates too many tasks, especially on Q1 at a small scale factor, so the aggregation part (CPU) is not the main pressure.
Please also note that the report was tested with the Decimal datatype. I've seen that most of the benefits come from the decimal-related aggregations.
-yuan
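A quick way to see how many scan tasks the first stage will generate for a given table, which is what the comment above points at; a minimal sketch, assuming an unpartitioned Parquet lineitem table under the warehouse path from the configs (the exact path is an assumption):
// The number of input partitions roughly equals the number of scan tasks in the first stage.
val lineitem = spark.read.parquet("/hadoop/warehouse/lineitem")
println(s"scan partitions: ${lineitem.rdd.getNumPartitions}")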
@zhouyuan I am not using a partitioned TPCH dataset.
After more debugging, it looks like the Parquet --> Arrow conversion is pretty slow.
Attaching plans for your reference; see the scan times.
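To isolate the scan cost from the rest of a query, the Parquet read can be timed on its own; a minimal sketch (the lineitem path is an assumption):
import org.apache.spark.sql.functions.sum
// Aggregating a real column forces the scan to decode data,
// unlike a bare count() which can avoid materializing columns.
spark.time {
  spark.read.parquet("/hadoop/warehouse/lineitem")
    .select(sum("l_extendedprice"))
    .collect()
}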
@Manoj-red-hat it looks like there are many small tasks in the 1st stage, could you please try with larger partitions? e.g., spark.sql.files.maxPartitionBytes=1073741824
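For example, the suggested setting can be applied per session before re-running the query (value taken from the comment above, 1 GiB):
// Larger file split size => fewer, larger scan tasks in the first stage
spark.conf.set("spark.sql.files.maxPartitionBytes", "1073741824")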