incubator-gluten TPCDS queries on Gluten+Velox in EMR is considerably slower than OSS Spark

Backend

VL (Velox)

Bug description

[Expected behavior] Faster query runs compared to OSS Spark.

[Actual behavior] OSS Spark runs in half the time taken by Gluten+Velox Spark.

Spark version

None

Spark configurations

Gluten+Velox+Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cores 5 --num-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.plugins=io.glutenproject.GlutenPlugin --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=30g --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'

OSS Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cores 5 --num-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'

System information

Environment: Amazon EMR, 10 workers and 1 driver, all m5.4xlarge

OS: Amazon Linux 2

Relevant logs

Wondering what you need me to capture that'll help you

Mar 12 '24 21:03 sagarlakshmipathy

Hi @sagarlakshmipathy, can you please also share the performance numbers per query? On TPCDS, Q72 is still a problem for Gluten and needs some special configuration. Here is some discussion: https://github.com/apache/incubator-gluten/issues/1775

Are you testing with Hudi tables by any chance? I see this in your configuration: --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog. Hudi support is not ready in Gluten yet. The Hudi scan will actually run in vanilla Spark code, with a row-to-columnar conversion (a memcpy) connecting it to the Gluten native operators, so this will bring a lot of overhead.
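One way to check whether a query is paying for this kind of fallback is to print its physical plan (for example with `explain`) and look for row/columnar transition operators. The following is a minimal sketch that scans plan text for marker substrings. The marker names `RowToVeloxColumnar` and `VeloxColumnarToRow` are assumptions about how Gluten labels these operators and should be checked against your own plan output; the sample plan is made up for illustration and is not taken from this issue.

```python
# Sketch: find row<->columnar transitions in a Spark physical plan string.
# The default marker names are ASSUMED Gluten operator labels; adjust them
# to whatever your own plan output actually shows.
def find_transitions(plan_text, markers=("RowToVeloxColumnar", "VeloxColumnarToRow")):
    """Return (line_number, marker, stripped_line) for every marker hit."""
    hits = []
    for lineno, line in enumerate(plan_text.splitlines(), start=1):
        for marker in markers:
            if marker in line:
                hits.append((lineno, marker, line.strip()))
    return hits


# Hypothetical plan text, for illustration only.
sample_plan = """\
VeloxColumnarToRowExec
+- ProjectExecTransformer
   +- RowToVeloxColumnar
      +- FileScan (hypothetical Hudi scan running in vanilla Spark)
"""

for lineno, marker, line in find_transitions(sample_plan):
    print(f"line {lineno}: {marker} in '{line}'")
```

A transition sitting directly above a table scan would suggest that the scan itself is running outside the native backend.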

thanks, -yuan

Mar 13 '24 09:03 zhouyuan

Query ID	Gluten Velox Spark Hudi (ms)	OSS Spark Hudi (ms)
1	22040	16699
2	60531	33095
3	61031	25965
4	360561	172286
5	140865	72149
6	48038	22890
7	106637	44359
8	45072	19636
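The per-query slowdown implied by these numbers can be computed with a short script. This is only a sketch: the timings are copied from the table, and it assumes both columns are in milliseconds (the second column's unit is not stated explicitly in the original).

```python
# Timings copied from the table: query id -> (Gluten+Velox, OSS Spark).
# Assumption: both values are in milliseconds.
timings_ms = {
    1: (22040, 16699),
    2: (60531, 33095),
    3: (61031, 25965),
    4: (360561, 172286),
    5: (140865, 72149),
    6: (48038, 22890),
    7: (106637, 44359),
    8: (45072, 19636),
}

# Ratio > 1 means Gluten+Velox took longer than OSS Spark.
ratios = {q: gluten / oss for q, (gluten, oss) in timings_ms.items()}

total_gluten = sum(gluten for gluten, _ in timings_ms.values())
total_oss = sum(oss for _, oss in timings_ms.values())

for q, ratio in ratios.items():
    print(f"q{q}: {ratio:.2f}x")
print(f"total: {total_gluten / total_oss:.2f}x")
```

Over these eight queries the total comes out at roughly 2x, which matches the "half the time" description in the report.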

I didn't bother running the rest of them. I am testing Hudi tables with Gluten. Is there a GitHub issue or discussion I can +1?

Mar 14 '24 07:03 sagarlakshmipathy

It is quite likely due to the fallback when scanning Hudi tables. Here is the issue tracking the unified data lake design. Iceberg and Delta Lake are both supported now (though not 100%). https://github.com/apache/incubator-gluten/issues/3378

Thanks, -yuan

Mar 18 '24 00:03 zhouyuan
