delta
delta copied to clipboard
[BUG] Performance downgrade using OPTIMIZE (compaction) with Trino
Bug
Describe the problem
We were happy to see that OSS delta lake supported compaction, so we upgraded and tried to optimize our current tables. However, we observed the performance downgrade after compaction.
We use Starburst Enterprise (Trino) to query the delta lake on the top of S3, and here's our performance test using TPC-DS, and the chart below shows the size and object numbers of each table.
After compaction, each table's files were merged into one file, one example below.
Then we did the benchmark test and the comparsion, the below is the result of the query elapsed time, the left (blue) is before the compaction, and the right (red) is after.
The samiliar downgraded performance was also observed on our data (we have large partitioned tables, and test the benchmarks on select with different filters and complex join queries)
Environment information
- Delta Lake version: 1.2.1
- Spark version: 3.2.1
- Scala version: 2.12
- Starburst/Trino version: 370-e.3
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
- [ ] Yes. I can contribute a fix for this bug independently.
- [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- [x] No. I cannot contribute a bug fix at this time.
@Dearkano Thanks for reporting the issue. Is it possible to share the Trino query execution profile before optimize and after optimize for one of the TPCDS queries (preferably Q21 which seems to have the worst degradation)?
@vkorukanti sure, I will do the test and upload them later
@Dearkano - thanks! Looking forward to getting those query execution profiles so we can help debug this issue!
@vkorukanti @scottsand-db Hi, after we dive into the queries, we found that the root cause is that: After the compaction, the parallelism of Trino is significantly reduced, this is expected since Trino has only one file (in the TPC-DS case) to deal with in each table after compaction. So CPU time after compaction is only 1/2 or even 1/10 compared to before. But in the end, the queries take a longer elapsed time, with consuming fewer resources.
Below is the comparison of the elapsed time, it should be similar to my original post, but this time, we ingested two copies of TPC-DS and then optimized one of them.
And this one is the comparison of the CPU time, we can see that the compaction saves the CPU significantly.
Take one query as an example, the query text is:
select count(*)
from ((select distinct c_last_name, c_first_name, d_date
from store_sales, date_dim, customer
where store_sales.ss_sold_date_sk = date_dim.d_date_sk
and store_sales.ss_customer_sk = customer.c_customer_sk
and d_month_seq between 1197 and 1197+11)
except
(select distinct c_last_name, c_first_name, d_date
from catalog_sales, date_dim, customer
where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
and d_month_seq between 1197 and 1197+11)
except
(select distinct c_last_name, c_first_name, d_date
from web_sales, date_dim, customer
where web_sales.ws_sold_date_sk = date_dim.d_date_sk
and web_sales.ws_bill_customer_sk = customer.c_customer_sk
and d_month_seq between 1197 and 1197+11)
) cool_cust
This is the execution details of the query before compaction.
This is the execution details of the query after compaction.
Since we do not have cases where in each partition, there are tens of GBs of data (so there will be tens of files after compaction), we do not have benchmarks on cases like this.
In conclusion, the compaction will help Trino save resources while this may sacrifice the parallelism of Trino and make queries slower if the query resource is sufficient.
@vkorukanti Hi, is there any update on this issue? Thank you!
@Dearkano do you have the same problem on your production data? Wondering have you tried changing the max split size in Trino?
@Dearkano From the screenshots you posted, it looks like the number of stages is different? Is the query plan different when running on a compacted data? Wondering if any join optimization are triggered due to less size?
@vkorukanti Thank you for the reply! I have not tried max-split-size, do you mean that we can use this config to increase the parallelism? I attach the query plans here. From my side, the plan is indeed different just like the stage shows. There are 10 fragments in the plan with compaction, and 16 fragments in the plan without compaction. I'm not very familiar with the query plan so I'm not sure if there is any join optimization you mentioned.
Query Plan - Before Compaction.txt Query Plan - After Compaction.txt
@Dearkano, looking at the before and after query plans, it looks like the after compaction query has broadcast joins vs the hash partition join in before compaction query plans.
You could try setting the session property join_distribution_type
to PARTITIONED
to keep the plans same.
Regarding why the broadcast join is slow, need to repro to debug this.
Closing the stale issue. Feel free to reopen it if the above suggestion doesn't work