Spark: Fix a separate table cache being created for each rewriteFiles
Currently, during Spark's rewrite data files procedure with the bin pack strategy, the SparkSession is cloned to disable AQE in each rewriteFiles call. Since a cloned SparkSession has its own state, V2SessionCatalog is reloaded every time and a separate table cache is created. That means each file group has its own table cache, which effectively disables the table cache. This PR fixes it by cloning the SparkSession once, when creating SparkBinPackStrategy.
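To make the effect concrete, here is a toy model of the problem (plain Python with hypothetical names like `Catalog` and `Session`; this is not the actual Spark or Iceberg code): when every file group clones the session and each clone builds fresh catalog state, the per-instance table cache never gets a hit.

```python
# Toy model: a per-clone table cache defeats caching.
# All names are illustrative stand-ins, not real Spark/Iceberg APIs.

class Catalog:
    """Stand-in for V2SessionCatalog: caches loaded tables per instance."""
    loads = 0  # counts expensive metadata loads across all instances

    def __init__(self):
        self.table_cache = {}

    def load_table(self, name):
        if name not in self.table_cache:
            Catalog.loads += 1          # expensive load (e.g. metastore call)
            self.table_cache[name] = object()
        return self.table_cache[name]

class Session:
    """Stand-in for SparkSession; a clone gets fresh session state."""
    def __init__(self):
        self.catalog = Catalog()

    def clone(self):
        return Session()                # cloned state => brand-new catalog/cache

def rewrite_per_file_group(session, file_groups):
    # Old behavior: clone inside each rewriteFiles call.
    for _ in file_groups:
        session.clone().catalog.load_table("db.t")

def rewrite_per_action(session, file_groups):
    # Fixed behavior: clone once for the whole action, reuse the clone.
    cloned = session.clone()
    for _ in file_groups:
        cloned.catalog.load_table("db.t")

Catalog.loads = 0
rewrite_per_file_group(Session(), range(10))
per_group_loads = Catalog.loads        # one expensive load per file group

Catalog.loads = 0
rewrite_per_action(Session(), range(10))
per_action_loads = Catalog.loads       # single load; cache hits afterwards

print(per_group_loads, per_action_loads)  # 10 1
```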
@rdblue @aokolnychyi @kbendick Please help review and suggest where to add a UT. I haven't found a proper place.
I'm interested to hear what @szehon-ho and @RussellSpitzer think about this.
My initial reaction is that this is not something that we should change. We don't want to disable AQE for other Spark work, which is a side-effect of this change. I also don't like that we need to create a new Spark session for each rewrite, but I don't think there is much we can do to avoid it if we want to disable AQE. We could also fail if AQE is on or just accept the AQE results.
Also, is a separate table cache a bug? Since it is only used once, what is the problem with doing it this way? Sure, this won't cache the rewritten table, but is there a behavior problem or are loads just slightly slower?
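For context, the reason cloning is useful at all can be modeled simply (toy code, not the real SparkSession API): the clone starts from the parent's configuration but owns an independent copy, so AQE can be turned off for the rewrite without affecting other work on the original session.

```python
# Toy model of session cloning for config isolation (not real Spark API).
class Session:
    def __init__(self, conf=None):
        self.conf = dict(conf or {})

    def clone(self):
        # Clone inherits the current config but owns an independent copy.
        return Session(self.conf)

parent = Session({"spark.sql.adaptive.enabled": "true"})
clone = parent.clone()
clone.conf["spark.sql.adaptive.enabled"] = "false"  # disable AQE for rewrites only

print(parent.conf["spark.sql.adaptive.enabled"])  # parent is unaffected
```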
@rdblue I've moved cloning the session and disabling AQE into RewriteDataFilesSparkAction.
When multiple rewrite actions are submitted concurrently as follows, they block each other when loading the table due to locks in the shared HiveExternalCatalog. That hurts the overall performance of rewrite actions.

```sql
CALL spark_catalog.system.rewrite_data_files(
  table => 'default.table',
  options => map('max-concurrent-file-group-rewrites', '200')
);
```
There is another lock in SessionCatalog#tableExists for Spark versions before https://github.com/apache/spark/pull/31891, since the sessionCatalog is also shared.
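The contention can be pictured with a toy model (plain Python threading, not actual Spark internals): when every file-group rewrite goes through one shared lock, table loads serialize no matter how many threads are running.

```python
# Toy model: concurrent rewrite groups contending on one shared catalog lock.
import threading

class SharedCatalog:
    """Stand-in for a shared external catalog: all calls take one lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inside = 0
        self.max_concurrency = 0

    def load_table(self, name):
        with self._lock:                 # every caller serializes here
            self._inside += 1
            self.max_concurrency = max(self.max_concurrency, self._inside)
            threading.Event().wait(0.01)  # simulated metastore round trip
            self._inside -= 1

catalog = SharedCatalog()
threads = [threading.Thread(target=catalog.load_table, args=("db.t",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(catalog.max_concurrency)  # 1: loads never overlap under the shared lock
```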
Ignore my previous comments; I had the caches mixed up.
I will say that I don't love the idea of asking users to disable AQE themselves (i.e., not cloning the session and instead requiring AQE to be off), even though cloning the session is somewhat of a pain.
People use these statements at the end of queries and disabling AQE would be a bummer.
@manuzhang, it seems reasonable to create a session for the entire rewrite, not just each Spark submission. Is that what was happening before?
> Is that what was happening before?
Do you mean the session was created for the entire rewrite before?
> Do you mean the session was created for the entire rewrite before?
I'm asking you what the behavior was before this change that you want to fix.
Not sure. The furthest I can track back is https://github.com/apache/iceberg/pull/2591, and the behavior there is the same as now.
@rdblue Any more concerns or suggestions for this PR?
@rdblue and @kbendick any more comments?
Gentle ping @rdblue @aokolnychyi @kbendick for another review
@ajantha-bhat I forgot to remove cloneSession in SparkSortStrategy and SparkZOrderStrategy. Please review again.
Yeah, last time I commented that these other places also had issues, and I forgot to recheck whether that was addressed 😔. I think we now cover all the places. Thanks.
@ajantha-bhat @rdblue any more comments?
> @manuzhang, it seems reasonable to create a session for the entire rewrite, not just each Spark submission. Is that what was happening before?
Yes, basically the old behavior was to clone the Spark session for each file group rewrite, rather than once for the entire action.
@RussellSpitzer please check again whether it's fixed now.
@RussellSpitzer @rdblue @ajantha-bhat please take another look. Thanks.
I think that all of the session references are to the cloned Spark session now. +1.
Thanks, @manuzhang!
Thanks @rdblue @ajantha-bhat @RussellSpitzer @kbendick @hililiwei for your review. I opened #6284 and #6285 to back-port the PR to Spark 3.2 and 3.1 respectively. Please help review those PRs as well.