incubator-gluten icon indicating copy to clipboard operation
incubator-gluten copied to clipboard

[VL] Use conf to control C2R occupied memory

Open XinShuoWang opened this issue 1 year ago • 15 comments

What changes were proposed in this pull request?

In the current design, the Column2Row operation is completed in one go, which consumes a lot of memory and causes the program OOM. In this commit, I modified the C2R operation into multiple operations, which will greatly reduce the peak memory of the C2R operation. In addition, there should be some performance advantages from memory reuse.

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

XinShuoWang avatar Jun 02 '24 09:06 XinShuoWang

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

github-actions[bot] avatar Jun 02 '24 09:06 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jun 02 '24 09:06 github-actions[bot]

@XinShuoWang Can you update?

FelixYBW avatar Jun 17 '24 23:06 FelixYBW

Run Gluten Clickhouse CI

github-actions[bot] avatar Jun 20 '24 03:06 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jun 20 '24 06:06 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jun 20 '24 08:06 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jun 25 '24 02:06 github-actions[bot]

@XinShuoWang can you fix the UT?

FelixYBW avatar Jun 26 '24 01:06 FelixYBW

Run Gluten Clickhouse CI

github-actions[bot] avatar Jun 26 '24 03:06 github-actions[bot]

@XinShuoWang The UT fails as the default buffer size is 64MB and it's too large when task memory is small. You may can change the threshold value for these UTs.

yma11 avatar Jun 26 '24 03:06 yma11

Run Gluten Clickhouse CI

github-actions[bot] avatar Jun 26 '24 03:06 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 01 '24 03:07 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 01 '24 06:07 github-actions[bot]

@XinShuoWang can you fix the UT? the bug triggered many queries failed in our test.

FelixYBW avatar Jul 02 '24 19:07 FelixYBW

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 04 '24 06:07 github-actions[bot]

@XinShuoWang Would you continue to fix the issue? If not we can pickup and fix it. The bug blocked our adoption

FelixYBW avatar Jul 17 '24 00:07 FelixYBW

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 18 '24 07:07 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 20 '24 04:07 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 20 '24 04:07 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 20 '24 06:07 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 26 '24 07:07 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 26 '24 07:07 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Jul 26 '24 08:07 github-actions[bot]

Looks like the generator operator already limit the output rowvector size to the one we set.

The OOM isn't caused by C2R, but the R2C after parquet scan, because the struct column is too large. The way to solve OOM is to decrease the batch.

Another operator which may output large batch size is hashjoin. So maybe the PR is still useful.

image

FelixYBW avatar Jul 29 '24 01:07 FelixYBW

So maybe the PR is still useful.

I am thinking about using the similar mechanism with the one added in https://github.com/apache/incubator-gluten/pull/6009 as a general solution. So we don't have to add such kind of pre-processors for operators one by one @FelixYBW

zhztheplayer avatar Jul 29 '24 01:07 zhztheplayer

So maybe the PR is still useful.

I am thinking about using the similar mechanism with the one added in #6009 as a general solution. So we don't have to add such kind of pre-processors for operators one by one @FelixYBW

And the operator may need to be added in Velox to fit into Velox's task processing. We don't want to break whole-stage transformer by the operator in query plan.

zhztheplayer avatar Jul 29 '24 01:07 zhztheplayer

#6009

It should be more efficient to break or combine batches in the next operator, instead of implement a dedicated one. Dedicated operator must have a memcpy.

FelixYBW avatar Jul 29 '24 03:07 FelixYBW

ay need to be added in Velox to fit into Velox's task processing. We don't want to break whole-stage transformer by the operator in query plan.

Velox should hornor the max batch size always, or limit the batch by memory size.

FelixYBW avatar Jul 29 '24 03:07 FelixYBW

ow operation is completed in one go, which consumes a lot of memory and causes the program OOM. In this commit, I modified the C2

@XinShuoWang what's your case to have large batch size? My issue can be fixed by reduce spark.gluten.sql.columnar.maxBatchSize and spark.sql.parquet.columnarReaderBatchSize

FelixYBW avatar Jul 29 '24 03:07 FelixYBW

It's caused by a bug when calculating row size.

FelixYBW avatar Jul 31 '24 05:07 FelixYBW