[VL] Use conf to control C2R occupied memory
What changes were proposed in this pull request?
In the current design, the Column2Row (C2R) operation is completed in one pass, which consumes a lot of memory and can cause the program to OOM. In this commit, I split the C2R operation into multiple smaller operations, which greatly reduces the peak memory of C2R. In addition, there should be some performance benefit from memory reuse.
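To illustrate the idea, here is a minimal, self-contained sketch of chunked column-to-row conversion. All names (`RowBatch`, `columnsToRowsChunked`, `maxRowsPerBatch`) are hypothetical and not Gluten's actual API; the point is only that rows are materialized a bounded chunk at a time instead of all at once.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: a "row" here is one value per column.
struct RowBatch {
  std::vector<std::vector<int64_t>> rows;
};

// Convert column-major data to row batches of at most maxRowsPerBatch rows
// each, so peak memory is bounded by the chunk size rather than the input.
std::vector<RowBatch> columnsToRowsChunked(
    const std::vector<std::vector<int64_t>>& columns,
    std::size_t maxRowsPerBatch) {
  std::vector<RowBatch> out;
  if (columns.empty() || maxRowsPerBatch == 0) return out;
  const std::size_t numRows = columns[0].size();
  for (std::size_t start = 0; start < numRows; start += maxRowsPerBatch) {
    const std::size_t end = std::min(start + maxRowsPerBatch, numRows);
    RowBatch batch;
    batch.rows.reserve(end - start);
    for (std::size_t r = start; r < end; ++r) {
      std::vector<int64_t> row;
      row.reserve(columns.size());
      for (const auto& col : columns) row.push_back(col[r]);
      batch.rows.push_back(std::move(row));
    }
    out.push_back(std::move(batch));
  }
  return out;
}
```

In a real implementation the chunk buffer would also be reused across iterations, which is where the memory-reuse speedup mentioned above would come from.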
How was this patch tested?
Thanks for opening a pull request!
Could you open an issue for this pull request on Github Issues?
https://github.com/apache/incubator-gluten/issues
Then could you also rename the commit message and pull request title to the following format?
[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}
Run Gluten Clickhouse CI
@XinShuoWang Can you update?
@XinShuoWang can you fix the UT?
@XinShuoWang The UT fails because the default buffer size is 64MB, which is too large when task memory is small. You may need to change the threshold value for these UTs.
@XinShuoWang can you fix the UT? The bug caused many queries to fail in our tests.
@XinShuoWang Would you continue to fix the issue? If not, we can pick it up and fix it. The bug is blocking our adoption.
Looks like the generator operator already limits the output RowVector size to the one we set.
The OOM isn't caused by C2R, but by the R2C after the Parquet scan, because the struct column is too large. The way to solve the OOM is to decrease the batch size.
Another operator which may output a large batch is hash join, so maybe the PR is still useful.
I am thinking about using a similar mechanism to the one added in https://github.com/apache/incubator-gluten/pull/6009 as a general solution, so we don't have to add this kind of pre-processor for operators one by one. @FelixYBW
And the operator may need to be added in Velox to fit into Velox's task processing. We don't want the operator in the query plan to break the whole-stage transformer.
It should be more efficient to break or combine batches in the next operator, instead of implementing a dedicated one. A dedicated operator must do a memcpy.
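The "break batches in the next operator" idea can be sketched as zero-copy slicing: the consumer iterates over offset/length views of the incoming batch instead of a dedicated operator copying rows into new buffers. This is an illustrative sketch, not Velox's actual API; `Batch`, `BatchSlice`, and `sliceBatch` are made-up names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative column-major batch.
struct Batch {
  std::vector<std::vector<int64_t>> columns;
  std::size_t numRows() const { return columns.empty() ? 0 : columns[0].size(); }
};

// A slice is just an offset/length view over the parent batch's buffers,
// so splitting a large batch needs no memcpy.
struct BatchSlice {
  const Batch* parent;
  std::size_t offset;
  std::size_t length;
  int64_t at(std::size_t col, std::size_t row) const {
    return parent->columns[col][offset + row];  // indexing only, no copy
  }
};

// Yield views of at most maxRows rows each over the same underlying data.
std::vector<BatchSlice> sliceBatch(const Batch& b, std::size_t maxRows) {
  std::vector<BatchSlice> slices;
  if (maxRows == 0) return slices;
  for (std::size_t off = 0; off < b.numRows(); off += maxRows) {
    slices.push_back({&b, off, std::min(maxRows, b.numRows() - off)});
  }
  return slices;
}
```

The trade-off noted above: a standalone re-batching operator has to materialize (memcpy) its output, while a consumer working on views pays only for the indexing.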
Velox should always honor the max batch size, or limit the batch by memory size.
@XinShuoWang what's your case for having a large batch size? My issue can be fixed by reducing spark.gluten.sql.columnar.maxBatchSize and spark.sql.parquet.columnarReaderBatchSize.
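For reference, the two confs mentioned above can be lowered at submit time. The values and the application jar name below are illustrative; tune the sizes to your task memory.

```shell
# Smaller batches mean less memory materialized per C2R/R2C conversion.
# 1024 is an example value, not a recommendation.
spark-submit \
  --conf spark.gluten.sql.columnar.maxBatchSize=1024 \
  --conf spark.sql.parquet.columnarReaderBatchSize=1024 \
  your_app.jar
```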
It's caused by a bug when calculating the row size.