
Feature: add copy option `MAX_FILE_WRITE_THREADS`

Open youngsofun opened this issue 1 year ago • 2 comments

Summary

After https://github.com/datafuselabs/databend/pull/15596, file size control for Parquet is improved. But when there are many threads, blocks are likely to be spread across the writer threads, which results in relatively small files.

A grouping processor is used to group small blocks up to MAX_FILE_SIZE before they are distributed to the writer threads. But it is based on uncompressed size, so it may produce files of size MAX_FILE_SIZE/compress_ratio.
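To make the arithmetic concrete, here is a minimal sketch. It assumes compress_ratio means uncompressed size divided by compressed size, which is how the expression MAX_FILE_SIZE/compress_ratio reads; the function name and the numbers are hypothetical and not taken from Databend.

```python
def expected_file_size(max_file_size: int, compress_ratio: float) -> float:
    """Approximate on-disk size of one file when grouping stops at
    max_file_size bytes of *uncompressed* data.

    Assumption: compress_ratio = uncompressed_bytes / compressed_bytes.
    """
    if compress_ratio <= 0:
        raise ValueError("compress_ratio must be positive")
    return max_file_size / compress_ratio


# Hypothetical numbers: a 256 MiB target and data that compresses 4:1.
MIB = 1024 * 1024
size = expected_file_size(256 * MIB, 4.0)
print(size / MIB)  # 64.0
```

With these illustrative numbers each file lands at about a quarter of the requested size, before any further splitting across writer threads.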

The user can change the max_threads setting, but this affects the whole plan.

Compress ratio estimator

Another, automated approach is to enhance the grouping processor with a compress ratio estimator, with two caveats:

  1. The compress ratio may differ from block to block.
  2. Grouping a larger volume of blocks costs more temporary memory.
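One way such an estimator could look is sketched below. This is an illustration only, not Databend's implementation: the class, its methods, and the `memory_cap` parameter are all hypothetical. It keeps a running ratio over the files already written, scales the grouping threshold by it, and caps the result to bound the extra memory mentioned in the second point.

```python
class CompressRatioEstimator:
    """Hypothetical running estimator of uncompressed/compressed size."""

    def __init__(self) -> None:
        self.uncompressed = 0
        self.compressed = 0

    def observe(self, uncompressed_bytes: int, compressed_bytes: int) -> None:
        # Record the sizes of one file (or row group) after it is written.
        self.uncompressed += uncompressed_bytes
        self.compressed += compressed_bytes

    def ratio(self) -> float:
        # With no observations yet, assume no compression (ratio 1.0),
        # which falls back to grouping by uncompressed size.
        if self.compressed == 0:
            return 1.0
        return self.uncompressed / self.compressed

    def group_threshold(self, max_file_size: int, memory_cap: int) -> int:
        # Uncompressed bytes to buffer so the compressed file is roughly
        # max_file_size, never exceeding the memory cap.
        return min(int(max_file_size * self.ratio()), memory_cap)


est = CompressRatioEstimator()
est.observe(uncompressed_bytes=400, compressed_bytes=100)
threshold = est.group_threshold(max_file_size=100, memory_cap=1000)
```

Because the ratio is averaged over everything seen so far, a single unusual block shifts the estimate only gradually; a windowed or per-column estimate would react faster, at the cost of more bookkeeping.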

youngsofun, May 22 '24 08:05

We can use the /*+ SET_VAR(max_threads=1) */ hint to apply the setting to the copy only. Is a new option really needed?

BohuTANG, May 22 '24 09:05

max_threads=1 will slow down the whole query, including the source and the computation, not just the file writers.

youngsofun, May 22 '24 10:05