
it seems max_file_size does not work

BohuTANG opened this issue on Jan 19, 2024 · 3 comments

It seems max_file_size does not work:
[Screenshot: query result, Jan 18, 2024, 12:26 PM]

Same results on DBeaver. @youngsofun

Originally posted by @soyeric128 in https://github.com/datafuselabs/databend-docs/issues/398#issuecomment-1898918694

BohuTANG · Jan 19 '24

max_file_size is not guaranteed; the same is true of Snowflake.

  1. We need parallel processing for speed, and we also need to avoid creating files that are too small as a by-product of that parallelism. This goal somewhat conflicts with max_file_size.
  2. For text files, the compressed size cannot be known in advance.
  3. Parquet has format overhead (a file cannot be as small as 10 bytes), and the size of its data part cannot be known in advance.

There is still work to do to improve this, but max_file_size=10 is not possible for Parquet; see the sketch below.
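
For reference, here is a minimal sketch of the kind of unload statement under discussion. The stage and table names (my_stage, t1) are hypothetical, and MAX_FILE_SIZE is treated as a soft per-file target in bytes, per the points above:

```sql
-- Hypothetical stage and table names. MAX_FILE_SIZE is a soft target:
-- parallel writers and format overhead mean actual files can come out
-- smaller or slightly larger than this value.
COPY INTO @my_stage
FROM t1
FILE_FORMAT = (TYPE = PARQUET)
MAX_FILE_SIZE = 67108864; -- ~64 MB; a tiny value such as 10 cannot be honored for Parquet
```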

youngsofun · Jan 19 '24

Thank you. Do we have a recommended way to use max_file_size at this time? Did you mean that there is a minimum value limit for max_file_size for each of the supported formats?

soyeric128 · Jan 19 '24

@soyeric128

No need for that, I think. In practice, it's not cost-effective to produce files that are too small; no one would really do this except for testing. Setting a minimum value would be fine for users, but it would make our own tests harder. For example, our SQL logic tests use CSV with max_file_size=1, which yields one row per file, as sketched below.
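
A minimal sketch of that test pattern, assuming a hypothetical table t1 and stage my_stage:

```sql
-- Hypothetical names. With a 1-byte target, every row already exceeds
-- the limit once written, so the writer rolls over to a new file per row.
COPY INTO @my_stage
FROM t1
FILE_FORMAT = (TYPE = CSV)
MAX_FILE_SIZE = 1;
```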

Users need to know:

  1. We try to limit each file to less than max_file_size, but this is not guaranteed; a file may exceed it slightly, so users should not depend on an exact limit.
  2. We should choose a proper default size for each format so that users rarely need to set it. Snowflake's default is 16 MB, which is fine for text files but a little small for Parquet. Our default is 256 MB in code (64 MB in the docs), which is too large for compressed text files; I will adjust it later. (A quick way to check the actual output sizes is sketched after this list.)
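
One way to see how close the output lands to the target is to list the stage after unloading (stage name hypothetical):

```sql
-- LIST shows each unloaded file with its size, so the effect of
-- MAX_FILE_SIZE can be verified directly.
LIST @my_stage;
```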

youngsofun · Jan 19 '24

Fixed by https://github.com/datafuselabs/databend/pull/15596.

youngsofun · May 30 '24