
it seems max_file_size does not work

BohuTANG opened this issue on Jan 19, 2024 · 3 comments

It seems max_file_size does not work:
[Screenshot: query result, Jan 18, 2024, 12:26 PM]

Same results on DBeaver. @youngsofun

Originally posted by @soyeric128 in https://github.com/datafuselabs/databend-docs/issues/398#issuecomment-1898918694

BohuTANG · Jan 19 '24

max_file_size is not guaranteed; the same is true of Snowflake.

  1. We need parallel processing for speed, and we also need to avoid creating files that are too small as a by-product of that parallelism. This goal somewhat conflicts with max_file_size.
  2. For text files, the compressed size cannot be known in advance.
  3. Parquet has format overhead (a file cannot be as small as 10 bytes), and the size of its data part cannot be known in advance.

There is still work to do to improve this, but max_file_size=10 is not possible for Parquet; see the sketch below.
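
For reference, here is a minimal sketch of the kind of unload statement under discussion. The stage and table names (my_stage, t1) are hypothetical, and MAX_FILE_SIZE is treated as a soft per-file target in bytes, per the points above:

```sql
-- Hypothetical stage and table names. MAX_FILE_SIZE is a soft target:
-- parallel writers and format overhead mean actual files can come out
-- smaller or slightly larger than this value.
COPY INTO @my_stage
FROM t1
FILE_FORMAT = (TYPE = PARQUET)
MAX_FILE_SIZE = 67108864; -- ~64 MB; a tiny value such as 10 cannot be honored for Parquet
```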

youngsofun · Jan 19 '24

Thank you. Do we have a recommended way to use max_file_size at this time? Did you mean that there is a minimum value limit for max_file_size for each of the supported formats?

soyeric128 · Jan 19 '24

@soyeric128

No need for that, I think. In practice, it's not cost-effective to produce files that are too small; no one would really do this except for testing. Setting a minimum value would be fine for users, but it would make our own tests harder. For example, our SQL logic tests use CSV with max_file_size=1, which yields one row per file, as sketched below.
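
A minimal sketch of that test pattern, assuming a hypothetical table t1 and stage my_stage:

```sql
-- Hypothetical names. With a 1-byte target, every row already exceeds
-- the limit once written, so the writer rolls over to a new file per row.
COPY INTO @my_stage
FROM t1
FILE_FORMAT = (TYPE = CSV)
MAX_FILE_SIZE = 1;
```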

Users need to know:

  1. We try to limit each file to less than max_file_size, but this is not guaranteed; a file may exceed it slightly, so users should not depend on an exact limit.
  2. We should choose a proper default size for each format so that users rarely need to set it. Snowflake's default is 16 MB, which is fine for text files but a little small for Parquet. Our default is 256 MB in code (64 MB in the docs), which is too large for compressed text files; I will adjust it later. (A quick way to check the actual output sizes is sketched after this list.)
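
One way to see how close the output lands to the target is to list the stage after unloading (stage name hypothetical):

```sql
-- LIST shows each unloaded file with its size, so the effect of
-- MAX_FILE_SIZE can be verified directly.
LIST @my_stage;
```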

youngsofun · Jan 19 '24

Fixed by https://github.com/datafuselabs/databend/pull/15596.

youngsofun · May 30 '24