it seems max_file_size does not work:
Same results on DBeaver. @youngsofun
Originally posted by @soyeric128 in https://github.com/datafuselabs/databend-docs/issues/398#issuecomment-1898918694
max_file_size is not guaranteed, and the same is true in Snowflake.
- We need parallel processing for speed, and we also need to avoid creating files that are too small because of parallel processing. This goal somewhat conflicts with max_file_size.
- For text files, what counts is the compressed size.
- Parquet has format overhead (a file cannot be as small as 10 bytes), and the size of its data part cannot be known in advance.
There is still work to do to improve this, but max_file_size=10 is not possible for Parquet.
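To make the conflict between parallelism and max_file_size concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not Databend's actual writer: `file_sizes` is a hypothetical helper, the data is assumed to be split evenly across workers, and each worker is assumed to cap its own files at max_file_size.

```python
def file_sizes(total_bytes, workers, max_file_size):
    """Toy model (assumption, not Databend code): each worker gets an equal
    share of the data and cuts its share into files of at most max_file_size."""
    share, extra = divmod(total_bytes, workers)
    sizes = []
    for w in range(workers):
        # Spread the remainder over the first `extra` workers.
        remaining = share + (1 if w < extra else 0)
        while remaining > 0:
            chunk = min(remaining, max_file_size)
            sizes.append(chunk)
            remaining -= chunk
    return sizes


single = file_sizes(100, workers=1, max_file_size=30)
parallel = file_sizes(100, workers=8, max_file_size=30)
print(single)    # one writer: files fill up to the limit, one short tail
print(parallel)  # eight writers: every file is far below the limit
```

In this model a single writer produces mostly full files, while eight writers produce eight small files for the same data, which is the small-file problem mentioned above.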
Thank you. Do we have a recommended way to use max_file_size at this time? Did you mean there is a minimum value limit for each of the supported formats for max_file_size?
@soyeric128
No need for that, I think. In practice it is not cost-effective to have files that are too small; no one would really do this except for testing. Setting a minimum value would be fine for users, but it would make our own tests harder, e.g. our SQL logic tests use CSV with max_file_size=1, which yields one row per file.
Users need to know:
- We try to keep each file smaller than max_file_size, but this is not guaranteed. A file may exceed the limit slightly, so users should not depend on it.
- We should choose a proper default size for each format, so users do not need to set it most of the time. Snowflake's default is 16M, which is fine for text files but a little small for Parquet files. Our default is 256M in code (64M in the docs), which is too large for compressed text files. I will adjust it later.
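A minimal Python sketch of why a best-effort limit can be exceeded a little, and why max_file_size=1 ends up as one row per file for CSV. This is an illustration under assumptions, not Databend's implementation: `split_rows` is a hypothetical helper that closes a file only after the row that reaches the limit has been written.

```python
def split_rows(rows, max_file_size):
    """Group encoded rows into files. A file is closed once its size reaches
    max_file_size, so the last row written may push it over the limit.
    (Hypothetical helper for illustration only.)"""
    files, current, size = [], [], 0
    for row in rows:
        current.append(row)
        size += len(row)
        if size >= max_file_size:
            files.append(current)
            current, size = [], 0
    if current:
        # Whatever is left becomes a final, possibly small, file.
        files.append(current)
    return files


rows = [b"1,alice\n", b"2,bob\n", b"3,carol\n"]  # 8, 6 and 8 bytes

one_per_file = split_rows(rows, max_file_size=1)
overshoot = split_rows(rows, max_file_size=10)

print(len(one_per_file))                            # number of files with limit 1
print([sum(len(r) for r in f) for f in overshoot])  # file sizes with limit 10
```

With a limit of 10 bytes the first file ends up at 14 bytes, because rows are not split across files in this sketch. Real writers also have compression and format overhead on top of this, so the limit should be treated as a target, not a hard cap.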
fixed by https://github.com/datafuselabs/databend/pull/15596