clickhouse-docs
`max_download_buffer_size` docs seem to contradict setting description
https://clickhouse.com/docs/en/integrations/s3#read--writes
max_download_buffer_size. Files will only be downloaded in parallel if their size is greater than the total buffer size combined across all threads
https://clickhouse.com/codebrowser/ClickHouse/src/Core/Settings.h.html#DB::SettingsTraits::Data::max_download_threads
M(UInt64, max_download_buffer_size, 10*1024*1024, "The maximal size of buffer for parallel downloading (e.g. for URL engine) per each thread.", 0)
I assume the latter is correct. The former is worded strangely and is confusing.
There are also no direct docs for this setting, or for the paired max_download_threads setting.
Yes, it's confusing and a bit wrong.
Primarily it is used as the buffer size for each downloading thread, as described in the setting description, but we don't use a parallel read if the file size is less than 2 * max_download_buffer_size.
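For reference, both settings and their defaults can be inspected at runtime via the system.settings table; with the default 10 MiB buffer, the implied threshold for splitting a single file into parallel ranges would be 2 * 10 MiB = 20 MiB:

```sql
-- Inspect the current values and descriptions of the two settings.
-- With the default max_download_buffer_size of 10485760 (10 MiB),
-- a file smaller than 2 * 10 MiB = 20 MiB is read in a single stream.
SELECT name, value, description
FROM system.settings
WHERE name IN ('max_download_threads', 'max_download_buffer_size');
```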
So performance-wise, is it best to disable it when reading from S3?
You can disable it with max_download_threads, and that should be the primary setting to control this behaviour. Using max_download_buffer_size to disable the feature seems a bit too abstract; it's just extra control when max_download_threads > 1.
Do you have any reason to disable it?
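Disabling it via max_download_threads, as suggested above, could look like the following sketch (the bucket URL and path are placeholders):

```sql
-- Sketch: force single-stream downloads for one query by setting
-- max_download_threads = 1; max_download_buffer_size then has no effect.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/data/large.parquet')
SETTINGS max_download_threads = 1;
```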
Well, I'd want to set the thread count high so S3 downloads run with high concurrency, but my understanding is that if the files are too small it won't use all the threads. Or is the parallel read per file?
If you are reading multiple files (e.g. by using a glob pattern), they will be read in parallel, and that is not connected to the settings you mentioned (it will try to use as many threads as possible).
Now assume you defined the files using a glob pattern and multiple threads are processing those files. Each thread can use a parallel read buffer, which will spawn additional threads for reading, controlled by the settings you mentioned (max_download_*).
I wouldn't set max_download_buffer_size too small, because downloading small files in parallel can have a negative effect (there is overhead in making multiple requests instead of a single one).
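Putting the two levels together, a sketch (placeholder bucket and paths) of a glob read where file-level parallelism and per-file ranged downloads combine:

```sql
-- Sketch: the glob fans out across files (file-level parallelism, governed
-- by the usual thread settings), while each sufficiently large file may be
-- fetched in up to max_download_threads parallel ranges, each using a
-- max_download_buffer_size buffer (20 MiB here, so the split threshold
-- is 40 MiB per file).
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/logs/*.csv.gz', 'CSVWithNames')
SETTINGS max_download_threads = 4,
         max_download_buffer_size = 20971520;
```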
This is now addressed by http://localhost:3000/docs/en/integrations/s3/performance#using-threads-for-reads. I'll clean up the old docs and open a PR.
https://github.com/ClickHouse/clickhouse-docs/pull/2882