clickhouse-docs
`max_download_buffer_size` docs seem to contradict setting description
https://clickhouse.com/docs/en/integrations/s3#read--writes
max_download_buffer_size. Files will only be downloaded in parallel if their size is greater than the total buffer size combined across all threads
https://clickhouse.com/codebrowser/ClickHouse/src/Core/Settings.h.html#DB::SettingsTraits::Data::max_download_threads
M(UInt64, max_download_buffer_size, 10*1024*1024, "The maximal size of buffer for parallel downloading (e.g. for URL engine) per each thread.", 0)
I assume the latter is correct. The former is worded strangely and is confusing.
There are also no direct docs for this setting, or for the paired max_download_threads setting.
Yes, it's confusing and a bit wrong.
Primarily it is used as the buffer size for each downloading thread, as described in the setting description, but we don't use a parallel read if the file size is less than 2 * max_download_buffer_size.
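For reference, both settings and their defaults can be inspected at runtime via the system.settings table; with the default 10 MiB buffer, the implied threshold for splitting a single file into parallel ranges would be 2 * 10 MiB = 20 MiB:

```sql
-- Inspect the current values and descriptions of the two settings.
-- With the default max_download_buffer_size of 10485760 (10 MiB),
-- a file smaller than 2 * 10 MiB = 20 MiB is read in a single stream.
SELECT name, value, description
FROM system.settings
WHERE name IN ('max_download_threads', 'max_download_buffer_size');
```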
So performance-wise, is it best to disable it when reading from S3?
You can disable it with max_download_threads, and that should be the primary setting to control this behaviour. Using max_download_buffer_size to disable the feature seems a bit too abstract; it's just extra control when max_download_threads > 1.
Do you have any reason to disable it?
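Disabling it via max_download_threads, as suggested above, could look like the following sketch (the bucket URL and path are placeholders):

```sql
-- Sketch: force single-stream downloads for one query by setting
-- max_download_threads = 1; max_download_buffer_size then has no effect.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/data/large.parquet')
SETTINGS max_download_threads = 1;
```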
Well, I'd want to set the thread count high so S3 downloads run with high concurrency, but my understanding is that if the files are too small it won't use all the threads. Or is the parallel read per file?
If you are reading multiple files (e.g. by using a glob pattern), they will be read in parallel, and that is not connected to the settings you mentioned (it will try to use as many threads as possible).
Now assume you defined the files using a glob pattern and multiple threads are processing those files. Each thread can use a parallel read buffer, which will spawn additional threads for reading, controlled by the settings you mentioned (max_download_*).
I wouldn't set max_download_buffer_size too small, because downloading small files in parallel can have a negative effect (there is overhead in making multiple requests instead of a single one).
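Putting the two levels together, a sketch (placeholder bucket and paths) of a glob read where file-level parallelism and per-file ranged downloads combine:

```sql
-- Sketch: the glob fans out across files (file-level parallelism, governed
-- by the usual thread settings), while each sufficiently large file may be
-- fetched in up to max_download_threads parallel ranges, each using a
-- max_download_buffer_size buffer (20 MiB here, so the split threshold
-- is 40 MiB per file).
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/logs/*.csv.gz', 'CSVWithNames')
SETTINGS max_download_threads = 4,
         max_download_buffer_size = 20971520;
```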
This is now addressed by http://localhost:3000/docs/en/integrations/s3/performance#using-threads-for-reads. I'll clean up the old docs and open a PR.
https://github.com/ClickHouse/clickhouse-docs/pull/2882