
ALTER TABLE ... EXECUTE optimize (file_size_threshold) does not work when writer count is large

Open alec-heif opened this issue 2 years ago • 4 comments

I was playing around with ALTER TABLE ... EXECUTE optimize and noticed that it does not seem to work correctly when the number of workers is larger than the number of output files one would expect this command to produce.

Consider a cluster of 80 machines and a dataset of 1000 files, each around 5MB.

Running ALTER TABLE table EXECUTE optimize(file_size_threshold => '32MB') does do something that seems correct: it reads all the data from the 1000 files and produces 80 files, all of which appear to be larger than 32MB. However, I believe this is a happy accident: 80 concurrent writers happen to produce a roughly ideal size (5000MB / 80 = ~60MB per file). I suspect that 80 files are always produced, even when that is not the correct behavior.

Observe that running with ALTER TABLE table EXECUTE optimize(file_size_threshold => '128MB') does not do the right thing - it still results in 80 files, all of which are less than 128MB.

The problem is even worse when setting a custom task_writer_count: if task_writer_count is set to 16, then a total of 1280 (80 * 16) files are produced, all of which are well under even 32MB.
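For reference, a minimal sketch of the reproduction described above (the table name is hypothetical; task_writer_count and the optimize procedure are the ones already mentioned in this issue):

```sql
-- Reproduction sketch; my_table is a hypothetical table name.
-- On an 80-worker cluster with 16 writers per task, this produced
-- 80 * 16 = 1280 output files, all well under the requested threshold.
SET SESSION task_writer_count = 16;
ALTER TABLE my_table EXECUTE optimize(file_size_threshold => '32MB');
```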

To me it seems like the number of writers should not be set to a constant when running this command. Is this a bug, or is there something about its behavior that I am misunderstanding?

alec-heif commented Mar 25 '22 17:03

file_size_threshold is a selector for which files to process. I agree we would also want this (or a similar config) to determine the desired output file size. For this, we need to:

  • use writer scaling https://github.com/trinodb/trino/pull/10614
  • make writer scaling work with repartitioned data: https://github.com/trinodb/trino/issues/10791
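To spell out the "selector" semantics mentioned above (a sketch; the table name is hypothetical): the threshold only decides which source files are picked up for compaction, it says nothing about how large the rewritten files will be.

```sql
-- Only source files smaller than 128MB are selected for rewriting;
-- the threshold does not constrain the size of the output files.
ALTER TABLE my_table EXECUTE optimize(file_size_threshold => '128MB');
```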

cc @losipiuk @sopel39 @alexjo2144 @radek-starburst @homar

findepi commented Mar 28 '22 08:03

@findepi doesn't Trino have a way of knowing approximately how much data it is about to rewrite during an OPTIMIZE? It kind of seems like writer scaling is a bit of a naive algorithm here, right?

My understanding of writer scaling is that it's good when you don't know how much data you will ultimately end up writing, so you do this 1-2-4-8-16-etc. scaling to end up with something roughly reasonable. But in this case we know exactly what data we are about to rewrite, both in terms of "Physical Input" bytes of the original Parquet files and in terms of "Logical Input" bytes and rows. Surely this info can let us do something smarter than always having to start from a single writer, right?

alec-heif commented Mar 28 '22 19:03

Although I guess there is a complication: we probably start writing before we are done reading. Still, I feel like the writer scaling algorithm could be tweaked for this use case in a way that would do much better than the truly naive one. Or maybe we sidestep this by letting the user specify an initial number of writers and a per-iteration scaling factor themselves?

alec-heif commented Mar 28 '22 19:03

Now that writer scaling is enabled by default, does this behave better?

"the problem is even worse when setting a custom task_writer_count: if task_writer_count is set to 16, then a total of 1280 (80*16) files are produced, all of which are well under even 32MB."

That problem could be tackled by manually setting writer_min_size to task_writer_count * 32 MB.
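A minimal sketch of that workaround, assuming the 80-worker / task_writer_count = 16 setup from the issue description (my_table is a hypothetical table name):

```sql
-- Per the suggestion above: scale writer_min_size by task_writer_count
-- so each of the 16 writers in a task still targets at least ~32MB.
-- 16 * 32MB = 512MB.
SET SESSION task_writer_count = 16;
SET SESSION writer_min_size = '512MB';
ALTER TABLE my_table EXECUTE optimize(file_size_threshold => '32MB');
```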

cc: @sopel39 @gaurav8297 @electrum

raunaqmorarka commented Aug 10 '22 05:08