duckdb icon indicating copy to clipboard operation
duckdb copied to clipboard

[PipelineExecutor] aggregate(DISTINCT expr) not able to be parallelized

Open Tishj opened this issue 3 years ago • 0 comments

What happens?

As the title explains, when an aggregate is called with the DISTINCT flag on an expression, this can not be run in parallel (multi-threaded). Or more specifically, it can not efficiently be run in parallel, because threads need to sink their chunk into the global state right away.

This is done because normally an AggregateOperator will update its local AggregateState directly as data comes in, and then later merge this state with the global state when all values are processed.

When we would take this same approach when DISTINCT is involved, each thread would update its local AggregateState, and ensure that only distinct values are being processed.

The problem comes up when it becomes time to merge the local states with the global state, as we have no way to resolve the issues between threads.

In the example of:

select product(distinct tbl.a) from (select (i%5)+1 as a from range(10000) tbl(i)) as tbl;

Most likely all threads will have encountered all distinct values, but they will have already updated their local AggregateState, so the context of what influenced these updates would already be lost.

To Reproduce

create table tbl as select i % 5 as a from range(100000000) as tbl(i);

pragma enable_profiling;
pragma threads=1;

select count(a) from tbl;
#will run in X miliseconds
select count(distinct a) from tbl;
#will run in Y miliseconds

pragma threads=4;

select count(a) from tbl;
#will likely run 4 times as fast as X
select count(distinct a) from tbl;
#will run in the same time as Y - because it can't benefit from parallelism

OS:

MacOS

DuckDB Version:

4.0.1

DuckDB Client:

CLI

Full Name:

Thijs Bruineman

Affiliation:

DuckDB Labs

Have you tried this on the latest master branch?

  • [X] I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] I agree

Tishj avatar Aug 02 '22 14:08 Tishj