
Expanding on 'a few general directions' in Tuning workloads

bjornasm opened this issue on Apr 9, 2024 · 0 comments

In Tuning workloads under Profiling we can read:

Query plans can point to the root of performance issues. A few general directions:

Avoid nested loop joins in favor of hash joins.
A scan that does not include a filter pushdown for a filter condition that is later applied performs unnecessary IO. Try rewriting the query to apply a pushdown.
Bad join orders where the cardinality of an operator explodes to billions of tuples should be avoided at all costs.

However, there is no more information on these concepts (as far as I can see; if there is, it should be linked). And while they can be found online, I think an example or two for each point would be helpful, not least because I think it is a grey area what the user can do and what is done automatically (e.g., with joins).

Avoid nested loop joins in favor of hash joins.

How should this be done by the user? As far as I can tell, there is nothing about this in the documentation. Is it to join on equality conditions such as:

...
FROM a
INNER JOIN b
    ON a.column1 = b.column1

rather than

...
    ON a.column1 BETWEEN ... AND ... (or a subselect)
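
If I understand correctly, one can at least check which operator DuckDB ends up choosing by looking at the physical plan with EXPLAIN (here with two made-up tables a and b):

CREATE TABLE a (id INTEGER, val INTEGER);
CREATE TABLE b (id INTEGER, low INTEGER, high INTEGER);

-- An equality predicate between columns of both tables can be hashed,
-- so the plan should show a hash join operator.
EXPLAIN SELECT * FROM a INNER JOIN b ON a.id = b.id;

-- A pure inequality predicate cannot be hashed, so the plan should fall
-- back to another operator (a nested loop or merge-based join).
EXPLAIN SELECT * FROM a INNER JOIN b ON a.val BETWEEN b.low AND b.high;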

Try rewriting the query to apply a pushdown.

I understand the pushdown as avoiding overly complex computations in the (select) query, and instead transforming the underlying data so that we can filter as much as possible before doing computations.
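
As a side note, if I read the EXPLAIN output correctly, one can check whether a given condition is pushed into the scan: a pushed-down filter appears under the SEQ_SCAN operator (as a Filters entry), while one that is not pushed down shows up as a separate FILTER operator above the scan. For the people example below, that would be something like (exact output may vary between DuckDB versions):

EXPLAIN SELECT firstname, lastname
FROM people
WHERE date_part('year', birthdate) > 1989;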

Going from the following, which will perform the date_part(...) calculation on all rows each time.

SELECT firstname, lastname
FROM people
WHERE date_part('year', birthdate) > 1989;

to

-- The birthyear column has to exist before it can be filled.
ALTER TABLE people ADD COLUMN birthyear INTEGER;

UPDATE people
SET birthyear = date_part('year', birthdate);

SELECT firstname, lastname
FROM people
WHERE birthyear > 1989;
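
An alternative that avoids materializing a new column might be to rewrite the predicate so it compares the stored column directly (assuming birthdate is a DATE), since that is the kind of condition a scan can potentially push down:

-- Equivalent to date_part('year', birthdate) > 1989.
SELECT firstname, lastname
FROM people
WHERE birthdate >= DATE '1990-01-01';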

Another example, going from the following, which does the text manipulation every time it is called:

SELECT firstname, lastname
FROM people
WHERE concat(left(firstname, 1), left(lastname, 1)) = 'DB';

to

-- The initials column has to exist before it can be filled.
ALTER TABLE people ADD COLUMN initials VARCHAR;

UPDATE people
SET initials = concat(left(firstname, 1), left(lastname, 1));

SELECT firstname, lastname
FROM people
WHERE initials = 'DB';

Maybe these are neither the most convincing nor the best examples, but I guess the gist is there?
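
A related thought: instead of keeping such a derived column in sync by hand with UPDATE, DuckDB's generated columns should (if I read the documentation right) let us declare the expression once when the table is created, though a virtual generated column is still computed on read:

-- Hypothetical table with a generated column for the derived value.
CREATE TABLE people (
    firstname VARCHAR,
    lastname VARCHAR,
    initials AS (concat(left(firstname, 1), left(lastname, 1)))
);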

Maybe some sub-select statement might be better:

From

SELECT lastname, max(age)
FROM (
    -- Assuming age is a column of people; the subquery must select it
    -- for the outer max(age) to work.
    SELECT people.firstname, people.lastname, people.age
    FROM people
    INNER JOIN employees
        ON employees.firstname = people.firstname
        AND employees.lastname = people.lastname
)
WHERE firstname = 'Duck'
GROUP BY lastname;

to

SELECT lastname, max(age)
FROM (
    SELECT people.firstname, people.lastname, people.age
    FROM people
    INNER JOIN employees
        ON people.ssn = employees.ssn
    WHERE people.firstname = 'Duck'
)
GROUP BY lastname;
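
Though it may well be that DuckDB's optimizer performs this rewrite automatically; a simple way to check would be to compare the plans of the two variants and see whether the filter on firstname already sits below the join in both:

EXPLAIN SELECT lastname, max(age)
FROM (
    SELECT people.firstname, people.lastname, people.age
    FROM people
    INNER JOIN employees
        ON employees.firstname = people.firstname
        AND employees.lastname = people.lastname
)
WHERE firstname = 'Duck'
GROUP BY lastname;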

I think there must be better examples, but the main point is to provide something easy to understand.

Bad join orders where the cardinality of an operator explodes to billions of tuples should be avoided at all costs.

I'm not sure at what point we override DuckDB's query execution plan, so it is a bit hard to find an example here, but something like:

SELECT ao.animalname, observationtime, observationnotes, country, area, habitatinformation
FROM animalobservations ao
JOIN observationplaces op
    ON ao.latitude = op.latitude
    AND ao.longitude = op.longitude
JOIN birds b
    ON ao.animalname = b.animalname;

to the following, where all animalobservations are first filtered down to only birds before we join in the observationplaces table:

SELECT ao.animalname, observationtime, observationnotes, country, area, habitatinformation
FROM animalobservations ao
JOIN birds b
    ON ao.animalname = b.animalname
JOIN observationplaces op
    ON ao.latitude = op.latitude
    AND ao.longitude = op.longitude;
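
If nothing else, EXPLAIN ANALYZE should make an exploding intermediate result visible, since it reports how many tuples each operator actually produces; and since DuckDB's optimizer reorders joins itself, comparing the two variants would also show whether the written join order matters here at all:

EXPLAIN ANALYZE
SELECT ao.animalname, observationtime, observationnotes, country, area, habitatinformation
FROM animalobservations ao
JOIN birds b
    ON ao.animalname = b.animalname
JOIN observationplaces op
    ON ao.latitude = op.latitude
    AND ao.longitude = op.longitude;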

Again, it would be super interesting to get some concise and simple examples to fully bring the message across. I do not think these examples are good enough to make a pull request.
