bustub
bustub copied to clipboard
RFC: remove distinct executor
From my (not-so-long) experience, few database systems would have a separate distinct executor. Such queries can be easily planned into a HashAgg. For example,
https://github.com/cmu-db/bustub/blob/8ec6039d3c7129c8b7e8209b6a7fba2dc2c4dddf/test/execution/executor_test.cpp#L620-L621
Can be planned like:
HashAgg group_by=col7
TableScan
DuckDB will make it a HashAgg.
https://github.com/duckdb/duckdb/blob/1f63c2429ff9fe246e664e72fe45641ee331c759/src/execution/physical_plan/plan_distinct.cpp#L66-L72
Postgres:
**Schema (PostgreSQL v10)**
create table test_table (v1 int);
---
**Query #1**
explain select distinct v1 from test_table;
| QUERY PLAN |
| ------------------------------------------------------------------ |
| HashAggregate (cost=41.88..43.88 rows=200 width=4) |
| Group Key: v1 |
| -> Seq Scan on test_table (cost=0.00..35.50 rows=2550 width=4) |
---
[View on DB Fiddle](https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/0)
Maybe we can remove distinct executor from bustub, and refactor executor test to plan distinct as HashAgg.
Welcome to comment!
This seems reasonable to me. I think we also did something like this in the research codebases (NoisePage, Peloton); the distinct executor is probably BusTub specific / bolted-on. I think the only "disadvantage" of your change is that it might become conceptually more complicated for a student to implement, but I think it is fine + as you said, nobody really implements distinct like this.
I would check that we have working HashAgg (or infrastructure to implement a working HashAgg, if this is a student project) before making the change though.