bustub icon indicating copy to clipboard operation
bustub copied to clipboard

RFC: remove distinct executor

Open skyzh opened this issue 1 year ago • 1 comments

From my (not-so-long) experience, few database systems would have a separate distinct executor. Such queries can be easily planned into a HashAgg. For example,

https://github.com/cmu-db/bustub/blob/8ec6039d3c7129c8b7e8209b6a7fba2dc2c4dddf/test/execution/executor_test.cpp#L620-L621

Can be planned like:

HashAgg group_by=col7
  TableScan

DuckDB will make it a HashAgg.

https://github.com/duckdb/duckdb/blob/1f63c2429ff9fe246e664e72fe45641ee331c759/src/execution/physical_plan/plan_distinct.cpp#L66-L72

Postgres:

**Schema (PostgreSQL v10)**

    create table test_table (v1 int);

---

**Query #1**

    explain select distinct v1 from test_table;

| QUERY PLAN                                                         |
| ------------------------------------------------------------------ |
| HashAggregate  (cost=41.88..43.88 rows=200 width=4)                |
|   Group Key: v1                                                    |
|   ->  Seq Scan on test_table  (cost=0.00..35.50 rows=2550 width=4) |

---

[View on DB Fiddle](https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/0)

Maybe we can remove distinct executor from bustub, and refactor executor test to plan distinct as HashAgg.

Welcome to comment!

skyzh avatar Jul 28 '22 15:07 skyzh

This seems reasonable to me. I think we also did something like this in the research codebases (NoisePage, Peloton); the distinct executor is probably BusTub specific / bolted-on. I think the only "disadvantage" of your change is that it might become conceptually more complicated for a student to implement, but I think it is fine + as you said, nobody really implements distinct like this.

I would check that we have working HashAgg (or infrastructure to implement a working HashAgg, if this is a student project) before making the change though.

lmwnshn avatar Jul 28 '22 21:07 lmwnshn