Type checking
TODO:
- [ ] basic tests
- [ ] tests for function curries and pipelines
- [ ] replace `join`'s `side:` argument with `nulls_left:` and `nulls_right:`
- [x] figure out how to reference tables
- [x] move transform casting after type and name resolving
TODO:
- [x] name resolving for double references in join (JOIN USING must reference two tables)
- [x] fix `wrap_into_window`
- [x] clear scope at select & aggregate
This comment will be me thinking out loud on how to internally structure the compiler.
In my previous big commits, I've added a `declared_at` field to the `Node` struct. It holds an integer that serves as an id of the declaration this variable references. Declarations are stored in `context.declarations`, which is an array indexed by the id. Declarations can be expressions (most common), `extern_ref`s (columns from actual tables), tables (for table names) or functions.

Because SQL has interesting rules on which variables can contain aliases and which cannot (i.e. WHERE cannot and ORDER BY must), each variable has to be able to be materialized either as an alias or as an expression, depending on the materialization location.

This is why the name resolver (the first stage of semantic analysis) moves all expressions from the AST to `context.declarations` and leaves a simple ident in the AST. Even unnamed columns leave `Ident("<unnamed>")` behind. This way, when materializing, one can pull declarations from a single place.
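Roughly, the shape is something like this (a minimal sketch with hypothetical, simplified names; the actual prql-compiler structs differ):

```rust
// Sketch of the declarations table described above. All names here are
// illustrative stand-ins, not the real prql-compiler definitions.

struct Node; // stands in for the real AST node
struct FuncDef; // stands in for a function definition

enum Declaration {
    Expression(Node),                           // most common
    ExternRef { table: usize, column: String }, // column of an actual table
    Table(String),                              // a table name
    Function(FuncDef),                          // a function
}

#[derive(Default)]
struct Context {
    // Indexed by the integer id stored in a node's `declared_at` field.
    declarations: Vec<Declaration>,
}

impl Context {
    // Move an expression out of the AST and return its id; the AST keeps
    // only an ident whose `declared_at` points back here.
    fn declare(&mut self, expr: Node) -> usize {
        self.declarations.push(Declaration::Expression(expr));
        self.declarations.len() - 1
    }
}
```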
Now, I've split the resolver into 3 stages: name resolver, type checker and transform constructor. This way, we can treat (and type check) transforms just like ordinary functions. The problem is that now they must adhere exactly to the rules for ordinary functions. This becomes problematic in a few cases:
1. `join side:left`, where `left` would be resolved as a normal variable (which would reference a column). I've solved this by adding the type `<unresolved>`, which prevents this parameter from being resolved; see the sketch after this list.
2. `join employees`, where `employees` should be resolved in the namespace of tables. I've solved this by resolving all `<table>` types in their own namespace.
3. Name resolving depends on transforms: for example, `select` and `aggregate` clear the scope of the frame columns. But functions are identified as transforms only in stage 3. This cannot be done earlier, because they may be missing parameters due to currying and such. I currently don't have a solution to this.
4. When moving expressions into `context.declarations`, they lose information about the location of their declaration. This is needed to wrap functions that return a column within `select` or `derive` into `Windowed`. I don't have a solution to this.
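For the first point, the idea is roughly this (a hypothetical sketch; the compiler's actual type representation differs):

```rust
// Sketch: parameters typed with an `<unresolved>` marker are skipped by
// name resolution, so `left` in `join side:left` is never looked up as a
// column. All names here are illustrative.
use std::collections::HashMap;

enum Ty {
    Unresolved, // the `<unresolved>` marker: leave the ident untouched
    Table,      // resolved in the table namespace
    Column,     // resolved among the current frame's columns
}

struct Scope {
    tables: HashMap<String, usize>,  // table name -> declaration id
    columns: HashMap<String, usize>, // column name -> declaration id
}

// Returns the declaration id for the ident, or None when resolution is
// deliberately skipped (or the name is unknown).
fn resolve_arg(param_ty: &Ty, ident: &str, scope: &Scope) -> Option<usize> {
    match param_ty {
        Ty::Unresolved => None, // e.g. `left` in `join side:left`
        Ty::Table => scope.tables.get(ident).copied(),
        Ty::Column => scope.columns.get(ident).copied(),
    }
}
```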
One approach that would solve 4. would be to keep all expressions in the AST and resolve them as soon as possible, maybe during type checking. This would also simplify some parts of the code base, because there would be no need to traverse both the AST and the declarations anymore.
All that said, I'll go and sleep on it before going down into another major refactor.
@max-sixty, is this a valid query?

```
derive a = 5
```
If we want to keep the promises of transforms, I'd say it is not, because `derive` only adds a column to a table. And since there is no `from`, `derive` is missing its last argument, which means that it has a type of `func frame -> frame` and thus cannot be converted to a SQL query; see the sketch below.
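To spell that out (a hypothetical sketch of curried function types, not the compiler's actual representation):

```rust
// Sketch: applying a curried function to fewer arguments than it has
// parameters yields another function type, not a frame.
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    Frame,
    Literal,
    Func(Box<Ty>, Box<Ty>), // parameter type -> return type
}

// Apply one argument, returning the remaining type.
fn apply(func: Ty) -> Option<Ty> {
    match func {
        Ty::Func(_param, ret) => Some(*ret),
        _ => None,
    }
}

fn main() {
    // derive: literal -> frame -> frame
    let derive = Ty::Func(
        Box::new(Ty::Literal),
        Box::new(Ty::Func(Box::new(Ty::Frame), Box::new(Ty::Frame))),
    );
    // `derive a = 5` supplies only the literal...
    let remaining = apply(derive).unwrap();
    // ...leaving `func frame -> frame`: a function, not a frame, so it
    // cannot be materialized as a SQL query on its own.
    assert_eq!(
        remaining,
        Ty::Func(Box::new(Ty::Frame), Box::new(Ty::Frame))
    );
}
```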
But in that case we should provide a syntax for table literals: #286
> @max-sixty, is this a valid query?
>
> ```
> derive a = 5
> ```
I could see two options:

- We start with a null tablespace / namespace, and can add columns like that; e.g. an implicit `from null` (which gets cleared by `from foo`? Or maybe `derive` has an alternative meaning when it doesn't follow a `from`?)
- We use #286 to do this
I do think `derive a = 5` is quite a natural, cohesive way of creating a column from nothing. It's also consistent with SQL's `SELECT 5 as a`. But I'd be fine giving this up if that makes the overall thing more cohesive. We can also try it and iterate; we're still early enough to do that, I think.
My concern is that `derive` only adds columns, not rows.

```
from null      # empty table: 0 columns, 0 rows?
derive a = 5   # 1 column, 0 rows?
```

If `derive` has special behavior without its last parameter, we need default values for positional arguments (which breaks other things).

So I'd stick with option 2.
> My concern is that `derive` only adds columns, not rows.

Yes, good point. Agree with option 2!
Meta point: I think it's good that we're starting to hit these issues, and it's good that we're resolving them in a conceptually correct way, even if that brings small backward-incompatible changes or deviations from SQL. It's forcing us to confront the SQL warts we didn't even know about!
I agree very much with that: one should be able to scan half of the transforms and then be able to read PRQL and anticipate the result of a query.
70 tests pass, 17 fail.
@max-sixty: the semantic branch should now be quite stable, in the sense that most of my major changes have been applied. I still have plans to change up the materializer (which will also be renamed to anchorer or something). I also have plans for the IR to ditch `Declarations`, so after that I hope it will be easier to debug.

I also disabled the duckdb integration tests; I got tired of waiting for duckdb to compile. So maybe base off commit d3bc7e0 instead, if you don't want that.
> @max-sixty: the semantic branch should now be quite stable, in the sense that most of my major changes have been applied. I still have plans to change up the materializer (which will also be renamed to anchorer or something). I also have plans for the IR to ditch `Declarations`, so after that I hope it will be easier to debug.
Awesome, I will switch over to it.
> I also disabled the duckdb integration tests; I got tired of waiting for duckdb to compile. So maybe base off commit d3bc7e0 instead, if you don't want that.
In my core iteration loop (at the bottom of https://github.com/prql/prql/blob/main/CONTRIBUTING.md), I include `-p prql-compiler --lib`, which only runs unit tests. Does that help? (Of course, no stress about disabling; only suggesting it as it may be easier and mean you don't have to diverge from main.)
I was doing that a lot, but `cargo build` and rust-analyzer still had long compilation times. For some reason, compiling duckdb took way too long and sometimes used >5 GB of RAM. Not sure why.

So I just keep rebasing the last commit to the top of my chain so I can remove it before the merge. That's why I gave you the notice, so you won't include it in your commit chain.
> I was doing that a lot, but `cargo build` and rust-analyzer still had long compilation times. For some reason, compiling duckdb took way too long and sometimes used >5 GB of RAM. Not sure why.
FYI you're correct that it does compile all dev-dependencies when running tests. But on my system I can't seem to make it recompile once it's been compiled once. If you can reproduce, I'm very happy to either fix it or move the integration tests back into a separate crate.
As discussed, this branch (and this PR) will now turn into a feature branch that we merge into. The plan is to finish the work that I started and to onboard @max-sixty, @priithaamer and anyone else who wants to learn our codebase.
I promise not to rebase this branch anymore. We will also refrain from merging changes from the main branch, so it will be easier to rebase/squash back to the main branch at the end.
Currently, many tests are failing: some produce errors/panics and some don't match the snapshots. Make sure not to accept any failing insta snapshots, because they (mostly) fail for a reason, and we need to fix the compiler, not the tests. So we will be merging commits that don't pass CI tests, but please run `cargo clippy` and `cargo fmt` before merging.
There are some code blocks commented out, mostly dealing with windowing. Don't delete them; we will still need that code.
To get started, I suggest reading ARCHITECTURE.md (on this branch, I updated it a little) and the two book pages I wrote on the `book-semantics` branch.
A few quick hints:

- We have 2 different `Expr` structs and 2 different `ExprKind` enums: one in the `ast` module and one in the `ir` module.
- We have 2 different `Context` structs: one for the semantic context and one for the SQL context. There is also `AnchorContext` within the SQL context. At this point I use the word "context" for any container that holds additional information along with the AST. It's hard to name things.
- We have 3 types of ids: semantic ids (usually named just `id` or `declared_at`), `ir::CId` (column id) and `ir::TId` (table id). See the sketch after this list.
- AST means Abstract Syntax Tree, and we have two of them. One is confusingly named `ast` because I cannot find a better name.
- If duckdb is taking too long to compile, cherry-pick a commit on branch `aljaz` that disables it. Just make sure not to merge it into the `semantic` branch.
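The three id kinds can be pictured as simple newtypes (illustrative definitions only; the real ones live in the compiler's modules):

```rust
// Sketch of the three id kinds as distinct newtypes, so they cannot be
// accidentally mixed up. Hypothetical definitions, not the real ones.

/// Semantic id: an index into the declarations table
/// (usually stored in `id` / `declared_at`).
pub struct Id(pub usize);

pub mod ir {
    /// Column id, unique across the query's IR.
    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    pub struct CId(pub usize);

    /// Table id.
    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    pub struct TId(pub usize);
}
```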
I'll be adding an `onboarding` tag to issues that have a prepared task for this branch. I suggest setting an assignee to avoid multiple people working on the same thing.
@aljazerzen how do you want to handle merging `main` in here? I'm guessing merge difficulty is super-linear in size: doing lots of small merges is easier than one big merge.
Doing https://github.com/prql/prql/issues/1048 would be useful for this, so "all existing tests pass" can be a marker for the merge being successful.
I don't think we need to merge main in here. Is there something there that you need? If we don't merge, it will be easier to rebase/squash at the end, when we are done with semantic.
I did the full merge and ignored failing tests.
Let's try and keep the changes here as scoped as possible. I appreciate that the place we're working in is an easy place to make a quick change, but it does make merges harder, and it makes reviews much harder. Some changes do need a huge refactor, but some changes don't; for anything like editing a Readme / a `.gitignore` / changing names of structs like `Interval`, let's try not to do those here, since then we get conflicts with any changes in `main` in those areas.

...unless we decide to literally abandon `main`, but I'm not sure we're there yet.
Thank you!
> If we don't merge, it will be easier to rebase/squash at the end, when we are done with semantic.
Sorry @aljazerzen, I just saw this and the message above, and as you can see I did the merge yesterday.

But how is it the case that it would be easier to merge at the end than as we go? My claim is that it's much harder to merge bigger conflicts than lots of small ones.

Unfortunately, I also may have made this mistake by not seeing your message in time:

> Make sure not to accept any failing insta snapshots, because they (mostly) fail for a reason, and we need to fix the compiler, not the tests.

I can probably revert these; do you think that would be worthwhile?
> My concern is that `derive` only adds columns, not rows.

FWIW, `derive` & `select` can already add rows when doing stuff such as `s"json_array_elements()"`
@max-sixty @MarinPostma I'm force-pushing `semantic-rebased` to `semantic`. You'll have to hard-reset your local branches and rebase/cherry-pick your unpushed commits onto the new commits.
> > My concern is that `derive` only adds columns, not rows.
>
> FWIW, `derive` & `select` can already add rows when doing stuff such as `s"json_array_elements()"`

@mklopets Wait, what? How can `json_array_elements()` be used in `select`?
> @mklopets Wait, what? How can `json_array_elements()` be used in `select`?
@aljazerzen sorry, I missed your reply! Here's a Postgres example. In case that was a source of confusion: this does pass `json_array_elements` an argument, which I omitted from my previous comment for simplicity.

```sql
select json_array_elements(arr), something from (
  select '[1,2,3,4]'::json as arr, 'test' as something
) data;
```
The subquery returns just one row:

| arr | something |
|---|---|
| [1,2,3,4] | test |
Adding `json_array_elements` turns 1 row into 4 in this case. This is the full query's result:

| json_array_elements | something |
|---|---|
| 1 | test |
| 2 | test |
| 3 | test |
| 4 | test |
And, through s-strings, this could be done with both `select` and `derive` operations.
> @mklopets: Adding `json_array_elements` turns 1 row into 4 in this case. This is the full query's result:
I see. It has happened a few times now that I've started thinking that I understand SQL in depth, but things like this then surprise me time and again! It gives the impression that I don't understand the underlying model that SQL is working with.
In this case, the "range function" `json_array_elements`, which is normally used in place of a range expression (in FROM or JOIN), is used in the SELECT projection. This creates a column from the result of the function call and broadcasts the values of the other columns. I have not seen this behavior before.

The good news is that this is doable only with s-strings and would not fit in our stdlib.