Type checking
TODO:
- [ ] basic tests
- [ ] tests for function curries and pipelines
- [ ] replace `join`'s `side:` argument with `nulls_left:` and `nulls_right:`
- [x] figure out how to reference tables
- [x] move transform casting after type and name resolving
TODO:
- [x] name resolving for double references in join (JOIN USING must reference two tables)
- [x] fix `wrap_into_window`
- [x] clear scope at select & aggregate
This comment will be me thinking out loud on how to internally structure the compiler.
In my previous big commits, I've added a `declared_at` field to the `Node` struct. It holds an integer that serves as an id of the declaration this variable references. Declarations are stored in `context.declarations`, which is an array indexed by the id. Declarations can be expressions (most common), `extern_ref`s (columns from actual tables), tables (for table names) or functions.

Because SQL has interesting rules on which variables can contain aliases and which cannot (i.e. WHERE cannot and ORDER BY must), each variable has to be able to be materialized either as an alias or as an expression, depending on the materialization location.

This is why the name resolver (the first stage of semantic analysis) moves all expressions from the AST to `context.declarations` and leaves a simple ident in the AST. Even unnamed columns leave `Ident("<unnamed>")` behind. This way, when materializing, one can pull declarations from a single place.
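Roughly, the shape is something like this (a minimal sketch with hypothetical, simplified names; the actual prql-compiler structs differ):

```rust
// Sketch of the declarations table described above. All names here are
// illustrative stand-ins, not the real prql-compiler definitions.

struct Node; // stands in for the real AST node
struct FuncDef; // stands in for a function definition

enum Declaration {
    Expression(Node),                           // most common
    ExternRef { table: usize, column: String }, // column of an actual table
    Table(String),                              // a table name
    Function(FuncDef),                          // a function
}

#[derive(Default)]
struct Context {
    // Indexed by the integer id stored in a node's `declared_at` field.
    declarations: Vec<Declaration>,
}

impl Context {
    // Move an expression out of the AST and return its id; the AST keeps
    // only an ident whose `declared_at` points back here.
    fn declare(&mut self, expr: Node) -> usize {
        self.declarations.push(Declaration::Expression(expr));
        self.declarations.len() - 1
    }
}
```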
Now, I've split the resolver into 3 stages: name resolver, type checker and transform constructor. This way, we can treat (and type check) transforms just like ordinary functions. The problem is that now they must adhere exactly to the rules for ordinary functions. This becomes problematic in a few cases:
1. `join side:left`, where `left` would be resolved as a normal variable (which would reference a column). I've solved this by adding the type `<unresolved>`, which prevents this parameter from being resolved; see the sketch after this list.
2. `join employees`, where `employees` should be resolved in the namespace of tables. I've solved this by resolving all `<table>` types in their own namespace.
3. Name resolving depends on transforms: for example, `select` and `aggregate` clear the scope of the frame columns. But functions are identified as transforms only in stage 3. This cannot be done earlier, because they may be missing parameters due to currying and such. I currently don't have a solution to this.
4. When moving expressions into `context.declarations`, they lose information about the location of their declaration. This is needed to wrap functions that return a column within `select` or `derive` into `Windowed`. I don't have a solution to this.
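For the first point, the idea is roughly this (a hypothetical sketch; the compiler's actual type representation differs):

```rust
// Sketch: parameters typed with an `<unresolved>` marker are skipped by
// name resolution, so `left` in `join side:left` is never looked up as a
// column. All names here are illustrative.
use std::collections::HashMap;

enum Ty {
    Unresolved, // the `<unresolved>` marker: leave the ident untouched
    Table,      // resolved in the table namespace
    Column,     // resolved among the current frame's columns
}

struct Scope {
    tables: HashMap<String, usize>,  // table name -> declaration id
    columns: HashMap<String, usize>, // column name -> declaration id
}

// Returns the declaration id for the ident, or None when resolution is
// deliberately skipped (or the name is unknown).
fn resolve_arg(param_ty: &Ty, ident: &str, scope: &Scope) -> Option<usize> {
    match param_ty {
        Ty::Unresolved => None, // e.g. `left` in `join side:left`
        Ty::Table => scope.tables.get(ident).copied(),
        Ty::Column => scope.columns.get(ident).copied(),
    }
}
```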
One approach that would solve 4. would be to keep all expressions in the AST and resolve them as soon as possible, maybe during type checking. This would also simplify some parts of the code base, because there would be no need to traverse both the AST and the declarations anymore.
All that said, I'll go and sleep on it before going down into another major refactor.
@max-sixty, is this a valid query?

```
derive a = 5
```
If we want to keep the promises of transforms, I'd say it is not, because `derive` only adds a column to a table. And since there is no `from`, `derive` is missing its last argument, which means that it has a type of `func frame -> frame` and thus cannot be converted to a SQL query; see the sketch below.
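To spell that out (a hypothetical sketch of curried function types, not the compiler's actual representation):

```rust
// Sketch: applying a curried function to fewer arguments than it has
// parameters yields another function type, not a frame.
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    Frame,
    Literal,
    Func(Box<Ty>, Box<Ty>), // parameter type -> return type
}

// Apply one argument, returning the remaining type.
fn apply(func: Ty) -> Option<Ty> {
    match func {
        Ty::Func(_param, ret) => Some(*ret),
        _ => None,
    }
}

fn main() {
    // derive: literal -> frame -> frame
    let derive = Ty::Func(
        Box::new(Ty::Literal),
        Box::new(Ty::Func(Box::new(Ty::Frame), Box::new(Ty::Frame))),
    );
    // `derive a = 5` supplies only the literal...
    let remaining = apply(derive).unwrap();
    // ...leaving `func frame -> frame`: a function, not a frame, so it
    // cannot be materialized as a SQL query on its own.
    assert_eq!(
        remaining,
        Ty::Func(Box::new(Ty::Frame), Box::new(Ty::Frame))
    );
}
```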
But in that case we should provide a syntax for table literals: #286
> @max-sixty, is this a valid query?
>
> ```
> derive a = 5
> ```
I could see two options:

- We start with a null tablespace / namespace, and can add columns like that; e.g. an implicit `from null` (which gets cleared by `from foo`? Or maybe `derive` has an alternative meaning when it doesn't follow a `from`?)
- We use #286 to do this
I do think `derive a = 5` is quite a natural, cohesive way of creating a column from nothing. It's also consistent with SQL's `SELECT 5 as a`. But I'd be fine giving this up if that makes the overall thing more cohesive. We can also try it and iterate; we're still early enough to do that, I think.
My concern is that `derive` only adds columns, not rows.

```
from null      # empty table: 0 columns, 0 rows?
derive a = 5   # 1 column, 0 rows?
```

If `derive` has special behavior without its last parameter, we need default values for positional arguments (which breaks other things).

So I'd stick with option 2.
> My concern is that `derive` only adds columns, not rows.

Yes, good point. Agree with option 2!
Meta point: I think it's good that we're starting to hit these issues, and it's good that we're resolving them in a conceptually correct way, even if that brings small backward-incompatible changes or deviations from SQL. It's forcing us to confront the SQL warts we didn't even know about!
I agree very much with that: one should be able to scan half of the transforms and then be able to read PRQL and anticipate the result of a query.
70 tests pass, 17 fail.
@max-sixty: the semantic branch should now be quite stable, in the sense that most of my major changes have been applied. I still have plans to change up the materializer (which will also be renamed to anchorer or something). I also have plans for the IR to ditch `Declarations`, so after that I hope it will be easier to debug.

I also disabled the duckdb integration tests; I got tired of waiting for duckdb to compile. So maybe base off commit d3bc7e0 instead, if you don't want that.
> @max-sixty: the semantic branch should now be quite stable, in the sense that most of my major changes have been applied. I still have plans to change up the materializer (which will also be renamed to anchorer or something). I also have plans for the IR to ditch `Declarations`, so after that I hope it will be easier to debug.
Awesome, I will switch over to it.
> I also disabled the duckdb integration tests; I got tired of waiting for duckdb to compile. So maybe base off commit d3bc7e0 instead, if you don't want that.
In my core iteration loop (at the bottom of https://github.com/prql/prql/blob/main/CONTRIBUTING.md), I include `-p prql-compiler --lib`, which only runs unit tests. Does that help? (Of course, no stress about disabling; only suggesting it as it may be easier and mean you don't have to diverge from main.)
I was doing that a lot, but `cargo build` and rust-analyzer still had long compilation times. For some reason, compiling duckdb took way too long and sometimes used >5 GB of RAM. Not sure why.

So I just keep rebasing the last commit to the top of my chain so I can remove it before the merge. That's why I gave you the notice, so you won't include it in your commit chain.
> I was doing that a lot, but `cargo build` and rust-analyzer still had long compilation times. For some reason, compiling duckdb took way too long and sometimes used >5 GB of RAM. Not sure why.
FYI you're correct that it does compile all dev-dependencies when running tests. But on my system I can't seem to make it recompile once it's been compiled once. If you can reproduce, I'm very happy to either fix it or move the integration tests back into a separate crate.
As discussed, this branch (and this PR) will now turn into a feature branch that we merge into. The plan is to finish the work that I started and to onboard @max-sixty, @priithaamer and anyone else who wants to learn our codebase.
I promise not to rebase this branch anymore. We will also refrain from merging changes from the main branch, so it will be easier to rebase/squash back to the main branch at the end.
Currently, many tests are failing: some produce errors/panics and some don't match the snapshots. Make sure not to accept any failing insta snapshots, because they (mostly) fail for a reason, and we need to fix the compiler, not the tests. So we will be merging commits that don't pass CI tests, but please run `cargo clippy` and `cargo fmt` before merging.
There are some code blocks commented out, mostly dealing with windowing. Don't delete them; we will still need that code.
To get started, I suggest reading ARCHITECTURE.md (on this branch, I updated it a little) and the two book pages I wrote on the `book-semantics` branch.
A few quick hints:

- We have 2 different `Expr` structs and 2 different `ExprKind` enums: one in the `ast` module and one in the `ir` module.
- We have 2 different `Context` structs: one for the semantic context and one for the SQL context. There is also `AnchorContext` within the SQL context. At this point I use the word "context" for any container that holds additional information along with the AST. It's hard to name things.
- We have 3 types of ids: semantic ids (usually named just `id` or `declared_at`), `ir::CId` (column id) and `ir::TId` (table id). See the sketch after this list.
- AST means Abstract Syntax Tree, and we have two of them. One is confusingly named `ast` because I cannot find a better name.
- If duckdb is taking too long to compile, cherry-pick a commit on branch `aljaz` that disables it. Just make sure not to merge it into the `semantic` branch.
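The three id kinds can be pictured as simple newtypes (illustrative definitions only; the real ones live in the compiler's modules):

```rust
// Sketch of the three id kinds as distinct newtypes, so they cannot be
// accidentally mixed up. Hypothetical definitions, not the real ones.

/// Semantic id: an index into the declarations table
/// (usually stored in `id` / `declared_at`).
pub struct Id(pub usize);

pub mod ir {
    /// Column id, unique across the query's IR.
    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    pub struct CId(pub usize);

    /// Table id.
    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    pub struct TId(pub usize);
}
```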
I'll be adding an `onboarding` tag to issues that have a prepared task for this branch. I suggest setting an assignee to avoid multiple people working on the same thing.
@aljazerzen how do you want to handle merging `main` in here? I'm guessing merge difficulty is super-linear in size: doing lots of small merges is easier than one big merge.
Doing https://github.com/prql/prql/issues/1048 would be useful for this, so "all existing tests pass" can be a marker for the merge being successful.
I don't think we need to merge main in here. Is there something there that you need? If we don't merge, it will be easier to rebase/squash at the end, when we are done with semantic.
I did the full merge and ignored failing tests.
Let's try and keep the changes here as scoped as possible. I appreciate that the place we're working in is an easy place to make a quick change, but it does make merges harder, and it makes reviews much harder. Some changes do need a huge refactor, but some changes don't; for anything like editing a Readme / a `.gitignore` / changing names of structs like `Interval`, let's try not to do those here, since then we get conflicts with any changes in `main` in those areas.

...unless we decide to literally abandon `main`, but I'm not sure we're there yet.
Thank you!
> If we don't merge, it will be easier to rebase/squash at the end, when we are done with semantic.
Sorry @aljazerzen, I just saw this and the message above, and as you can see I did the merge yesterday.

But how is it the case that it would be easier to merge at the end than as we go? My claim is that it's much harder to merge bigger conflicts than lots of small ones.

Unfortunately, I also may have made this mistake by not seeing your message in time:

> Make sure not to accept any failing insta snapshots, because they (mostly) fail for a reason, and we need to fix the compiler, not the tests.

I can probably revert these; do you think that would be worthwhile?
> My concern is that `derive` only adds columns, not rows.

FWIW, `derive` & `select` can already add rows when doing stuff such as `s"json_array_elements()"`
@max-sixty @MarinPostma I'm force-pushing `semantic-rebased` to `semantic`. You'll have to hard-reset your local branches and rebase/cherry-pick your unpushed commits onto the new commits.
> > My concern is that `derive` only adds columns, not rows.
>
> FWIW, `derive` & `select` can already add rows when doing stuff such as `s"json_array_elements()"`

@mklopets Wait, what? How can `json_array_elements()` be used in `select`?
> @mklopets Wait, what? How can `json_array_elements()` be used in `select`?
@aljazerzen sorry, I missed your reply! Here's a Postgres example. In case that was a source of confusion: this does pass `json_array_elements` an argument, which I omitted from my previous comment for simplicity.

```sql
select json_array_elements(arr), something from (
  select '[1,2,3,4]'::json as arr, 'test' as something
) data;
```
The subquery returns just one row:

| arr | something |
|---|---|
| [1,2,3,4] | test |
Adding `json_array_elements` turns 1 row into 4 in this case. This is the full query's result:

| json_array_elements | something |
|---|---|
| 1 | test |
| 2 | test |
| 3 | test |
| 4 | test |
And, through s-strings, this could be done with both `select` and `derive` operations.
> @mklopets: Adding `json_array_elements` turns 1 row into 4 in this case. This is the full query's result:
I see. It has happened a few times now that I've started thinking that I understand SQL in depth, but things like this then surprise me time and again! It gives the impression that I don't understand the underlying model that SQL is working with.
In this case, the "range function" `json_array_elements`, which is normally used in place of a range expression (in FROM or JOIN), is used in the SELECT projection. This creates a column from the result of the function call and broadcasts the values of the other columns. I have not seen this behavior before.

The good news is that this is doable only with s-strings and would not fit in our stdlib.