prql
docs: website article
pre-commit.ci autofix
I don't get why this was required — this should be the default. Possibly because it was failing on a commit from itself, and pre-commit doesn't know whether it might be in a loop, so disables that?
Excellent! I'll take a pass later.
Would you prefer this is from "the project" or from you? I notice it doesn't have your name at the moment. I would think it's the sort of piece that could come from an individual (rather than some official project documentation), but no strong view. If we have it from the project, we can adjust some of the language from "I"
(I went through a phase of doing that through PRQL's docs as PRQL became a proper project, lmk if you still see any remnants!)
Would you prefer this is from "the project" or from you?
Actually, this can remain as coming from me: a few times I speak in the first person and give some personal opinions that may not apply to the whole PRQL team.
@aljazerzen I just found this, have a look! Functional Programming with SQL Query lambdas, functions and pipelines #6
The whole repo seems to be about this: SQL Functional Programming
I haven't done more than glance at it, mostly because I find the SQL really doesn't lend itself to taking a quick look and getting a good idea of what it's about (a great motivation for PRQL), but it seems very similar in aim and spirit (if not syntax) to PRQL and your article in particular.
Oh my god - that's exactly what I'm talking about. A lot of useful material! Thanks!
From that article:
In SQL-99 every query is, at it's base, hard coded to a global namespace of tables and columns. This means that there is no mechanism to cleanly separate the query transformation from the actual data that is being processed.
For me this is a key point, and from my understanding, still an issue with our current name resolution. See for example @aljazerzen 's comment here where it is said:
... we'd be looking up rel.salary in rel. In cases you provided, no relation had a column rel.salary, so it would be also produce an error.
I believe that creating function scope local table names and then restricting column name resolution to that scope is essential to enabling reusable functions in PRQL.
I also think that this links back to the monadic structure of our transformation pipelines and the action of the bind operator, which I unfortunately cannot elucidate properly yet and which remains a hunch for me at this stage.
The rest of the Benefits section of that article is basically the same set of motivations that PRQL shares.
Thank you both for all the feedback. It really was too skewed toward functions and not enough toward relations.
I consider this finished, but as always, leave comments even after it is merged. We can link to it in the next release.
Hi team, nice to know that someone has some value from my reflections on functional programming with SQL! I don't love SQL syntax either - so for me, PRQL is an exciting development.
It is a bit hard to follow the thread above to understand where your function definition/reference discussions landed. Could someone sketch out how a reference to summarise and the definition of it would land in PRQL?
Usage
SELECT *
from summarise((SELECT country_id summary_id, state_id id FROM states),
(SELECT state_id id, population amount FROM population_by_states)
);
Definition
CREATE VIEW summarise(summary_id, amount) AS
GIVEN (mapping(summary_id, id), rawdata(id, amount))
SELECT mapping.summary_id, sum(rawdata.amount) amount
FROM rawdata rawdata
JOIN mapping mapping ON rawdata.id = mapping.id
GROUP BY mapping.summary_id;
I'm keen to catch up with the conversation.
David
Hi @DavidPratten ,
Welcome and great to have you part of the conversation!
I did promise to raise a new issue for you; apologies for not getting to that yet. We're trying to tie up a couple of loose ends in order to get the 0.4 release out this weekend. There will actually be a number of changes that affect the "table" (aka CTE) and transform definitions, which have a bearing on what you are interested in.
Our functions actually work very well for the most part, and many things that appear to be something different on the surface, e.g. relations and transforms, are actually functions underneath, following the functional programming paradigm.
What I see as an outstanding issue for us is exactly what you described so well in the passage that I quoted from your article:
In SQL-99 every query is, at it's base, hard coded to a global namespace of tables and columns. This means that there is no mechanism to cleanly separate the query transformation from the actual data that is being processed.
My understanding is that the compiler currently needs to be able to resolve names in the global namespace of tables and columns at function definition time, which in some ways leaves us in the same position as SQL. @aljazerzen is really our compiler expert and mage, though, so he would have to confirm that.
I suggest, though, that we put a temporary hold on this discussion until 0.4 is released, so that the public documentation matches the state of the compiler that the core team is referencing in the discussions. 0.4 should come out soon (this weekend probably), so we can pick things up after that.
I hope you find that acceptable.
P.S. You can read @aljazerzen 's article that this thread is about here: A functional approach to relational queries
@DavidPratten , since we found each other through Torsten Grust's twitter post about "Making Recursive CTEs more lovable", you may also want to take a look at the current discussion around the proposed Recursive CTE syntax for PRQL and weigh in on that.
That discussion can be found here: https://github.com/PRQL/prql/issues/407#issuecomment-1380921570
It is a bit hard to follow the thread above to understand where your function definition/ reference discussions landed. Could someone sketch out how a reference to summarise and the definition of it would land in PRQL?
Yeah, this thread does not give good context. Your function summarise would look like this:
func summarise mapping rawdata -> (
  from rawdata = _param.rawdata
  join mapping [==id]
  group mapping.summary_id (
    aggregate [amount = sum rawdata.amount]
  )
)

summarise (
  from states | select [summary_id = country_id, id = state_id]
) (
  from population_by_states | select [id = state_id, amount = population]
)
Don't mind the _param thing; name resolution is a work in progress.
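For readers who want to check what summarise is meant to compute, here is a minimal sketch in Python using the standard sqlite3 module. It is not how PRQL compiles the function: the two parameters are simply passed as SQL strings and spliced in as CTEs, and the table contents are made-up illustration data.

```python
import sqlite3


def summarise(conn, mapping_sql, rawdata_sql):
    """Sum rawdata.amount per mapping.summary_id.

    mapping_sql must yield columns (summary_id, id);
    rawdata_sql must yield columns (id, amount).
    """
    # The parameter queries become CTEs, so the body only ever
    # refers to the local names `mapping` and `rawdata`.
    query = f"""
        WITH mapping AS ({mapping_sql}),
             rawdata AS ({rawdata_sql})
        SELECT mapping.summary_id, SUM(rawdata.amount) AS amount
        FROM rawdata
        JOIN mapping ON rawdata.id = mapping.id
        GROUP BY mapping.summary_id
        ORDER BY mapping.summary_id
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
# Illustration data only; the real tables are not shown in the thread.
conn.executescript("""
    CREATE TABLE states (country_id INTEGER, state_id INTEGER);
    CREATE TABLE population_by_states (state_id INTEGER, population INTEGER);
    INSERT INTO states VALUES (1, 10), (1, 11), (2, 20);
    INSERT INTO population_by_states VALUES (10, 100), (11, 250), (20, 70);
""")

result = summarise(
    conn,
    "SELECT country_id AS summary_id, state_id AS id FROM states",
    "SELECT state_id AS id, population AS amount FROM population_by_states",
)
print(result)
```

The renaming happens at the call site, so the body of summarise never mentions states or population_by_states.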
Thanks @aljazerzen nice. A couple of follow-ups to start with.
Polish notation
summarise (
  from states | select [summary_id = country_id, id = state_id]
) (
  from population_by_states | select [id = state_id, amount = population]
)
This looks a bit like Polish notation, where parameters are not grouped by ( ). Is this a pattern used in other parts of PRQL, or just here?
Declaring column names and types
In the PRQL example, the column names are being drawn from the underlying table in the global namespace.
To disconnect the function body from the global namespace and have a function that is "write-once use anywhere", the function declaration needs to include column names and types, just like any other relation (table) declaration.
In my article, I ignored types, but if they were included, the GIVEN clause might look like this:
GIVEN (mapping(summary_id varchar, id bigint), rawdata(id bigint, amount float))
What is the PRQL equivalent of this?
On invocation are we matching names or ordinal position?
On invocation of the function there is a choice: match columns by ordinal position or by name. For example, SQL UNION matches by ordinal position. Alternatively, the matching could be on column name. This would force explicit renaming (e.g. id = state_id) of input relations to use the names in the function's column declaration.
David
Polish notation
This is our function call notation that starts with the function name and continues with arguments separated by whitespace. It's explained here and here.
Declaring column names and types
At the moment we don't have functioning type annotations, but they are on the roadmap.
Functions are meant to be "write once, use anywhere", and this is achieved by expecting the input relation to have a certain structure. Right now, this contract is expressed implicitly, just by using references to the columns, but the plan is to 1) allow adding type annotations to functions and 2) automatically infer the columns (and maybe types) from the function body.
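To make the idea of a column contract concrete, here is a minimal Python sketch, not PRQL's design. Relations are lists of dicts, and the REQUIRED table and check_columns helper are hypothetical names introduced only for this illustration. The sketch also assumes each id appears at most once in mapping.

```python
from collections import defaultdict

# Hypothetical explicit contract: the columns each parameter must provide.
REQUIRED = {
    "mapping": {"summary_id", "id"},
    "rawdata": {"id", "amount"},
}


def check_columns(name, rows):
    """Raise TypeError if any row lacks a column the contract requires."""
    for row in rows:
        missing = REQUIRED[name] - set(row)
        if missing:
            raise TypeError(f"{name} is missing columns: {sorted(missing)}")


def summarise(mapping, rawdata):
    """Sum rawdata amounts per summary_id, using only the declared columns."""
    check_columns("mapping", mapping)
    check_columns("rawdata", rawdata)
    # Assumes ids are unique in mapping (a simplification of a real join).
    summary_of = {m["id"]: m["summary_id"] for m in mapping}
    totals = defaultdict(int)
    for r in rawdata:
        if r["id"] in summary_of:
            totals[summary_of[r["id"]]] += r["amount"]
    return dict(totals)


mapping = [{"summary_id": 1, "id": 10}, {"summary_id": 1, "id": 11}]
rawdata = [{"id": 10, "amount": 100}, {"id": 11, "amount": 250}]
totals = summarise(mapping, rawdata)
print(totals)
```

Inferring the contract from the function body would amount to deriving REQUIRED automatically from the column references, rather than writing it by hand as done here.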
On invocation are we matching names or ordinal position?
Relations in PRQL have columns ordered and optionally named. This is reflected in their type signature. For now, we only had to do matching for append (which is UNION ALL) and we haven't explicitly decided on how to do matching exactly. As a first iteration, I implemented matching based on position, but we will have to revisit that some time soon.
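The difference between the two matching strategies can be sketched in Python. The first part uses sqlite3 to show that SQL's UNION ALL pairs columns by position and ignores the names in the second SELECT. The second part, append_by_name, is a hypothetical helper written for this illustration and says nothing about how PRQL implements or will implement append.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Positional matching: the second SELECT's column names are ignored,
# so 'b' lands in the first column and 2 in the second.
positional = conn.execute("""
    SELECT 1 AS id, 'a' AS label
    UNION ALL
    SELECT 'b' AS label, 2 AS id
""").fetchall()


def append_by_name(top, bottom):
    """Append two relations, each a (columns, rows) pair, matching by column name."""
    top_cols, top_rows = top
    bottom_cols, bottom_rows = bottom
    if set(top_cols) != set(bottom_cols):
        raise ValueError("relations have different column names")
    # For each column of `top`, find its position in `bottom`.
    order = [bottom_cols.index(c) for c in top_cols]
    reordered = [tuple(row[i] for i in order) for row in bottom_rows]
    return top_cols, list(top_rows) + reordered


top = (["id", "label"], [(1, "a")])
bottom = (["label", "id"], [("b", 2)])
by_name = append_by_name(top, bottom)

print(positional)
print(by_name)
```

With name matching, the second relation's row is reordered to (2, 'b') before it is appended. With positional matching, it is kept as ('b', 2).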