Relative names

Open aljazerzen opened this issue 2 years ago • 21 comments

Abstract: I propose to change references to columns from column to .column.

Reasoning: I'll try to explain how the resolver works and how I think about the semantics of names and variables in PRQL.

During resolving, there is a major distinction between scoped and ephemeral variables:

  • Scoped variables have a definition and live as long as their scope exists. For example, std.sum and std.select are global so they exist indefinitely, and function parameters exist only within the function body.
  • Ephemeral variables are just references into some other argument of the current function call. For example, when you call select, all columns of the relation exist as variables during resolution of the first argument.

It is beneficial to distinguish these two mechanisms because of their subtle differences. For example, take this query:

func my_transform rel -> (
    rel
    select [alb.title, artist_id]
)

from alb = albums
my_transform

Here, a relation is constructed with from, and within the relation the name alb is assigned to all columns from table albums. Note that alb is not a "real" value, it's just a namespace for the columns. When this relation is passed to my_transform, it is stored in the rel parameter. rel is now a scoped variable while alb.title is a reference to one of its columns.
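
To make the distinction concrete, here is a rough Python model of the example above. It is only a sketch: the Relation class, the resolve function, and the column list are hypothetical names made up for illustration, and this is not how prql-compiler is implemented. Checking parameters before columns is an arbitrary choice in this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """Hypothetical stand-in for the relation built by `from alb = albums`."""
    namespace: str                               # "alb" is only a namespace, not a value
    columns: list = field(default_factory=list)  # assumed column names


def resolve(name, params, relation):
    """Classify a name as scoped or ephemeral (illustrative only)."""
    # Scoped variable: has a definition, e.g. a function parameter.
    if name in params:
        return ("scoped", name)
    # Ephemeral variable: a reference into the relation argument of the call.
    prefix = relation.namespace + "."
    column = name[len(prefix):] if name.startswith(prefix) else name
    if column in relation.columns:
        return ("ephemeral", column)
    raise NameError(f"cannot resolve {name!r}")


albums = Relation(namespace="alb", columns=["title", "artist_id"])
params = {"rel": albums}  # inside my_transform, `rel` holds the relation

print(resolve("rel", params, albums))        # ('scoped', 'rel')
print(resolve("alb.title", params, albums))  # ('ephemeral', 'title')
print(resolve("artist_id", params, albums))  # ('ephemeral', 'artist_id')
```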

I'm not sure if I've explained that well, please tell me if I haven't.

If I compare this behavior with, say, Python and a dataframe library, scoped variables are all normal idents, while ephemeral variables would be represented with strings. This is a bit more verbose and cannot provide good errors, typing or autocomplete. (This is a feature of PRQL that dataframe libraries cannot copy. Only a custom language for relations can construct custom rules for name resolution.)
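
As a rough illustration of that comparison, the sketch below uses plain Python lists of dicts instead of a real dataframe library; select and the sample rows are made up for this example. The relation is an ordinary identifier, while columns can only be named with strings, so a misspelled column is not caught until the code runs.

```python
def select(rows, column_names):
    """Keep only the named columns; columns are identified by strings."""
    return [{name: row[name] for name in column_names} for row in rows]


albums = [  # sample data, invented for illustration
    {"title": "Blue", "artist_id": 1, "year": 1971},
    {"title": "Low", "artist_id": 2, "year": 1977},
]

# `albums` is a normal identifier, so tooling can check that it is defined.
# "title" is just a string, so tooling cannot tell whether the column exists.
print(select(albums, ["title", "artist_id"]))

try:
    select(albums, ["titel"])  # the typo is only detected at run time
except KeyError as err:
    print("unknown column:", err)
```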

So because there is distinction in resolving, I suggest we add a distinction in syntax:

func my_transform rel -> (
    rel
    select [.alb.title, .artist_id]
)

from alb = albums
my_transform
sort .title

Pros:

  • distinction in syntax hints to the distinction in resolving
  • for newcomers, the rule is simple: columns start with a dot

Cons:

  • additional syntax we could do without

aljazerzen avatar Jan 24 '23 16:01 aljazerzen

I quite like the idea of a leading . for columns. I don't really know why yet but it feels like it would bring additional consistency. It also reminds me of JDOT (https://github.com/saulpw/jdot).

TBH, I did not understand the name resolution explanation yet but I will try again in the morning (it's close to midnight now). For example why is it alb.title in my_transform initially and not rel.title? And with the new syntax, why is it still .alb.title and not just .title (or if the alb is required then alb.title)?

Another possible benefit could be that it might disambiguate a column named "from" from the keyword from since the column would be referred to as .from. (IDK if this is currently a problem for the parser/compiler.)

snth avatar Jan 24 '23 21:01 snth

It also reminds me of JDOT (https://github.com/saulpw/jdot).

Perhaps the origin is jq? https://stedolan.github.io/jq/

I think jq is a very popular language for writing queries to json.

eitsupi avatar Jan 28 '23 14:01 eitsupi

Sorry to take a while to respond.

I think I'm understanding 85% of this, so forgive me if I'm slow.

I can see two points here:

  • discriminate between scoped and ephemeral variables
  • use .foo for some variables

Re the discriminating: how easy do you think it is to explain when to use a period vs. not to? I worry it's not easy! (But possibly we could make it easier).

Re the periods — I don't have a strong secular objection to it. It would be a big change, and I'm not sure it gets us that much apart from the discrimination. But it is an effective way of allowing columns to be clearly different from functions.


To what extent do you think it's accurate to describe ephemeral variables as just having a scope that's limited to that line?

max-sixty avatar Feb 03 '23 12:02 max-sixty

It's just one point here: use .foo for ephemeral variables.

The rule for when to use the dot is simple: columns start with a dot.


describe ephemeral variables as just having a scope that's limited to that line?

That's pretty accurate. But it may be confusing because even though the scope is limited to the current function, an almost identical scope could be created for the next function in the pipeline.

aljazerzen avatar Feb 03 '23 13:02 aljazerzen

It's just one point here: use .foo for ephemeral variables.

Totally, but is there an easy way to define ephemeral variables to beginners?

max-sixty avatar Feb 03 '23 20:02 max-sixty

I'm saying that for beginners, ephemeral variables can be equivalent to columns. So the whole rule is columns start with a dot. And we don't even mention ephemeral variables.

That's because we don't have anything other than relations that we'd want to have references into. Maybe in the future, we could add support for referencing properties of JSON objects or structs.

aljazerzen avatar Feb 04 '23 09:02 aljazerzen

Yes OK, that is complete in the examples above.

How about when it's a variable; for example:

func add a b -> a + b
# or
func add a b -> .a + .b
# or
func add .a .b -> .a + .b

Thanks for bearing with me...

max-sixty avatar Feb 04 '23 09:02 max-sixty

Oh, params are scoped variables so they don't need a leading dot. So like this:

func add a b -> a + b

func latest n rel -> (rel | sort [-.changed_at] | take n)

# rel and n are params -> scoped -> no dot
# .changed_at is a column (reference "into" rel) -> ephemeral variable -> dot 
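
The rule can be pictured as a dispatch on the first character of a name. The following Python sketch is illustrative only: resolve_with_dot, its arguments, and the column names are hypothetical and do not mirror the compiler's code.

```python
def resolve_with_dot(name, scoped, columns):
    """Resolve `name` under the proposed rule: a leading dot means column."""
    if name.startswith("."):
        column = name[1:]
        if column not in columns:
            raise NameError(f"no column named {column!r}")
        return ("column", column)
    if name not in scoped:
        raise NameError(f"no variable named {name!r}")
    return ("variable", name)


# Inside `func latest n rel -> (rel | sort [-.changed_at] | take n)`:
scoped = {"n", "rel"}           # function parameters
columns = {"id", "changed_at"}  # assumed columns of `rel`

print(resolve_with_dot("n", scoped, columns))            # ('variable', 'n')
print(resolve_with_dot(".changed_at", scoped, columns))  # ('column', 'changed_at')
```

In this model a bare changed_at raises an error instead of silently falling back to the column of the same name.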

aljazerzen avatar Feb 04 '23 15:02 aljazerzen

OK great, I see, thanks.

I think it's tractable. I don't think it's that friendly, and it's much more alien for those who are used to SQL.

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]


I think this is insightful, and maybe we should discuss it more in our docs...

If I compare this behavior with, say, Python and a dataframe library, scoped variables are all normal idents, while ephemeral variables would be represented with strings. This is a bit more verbose and cannot provide good errors, typing or autocomplete. (This is feature of PRQL that dataframe libraries cannot copy. Only a custom language for relations can construct custom rules for name resolution.)

....I've heard this referred to as "bare words". I find it a great advantage of PRQL over something like python. It makes sense that we promote columns to not require quotes, since columns are so important in tabular data; they're almost like variables to us.

As @eitsupi points out, jq uses the .foo syntax, and that's worked well, though they use it all the way down the hierarchy; i.e. .alb, never just alb.


So my current view is:

  • Has some nice properties
  • Concern about friendliness / alien-ness (but shouldn't be weighed highly unless this is a consensus view)
  • Concern about hierarchies

How important do you think it is for the development of the lang? Can we instead have a hierarchy of scopes (like many langs do), and resolve ephemeral variables first, and scoped variable after that?
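
For illustration, such a fallback order can be sketched with Python's collections.ChainMap, which searches its maps from left to right. The scope contents below are hypothetical: they suppose a query has derived a column named count while the standard library also provides a count function.

```python
from collections import ChainMap

# Hypothetical contents of each scope.
columns = {"count": "column `count`", "name": "column `name`"}
std = {"count": "function std.count", "sum": "function std.sum"}

# Ephemeral names (columns) are searched first, scoped names after that.
scopes = ChainMap(columns, std)

print(scopes["name"])   # column `name`
print(scopes["sum"])    # function std.sum
print(scopes["count"])  # column `count`
```

With this ordering the column shadows the function of the same name, so a bare count no longer reaches the function and a qualified name would be needed for it.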

max-sixty avatar Feb 04 '23 22:02 max-sixty

I recall that in dplyr, it is sometimes difficult to distinguish between variables outside the data frame and column names in the data frame, making the behavior confusing.

cyl <- 10

mtcars |>
  dplyr::mutate(new = cyl * 10)

It can be specified explicitly by .data or .env (but many people rarely do this because it increases the amount of writing). https://rlang.r-lib.org/reference/dot-data.html

cyl <- 10

mtcars |>
  dplyr::mutate(new = .data$cyl * 10)

I think it is a good balance of clarity and ease of writing to always start column names with a dot.

eitsupi avatar Feb 05 '23 16:02 eitsupi

I've implemented the proposal and converted the tests in prql-compiler.

Here are a few examples:

from daily_orders
sort .day
group .month (sort .num_orders | window expanding:true (derive rank))
derive [num_orders_last_week = lag 7 .num_orders]

from employees
derive rn = row_number
filter .rn > 2

from employees
derive age = .year_born - s'now()'
select [
    f"Hello my name is {.prefix}{.first_name} {.last_name}",
    f"and I am {.age} years old."
]

from employees
derive count = 12
select [
    twelve = .count,
    aggregated = count,
    aggregated_verbose = std.count,
]

Here are my findings:

  • this syntax is more verbose and less beginner-friendly than what we had before,
  • it simplifies the implementation a bit,
  • in some cases it is less ambiguous (see last example),
  • it would be nice for auto-complete, since typing . would bring up just columns for current relation,
  • there is a bit of inconsistency where we derive new names without the dot, but reference them with the dot,
  • we can now use .* to refer to all columns of the relation, where before we could not use * (since that would be parsed as multiplication).

Possible alternatives:

  • the leading dot is not required, but just encouraged,
  • the special leading dot syntax is replaced with a full name, like rel.first_name instead of just .first_name. In this case, rel. prefix would also be optional.

aljazerzen avatar Mar 01 '23 15:03 aljazerzen

Thanks for the list of findings, that's v helpful to anchor around.

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]

Is this still the same for the full path of columns? Or does alb.title work?


I think the .col syntax is fine from a blank slate, but — overall, in the current state I'm fairly strongly -1.

  • It's a very large change
  • The benefits don't seem that high. I do weigh compiler simplicity highly, since it lets us move faster with a wider group of contributors. But how great a simplification is it / do we think it would let us do much more much faster? (I might be underweighing the extent of the simplification)
  • There are some quite sharp corners IMO: the violation of the hierarchy as above, and the lack of coherence between lvalues and rvalues ("there is a bit of inconsistency where we derive new names without the dot, but reference them with the dot,"). I think these could be confusing for newcomers.
    • A counterexample is jq, which uses dots but is consistent across these

One lens to view this is what we'd write in the Changelog — I'm not sure what we'd write that I'd feel great about...

max-sixty avatar Mar 01 '23 20:03 max-sixty

Do others share a concern that this represents hierarchies inconsistently? For example alb is a relation. But to go into that hierarchy involves adding a period at its start; i.e. .alb.title. Generally to move down a hierarchy we'd only add things onto the end like alb.title or alb["title"]

Is this still the same for the full path of columns? Or does alb.title work?

Actually, this is the confusion that this issue is trying to avoid.

It separates these two cases:

References to things in global scope don't have a leading dot:

let albums = (...)

from albums.title
# `from column` does not make sense, focus on name resolution

References into subject of the current pipeline have a leading dot:

from albums
select .albums.title

So if you are able to refer to albums, you are still able to refer to albums.title.

aljazerzen avatar Mar 02 '23 11:03 aljazerzen

The implementation complexity hasn't changed enough to weigh into the decision here.

And the sharp corners that you mention are intentional - a syntactical spotlight on the semantics. So they are actually the main benefit. Think of it as the borrow checker in Rust.

But all that said, this change goes strongly against the concise nature of the language we've been able to maintain.

So my vote is -0.5.

aljazerzen avatar Mar 02 '23 11:03 aljazerzen

Thanks for trying this out @aljazerzen . Reading through your examples in https://github.com/PRQL/prql/issues/1619#issuecomment-1450322102 I'm also struck by how there is this inconsistency between rvalues and the lvalues in derive and aggregate. Would it be possible to add the leading . for lvalues as well? (Not saying we should do this as we seemed to be converging on not going ahead with this proposal, just curious if it would be possible in theory since then we could restore consistency?)

Overall, I'm still unclear on the ephemeral vs scoped variables. I was seeing the .col as a shortcut for _frame.col and as such I thought it made some sense. It is quite different to what we/most people know from other SQL/database type systems but I think one could get used to it. The . is a relatively unobtrusive piece of punctuation so I personally don't feel that it gets in the way that much. I would still be open to it if we wanted to explore it more.

snth avatar Mar 05 '23 15:03 snth

Would it be possible to add the leading . for lvalues as well?

Yes, and it would be quite easy to do actually.


I'll take the liberty to interpret @snth's comment as a vote of +0. Total tally is -1.5, which means that we will not be adding this feature.

We can revisit it when there are new features that would work well with this.

aljazerzen avatar Mar 05 '23 18:03 aljazerzen

Great, thanks for the productive discussion and exploration effort.

max-sixty avatar Mar 06 '23 00:03 max-sixty

I've been working with jq recently. They have a take on this, but I think with much easier semantics:

  • All data references use a leading period
  • The "root" namespace is just .
  • Then a column would be .date, or a reference into a struct would be .orders.address

So for example, the case above would be:

-from albums
+from .albums
select .albums.title

I think the from X is almost the only thing that changes from the full examples above — since the discriminant is whether it's referring to data, not the exact scope of the data.

max-sixty avatar May 01 '23 19:05 max-sixty

As discussed on the call, I'm not sure my example was correct — instead .albums.title is already within the .albums scope, and so should be:

  • .title
  • ...or $.albums.title
  • ...or you could allow something like .. to go up a level: ..albums.title

-from albums
+from .albums
-select .albums.title
+select .title

max-sixty avatar Mar 03 '24 20:03 max-sixty

Great work

bayareaunicorn avatar Mar 03 '24 22:03 bayareaunicorn

Reopening as this is under consideration again. Possibly we start a new issue synthesizing where we're at, given the amount of history though.

max-sixty avatar Mar 04 '24 20:03 max-sixty