prql
prql copied to clipboard
Many errors when attempting to truncate final runs of the same value
What's up?
Consider a table like:
id val
1 a
2 b
3 a
4 a
5 b
6 b
7 a
8 b
9 b
10 b
There are "runs" of values in the "val" column, i.e. sequences like aa or bbb. What I want to do is remove the final "run", so that the result is like:
id val
1 a
2 b
3 a
4 a
5 b
6 b
7 a
Also in scope is to include first row of the final run but nothing more:
id val
1 a
2 b
3 a
4 a
5 b
6 b
7 a
8 b
I tried to do this in PRQL but everything I try seems to cause into some inscrutable error.
Here is what I have tried. Not all of these attempts may be correct, they just detail stuff I ran into in the process of trying to figure out what to do.
Attempt 1
let my_data = [
{id=1, val="a"},
{id=2, val="b"},
{id=3, val="a"},
{id=4, val="a"},
{id=5, val="b"},
{id=6, val="b"},
{id=7, val="a"},
{id=8, val="b"},
{id=9, val="b"},
{id=10, val="b"},
]
from my_data
sort id
derive {
last_val = lag 1 val,
changed = coalesce (val != last_val) true
}
filter changed
derive {run_id = row_number this}
select {id, run_id}
join side:left my_data (my_data.id == this.id) # fails here
Error:
╭─[:24:25]
│
24 │ join side:left my_data (my_data.id == this.id) # fails here
│ ─────┬────
│ ╰────── Ambiguous name
│
│ Help: could be any of: that.my_data.id, this.my_data.id
────╯
Error: Binder Error: Referenced table "table_2" not found!
Candidate tables: "table_1"
LINE 92: table_2.id
For context, commenting out the join line gives this result:
id run_id
1 1
2 2
3 3
5 4
7 5
8 6
I think this is a bug because the my_data name doesn't look ambiguous to me.
It also fails when I try to use join side:left my_data (==id):
Error: Binder Error: Referenced table "table_2" not found!
Candidate tables: "table_1"
LINE 92: table_2.id
Attempt 2
I've omitted the let my_data = ... definition from subsequent fragments for brevity.
let my_data = [
{id=1, val="a"},
{id=2, val="b"},
{id=3, val="a"},
{id=4, val="a"},
{id=5, val="b"},
{id=6, val="b"},
{id=7, val="a"},
{id=8, val="b"},
{id=9, val="b"},
{id=10, val="b"},
]
let run_ids = (
from my_data
sort id
derive {
last_val = lag 1 val,
changed = coalesce (val != last_val) true
}
filter changed
derive {run_id = row_number this}
select id
)
from my_data
sort id
join side:left run_ids (==id)
filter my_data.id < max(run_ids.id) # fails here
Error:
╭─[:29:25]
│
29 │ filter my_data.id < max(run_ids.id) # fails here
│ ─────┬────
│ ╰────── Unknown name
────╯
It can't resolve the run_ids name even though I defined it in a let statement just before.
Attempt 3
While troubleshooting the above I rewrote it into a trivial let binding and tried to print it:
let run_ids = (
from my_data
sort id
derive {
last_val = lag 1 val,
changed = coalesce (val != last_val) true
}
filter changed
derive {run_id = id}
select run_id
)
from run_ids
Error: Binder Error: Referenced column "id" not found in FROM clause!
Candidate bindings: "run_ids.run_id"
LINE 91: id
However when I remove the let block and have the same fragment in the top level scope, it works:
from my_data
sort id
derive {
last_val = lag 1 val,
changed = coalesce (val != last_val) true
}
filter changed
derive {run_id = id}
select run_id
Attempt 4
In this one I try to use aggregate and group but it doesn't work either:
from my_data
sort id
derive {
last_val = lag 1 val,
changed = coalesce (val != last_val) true
}
group changed (
aggregate {
max_id = max id
}
)
Error: Binder Error: GROUP BY clause cannot contain window functions!
But it seems to me that this should be possible given that before the group function the data looks like this:
id val last_val changed
1 a null true
2 b a true
3 a b true
4 a a false
5 b a true
6 b b false
7 a b true
8 b a true
9 b b false
10 b b false
Surely I should be able to group by the changed column and get the maximum of the id column for true and false?
Attempt 5
Next I simply tried to get the maximum id over the whole table, without any grouping, just as a sanity check, but this also fails:
from my_data
sort id
derive {
last_val = lag 1 val,
changed = coalesce (val != last_val) true
}
aggregate { max id }
Error: Binder Error: column "id" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(id)" if the exact value of "id" is not important.
Attempt 6
Next I tried doing it in the filter function, but I couldn't figure out how to get the expression I wanted. I don't know if this is a bug or that I'm just not understanding PRQL's syntax:
from my_data
sort id
derive {
last_val = lag 1 val,
changed = coalesce (val != last_val) true
}
filter id < (this | filter changed | aggregate { max id })
I'm basically trying to say "filter such that the id is less than the max id among the rows that changed". The docs say that this refers to the current relation so I thought something like this would work, but I'm not sure why it doesn't.
At this point nothing I've tried has worked, and I don't know how to proceed.
Thanks a lot for writing all this down.
I tried some of these — there are some bugs there. For example, Attempt 1 hits https://github.com/PRQL/prql/issues/3129.
Many of these happen because PRQL hasn't encoded what DuckDB can accept — anything that contains "Binder Error" is an error from DuckDB. Often we could write different SQL that wouldn't hit those errors (they're not bugs in DuckDB) — for example in Attempt 4, DuckDB can't accept window functions in a group by, but IIUC we could materialize the result in a CTE and then use that.
I appreciate the dedication in enumerating these, and sorry that PRQL wasn't helpful here. So this is a reminder for me & others that PRQL is still quite difficult to use for some intermediate queries.
That said, for the purposes of improving PRQL, we'd need a couple of things:
- Minimal examples of the bugs
- The SQL that should be generated
Some observations:
Attempt 1
The error in the join can be resolved by using that instead of my_data but then you hit a bug with a reference in the ORDER BY to table_2.id which is a duplicate of #3619 .
The following compiles and runs but I haven't checked whether semantically it gives you what you want @ckp95:
from my_data
#sort id
derive {
last_val = lag 1 val,
changed = (val != last_val) ?? true
}
filter changed
derive {run_id = row_number this}
select {id, run_id}
join side:left my_data (that.id == this.id) # works with `that`