prql icon indicating copy to clipboard operation
prql copied to clipboard

Improve tooling for iteration with `range`, `map`, `fold`, `flat_map`

Open richb-hanover opened this issue 1 year ago • 7 comments

What's up?

I posted the following on Discord:

I have created a PRQL query that does something wonderful using a date. (Exactly what isn't really interesting, just that it's related to some date.)

I would like to create a query that iterates through a number of dates using the query, say with the first day of every quarter:

yyyy-01-01 yyyy-03-01 yyyy-06-01 yyyy-09-01

(or the first of every month, or the first and fifteenth of every month...) And then, of course, I will want to substitute yyyy with many years, 2010 .. 2023 to make the query for multiple years.

IF I was writing a procedural language, I would have a couple nested loops substituting each year for yyyy, then making the (four) separate queries.

I see that PRQL has a loop construct, and I wonder if that might be applicable to this situation. The examples in the Book and the Blog aren't really instructive: they seem "too clever by half", and don't give any guidance for a real-world application that uses data from a table.

So... How could I do this with PRQL? What auxiliary tables might I need to set up?

Many thanks.

@aljazerzen responded:

Your question is a very valid question that should have an easy answer, but I don't have a spare hour to construct it with the tools that PRQL offers today... This use case should be posed to GH issues as "something that we want to implement some day" and also for other people that don't use discord to maybe get a better "ideal scenario" query.

let query_quarters = func year rel -> (
  rel
  filter (
    this.date in [
      @01-01 + year,
      @03-01 + year,
      @06-01 + year,
      @09-01 + year
    ]
  )
  select {this.date, this.value}
)

let main = (
  range(2010, 2024)
  # this is an array of ints, first is 2010, last is 2023

  # convert the into into a relation
  map (year -> query_quarters year my_db.my_table)
  # this is now an array of relations

  # append all relations together
  fold append []
)

Thanks @aljazerzen! I will use this hint as the start of some wool-gathering thoughts to see if I can piece together a query using today's PRQL.

richb-hanover avatar Dec 08 '23 13:12 richb-hanover

A bit laborious but I think it shows that you should have all the tools available to you:

from [{n=1}]
loop (
  filter n<24
  select n = n+1
)
derive {
  y = (2010 + (n-1) // 12 | as int),
  m = ((n-1) % 12) + 1,
  s = f"{y}-{m}-1",
  d = (s | as date)
}

Query result image

snth avatar Dec 08 '23 15:12 snth

@snth, @eitsupi Thanks - I see how this is part of the solution. If I understand correctly, that query creates a table of dates - the first day of the month for a range of years. (I could, of course, create this table by hand, if I didn't mind boring work.)

But this misses the next step: how to "pick off" each of those dates to use in a subsequent query. Specifically, I have a similar query from May that takes a specific date and counts rows based on this date.

How can I create a query that has the effect of iterating through the "table-of-dates", substituting each date into the query below? Thanks.

from reservations        

filter resdate <= "2015-03-01"
aggregate {
  count this
}

PS I see that @aljazerzen hypothesizes (in his comment above) a future map() function that might do this. Is it possible to cobble together current-day PRQL to generate SQL to do this?

richb-hanover avatar Dec 09 '23 13:12 richb-hanover

Yikes! This query might be simpler than I imagined. I asked ChatGPT this:

I have a table (t1) that has a column named "eventDate". Each row has a date for an event that occurred. I have a second table (t2) that has a column named "reportDate". Each row of t2 contains a date. I want to create a query that will use the reportDate's from t2 to count the events from t1 whose eventDate is less than the reportData. What SQL query will do this?

and it gave back this:

SELECT t2.reportDate, COUNT(t1.eventDate) AS EventCount
FROM t2
LEFT JOIN t1 ON t1.eventDate < t2.reportDate
GROUP BY t2.reportDate;

At first blush, this seems to work. I need to convert the query back to PRQL, then check that I can actually generate the table I want, and also compare against the correct numbers I have from a different query.

The big lesson here is that I need to stop thinking procedurally - no fancy "looping" or map() function necessary. I can just let SQL be SQL. Join two big tables - the database doesn't care 'cause it'll get optimized out. (Or maybe it will take a "lot of time". But I only run this query once a month, so that won't matter.)

richb-hanover avatar Dec 10 '23 03:12 richb-hanover

@richb-hanover, I'll hide your second part of the question, so this issue can document the interesting case you've found with the first part

aljazerzen avatar Dec 12 '23 12:12 aljazerzen

I've cleaned up the examples a bit and made them exactly equivalent, so we have a reference when we get around to implement this.

The task is to generate an array or a relation of first dates of year quarters for a range of years.

Currently supported:

let year_from = 2011
let year_to = 2014

let main = (
  from [{n=0}]
  loop (
    select n = n + 1
    filter n < (year_to - year_from) * 4
  )
  derive {
    y = (year_from + n // 4 | as int),
    m = (n % 4) + 1,
    d = 1,
    s = f"{y}-{m}-{d}",
  }
  select (this.s | as date)
)

range, map and fold:

let year_from = 2011
let year_to = 2014

let main = (
  (range year_from year_to)
  # this is an array of ints

  # convert the year into four dates
  map (year -> [
    @0000-01-01 + year,
    @0000-03-01 + year,
    @0000-06-01 + year,
    @0000-09-01 + year,
  ])
  # this is now an array of arrays of dates
  
  # append arrays to one another to produce a single array
  fold append []
)

range, flat_map:

let year_from = 2011
let year_to = 2014

let main = (
  (range year_from year_to)
  # this is an array of ints

  # convert the year into four dates
  flat_map (year -> [
    @0000-01-01 + year,
    @0000-03-01 + year,
    @0000-06-01 + year,
    @0000-09-01 + year,
  ])
)

aljazerzen avatar Dec 12 '23 12:12 aljazerzen

Thanks! And once I have a chance to verify the results of the second part of my question, I'll document it in a Discussion.

richb-hanover avatar Dec 12 '23 13:12 richb-hanover

Excellent examples @aljazerzen !


Some prior art of generating arrays (which can easily be unnested into columns): generate array & generate date array.


While the examples above are impressive, I would generally pause before focusing on arbitrary iteration over values — including things like fold. The atomic unit in SQL is generally a column (as https://github.com/PRQL/prql/issues/3782 somewhat showed, select is kinda equivalent to an elementwise map). And while we can make some headway at handling individual values — at some point folks should use a general purpose language.

So I would vote to focus on marginal improvements in this area, taking inspiration from what already exists in SQL, and deprioritize targeting full value-level functional constructs...

max-sixty avatar Dec 18 '23 20:12 max-sixty