prql icon indicating copy to clipboard operation
prql copied to clipboard

issue when using a DuckDb Macro

Open maelp opened this issue 1 year ago • 21 comments
trafficstars

What happened?

I have a macro defined in DuckDb to access Parquet files on s3

CREATE MACRO cse(year, weekA, weekB) AS TABLE SELECT * FROM read_parquet(list_transform(
            generate_series(weekA, weekB),
            week -> format('s3://${BUCKET_PATH}/year={0}/week={1}/0.parquet', year, week)
        ));

in SQL I can do

SELECT count(*) FROM cse(2024, 1, 5);

but if I try to compile the following PRQL

from cse(2024,1,5)
aggregate {
   count this
}

it gives an error

[{"kind":"Error","code":null,"reason":"unexpected , while parsing function call","hints":[],"span":"1:13-14","display":"Error: \n   ╭─[:1:14]\n   │\n 1 │ from cse(2024, 1, 5)\n   │              ┬  \n   │              ╰── unexpected , while parsing function call\n───╯\n","location":{"start":[0,13],"end":[0,14]}}]}

PRQL input

see above

SQL output

See above

Expected SQL output

No response

MVCE confirmation

  • [X] Minimal example
  • [X] New issue

Anything else?

No response

maelp avatar Jan 30 '24 14:01 maelp

Right now I seem to be "forced" to use an s-string which is not very neat, eg

from s"SELECT * FROM cse(2024, 1, 4)"
aggregate {
   count this
}

or something, is there a better way?

maelp avatar Jan 30 '24 14:01 maelp

similarly if I have a field with a "reserved name" like "timestamp" I cannot directly use it, I need to use s"timestamp", for instance

aggregate {
   min_time = min s"timestamp" -- using "min timestamp" would generate an error because "timestamp" is a type
}

isn't there a way to see that what follows is necessarily a value, and should be coerced to a value?

maelp avatar Jan 30 '24 14:01 maelp

Hi @maelp , good questions, thanks for opening an issue.

Check out https://prql-lang.org/book/reference/syntax/keywords.html — these are solvable with backticks and this. respectively:

from `cse(2024,1,5)`
aggregate {
   sum this.time
}
SELECT
  COALESCE(SUM(time), 0)
FROM
  "cse(2024,1,5)"

-- Generated by PRQL compiler version:0.11.1 (https://prql-lang.org)

Does that make sense?

max-sixty avatar Jan 30 '24 18:01 max-sixty

Thanks! However if I use this DuckDB MACRO

CREATE MACRO cse(year, weekA, weekB) AS TABLE SELECT * FROM read_parquet(list_transform(
            generate_series(weekA, weekB),
            week -> format('s3://${BUCKET_PATH}/year={0}/week={1}/0.parquet', year, week)
        ));

I think it doesn't create a "table" per-se, but should be used as SELECT * FROM cse(...)

do you think I should instead create a table? not sure if I can do this as a macro / function?

When I use your code I get the error:

Catalog Error: Table with name recent_cse(4) does not exist!

maelp avatar Jan 30 '24 20:01 maelp

The generated SQL is this

WITH table_0 AS (
  SELECT
    bms_id,
    GREATEST(value.temperature_mosfet, value.temperature_ic) AS _expr_0,
    GREATEST(value.temperature_front, value.temperature_back) AS _expr_1,
    LEAST(value.temperature_front, value.temperature_back) AS _expr_2
  FROM
    "recent_cse(4)"
)
SELECT
  bms_id,
  MIN(_expr_2) AS min_temp_cells,
  MAX(_expr_1) AS max_temp_cells,
  MAX(_expr_0) AS max_temp_bms,
  COUNT(*) AS count
FROM
  table_0
WHERE
  _expr_2 < -10
  OR _expr_1 > 40
  OR _expr_0 > 70
GROUP BY
  bms_id

I guess it should not use the " character because I want the macro to be run in duckdb no?

maelp avatar Jan 30 '24 20:01 maelp

I verified that if I run the same query without the quotes, eg not "cse" but directly cse it works

eg

WITH table_0 AS (
  SELECT
    bms_id,
    GREATEST(value.temperature_mosfet, value.temperature_ic) AS _expr_0,
    GREATEST(value.temperature_front, value.temperature_back) AS _expr_1,
    LEAST(value.temperature_front, value.temperature_back) AS _expr_2
  FROM
    recent_cse(4)
)
SELECT
  bms_id,
  MIN(_expr_2) AS min_temp_cells,
  MAX(_expr_1) AS max_temp_cells,
  MAX(_expr_0) AS max_temp_bms,
  COUNT(*) AS count
FROM
  table_0
WHERE
  _expr_2 < -10
  OR _expr_1 > 40
  OR _expr_0 > 70
GROUP BY
  bms_id

is there a way to generate this simply from PRQL?

maelp avatar Jan 30 '24 20:01 maelp

And if I try

from s"recent_cse(4)"
derive {
  min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
  max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
  max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
}
filter min_temp_cells < -10 || max_temp_cells > 40 || max_temp_bms > 70
group bms_id (
  aggregate {
    min_temp_cells = min min_temp_cells,
    max_temp_cells = max max_temp_cells,
    max_temp_bms = max max_temp_bms,
    count = count this,
  }
)```

it says

s-strings representing a table must start with SELECT

maelp avatar Jan 30 '24 20:01 maelp

Ah, so we want literally FROM cse(2024,1,5) and not the quoted FROM "cse(2024,1,5)".

In that case, I think we do need to use the from s"SELECT * FROM cse(2024, 1, 4)", since currently PRQL expects an identifier.

We do have prior art for something more with the read_parquet-like functions; so it's not out of the question to do something similar.


This probably isn't going to make this perfect immediately, but it would be possible to make a function so the inline call were elegant:

prql target:sql.duckdb

let cse = x y z -> s"select * from cse({x},{y},{z})"

from (cse 1 2 3)
derive x = 5
WITH table_0 AS (
  SELECT
    *
  from
    cse(1, 2, 3)
)
SELECT
  *,
  5 AS x
FROM
  table_0

-- Generated by PRQL compiler version:0.11.1 (https://prql-lang.org)

max-sixty avatar Jan 30 '24 20:01 max-sixty

would be interesting if we could have something like this! Or allow macros and duckdb functions in PRQL or something

maelp avatar Jan 30 '24 21:01 maelp

BTW is it possible to do a multi-line PRQL function (this does not seem to be shown in the documentation)

I'm trying to do a helper function, then call it like this

let checkTemps = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (from (recent_cse numWeeks)
    derive {
      min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
      max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
      max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
    }
    filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
    group bms_id (
      aggregate {
        min_temp_cells = min min_temp_cells,
        max_temp_cells = max max_temp_cells,
        max_temp_bms = max max_temp_bms,
        count = count this,
      }
    )
)

checkTemps 4 -10 40 70

but it doesn't work...

maelp avatar Jan 30 '24 22:01 maelp

The -10 needs to be parenthesized...

let checkTemps = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
from (recent_cse)
    derive {
      min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
      max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
      max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
    }
    filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
    group bms_id (
      aggregate {
        min_temp_cells = min min_temp_cells,
        max_temp_cells = max max_temp_cells,
        max_temp_bms = max max_temp_bms,
        count = count this,
      }
    )
)

checkTemps 4 (-10) 40 70

...works!

(also I changed (recent_cse numWeeks) as that's not a function in the example)

Parentheses can be a bit confusing in some limited circumstances — explained here: https://prql-lang.org/book/reference/syntax/operators.html#parentheses. The main thing we can do is to make the error messages much much better.

max-sixty avatar Jan 30 '24 23:01 max-sixty

would be interesting if we could have something like this! Or allow macros and duckdb functions in PRQL or something

We could have something very very simple like duckdb_macro cse 1 2 3. It's not elegant.

How do you / others think about macros vs functions? To the extent they're overlapping, I would deprioritize work here — having a boilerplate function for each macro doesn't seem too bad. To the extent they'll be coexisting lots, we could think about improving the interface.

max-sixty avatar Jan 30 '24 23:01 max-sixty

Would be helpful to have a better help message indeed! I thought the issue was something else

yes the boilerplate can work I agree, so not a big priority. It's just that I want to be able to do queries both in SQL and PRQL so I'd like to have macros available in both to avoid duplicating code

would it be possible to rewrite that macro in pure PRQL (eg do you have access to duckdb generate_series and list_transform in PRQL? possibly in order to have those available, having something like a ::duckdb:generate_series and ::duckdb:list_transform as tokens (or a kind of PRQL "module" that we could import like "import duckdb" and then "duckdb.generate_series", "duckdb.list_transform" and "duckdb.call_macro" could be useful?)

maelp avatar Jan 31 '24 09:01 maelp

Maybe you want to see this. https://eitsupi.github.io/querying-with-prql/method_chaining/

eitsupi avatar Jan 31 '24 09:01 eitsupi

Sorry to use this issue to ask questions but I find that the documentation doesn't show all the power of PRQL yet, for instance: is there a way to do a ternary operation or an if / then / else in PRQL?

I'd like to do something like

select {
  result = someTemp < -10 ? "below" : "above"
}

is that possible?

maelp avatar Jan 31 '24 10:01 maelp

Thanks!

Last question (and small "bug report")

when I use this it works fine

let check_temperatures = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
  from (recent_cse numWeeks)
  derive {
    min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
    max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
    max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
  }
  filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
  group bms_id (
    aggregate {
      min_temp_cells = min min_temp_cells,
      max_temp_cells = max max_temp_cells,
      max_temp_bms = max max_temp_bms,
      count = count this,
    }
  )
)

But if I'd like to "rephrase" the min_temp, max_temp, etc at the end to only display if they are above / under their respective threshold it fails, and the error message is weird (it complains about the "->" in the function definition)

let check_temperatures = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
  from (recent_cse numWeeks)
  derive {
    min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
    max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
    max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
  }
  filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
  group bms_id (
    aggregate {
      min_temp_cells = min min_temp_cells,
      max_temp_cells = max max_temp_cells,
      max_temp_bms = max max_temp_bms,
      count = count this,
    }
    derive {
      min_temp_cells = case [
        min_temp_cells <= minCellTemp => min_temp_cells
        true => ""
      ],
      max_temp_cells = case [
        max_temp_cells >= maxCellTemp => max_temp_cells
        true => ""
      ],
      max_temp_cells = case [
        max_temp_bms >= maxBMSTemp => max_temp_bms
        true => ""
      ],
    }
  )
)

(another small nitpick: why not using the same "->" for case and function, and introducing another symbol like "=>" ?)

maelp avatar Jan 31 '24 10:01 maelp

I think you forgot the comma in arrays.

case [
  foo # bad line
  bar
]

eitsupi avatar Jan 31 '24 11:01 eitsupi

ah indeed you're right, but the error message is confusing, @max-sixty would there be a way to show the "correct" error in those cases?

maelp avatar Jan 31 '24 11:01 maelp

@max-sixty would there be a way to show the "correct" error in those cases?

It's definitely possible. I'd say it's harder than I thought it would be when I originally proposed terser syntax...

For example, in that case we need to realize we're in a list, possibly attempt to execute a function bar with an argument foo & realize foo isn't a function — then add those up to say "maybe foo should have a trailing comma, and is part of a list". And we don't have column names atm.

I would really like to focus on this, and think this will be the focus of our GSOC application.

max-sixty avatar Jan 31 '24 19:01 max-sixty

(another small nitpick: why not using the same "->" for case and function, and introducing another symbol like "=>" ?)

FYI there was some discussion on this in a past issue... I don't think anyone had a really strong view, we ended up favoring consistency with other langs

max-sixty avatar Jan 31 '24 19:01 max-sixty