prql PRQL for Multidimensional data

Thanks for creating prql! It seems like a great idea.

I tried to try prql for the first time today. I followed the installation instructions and installed prql-python 0.3.1 in my python 3.10 environment. Then I tried to run the example code. On

sql = prql.compile(prql_query)

I got AttributeError: module 'prql_python' has no attribute 'compile'. Looking closer, and comparing the to tests

https://github.com/PRQL/prql/blob/eedc7fa3ac915ac79cf5a3de25fde4416fa49827/prql-python/python/tests/test_all.py#L14-L21

...I found that my API is different .dir(prql) shows

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'prql_python',
 'to_json',
 'to_sql']

This is a confusing experience for me as a first-time prql user. Am I doing something wrong?

Dec 31 '22 03:12 rabernat

Hi @rabernat ! Great to hear from you. And sorry that your first experience here is not so good.

We've just changed some of our APIs (here) and so there's a disconnect between the currently released version and main — the docs are on main with compile, and the most recent release uses to_sql.

Until we do a 0.4.0 release, compile is to_sql.

Let me know any issues with that at all...

Dec 31 '22 03:12 max-sixty

Gotcha, thanks a lot Max!

Now that I've got you here, I'll briefly share my idea, which will probably not surprise you. I wanted to play around with the idea of creating of compiling PRQL to xarray code. 😆 I'm convinced having some kind of query language interface to xarray would open up analytics to a lot more users. So I'm not interested in compiling to SQL but instead I would like to access the AST directly...is this just the json?

I'm sure you've thought about this idea as well, and I'd be curious to get your thoughts.

Dec 31 '22 14:12 rabernat

I wanted to play around with the idea of creating of compiling PRQL to xarray code.

It would be quite the combination. I'd be very keen to help.

It's crossed my mind before; here are some initial thoughts:

A query language that works as text does have some advantages[^1].
- It's self-contained / "flat" — no chance of "did you import X?"
- Flexible — can pipe it from the CLI / use it from a browser, no need to have a specific runtime
- Copy & paste generally works — maybe not always a good thing, but nice for beginners.
But Xarray already has a good API!
- I would generally prefer write PRQL than Pandas for simple transformations (and I'd mostly rather write pandas than SQL), but xarray is good (as a result of your & others impressive design & constant dedication).
- I was increasingly using xarray for 2D data. But the SQL hegemony is strong — where the data is already in a DB, pulling it out into memory / Zarr to use the xarray API seemed inefficient. So eventually got frustrated enough using SQL that I thought we could do better.
Are you thinking this would be something like an easy way into xarray / xarray's data format?
- I could imagine that being quite useful if we're willing to give some up things, like performance.
- For example, "do I use where or sel"; "what's a coord vs. an index" are questions I get a lot, and often for datasets which are small enough that it wouldn't matter to do the wrong thing.
- So maybe an approach that handled these for you could be good? You know the use cases much better than I do.

We'd have to figure out how to represent dimensions / variables / coords / indexes / etc

This is a 2 minute effort, but maybe something like

# Specify the file name here (even https...)
from `temperatures.zarr`

# filter works like sel (and maybe like where for non-indexes?)
filter (date | in @2020-01-01..@2021-01-01)

# select operate on variables rather than columns
select [lat, long, date, temp] 

# derive is basically the same as PRQL now, but no need 
# for a window to calculate groups for one variable over a subset of dimensions
derive temp_avg = (aggregate (average_temp))

derive temp_avg_date = (  
  group [lat, long] (
    aggregate (average temp)
  )
)

Still lots to figure out — what about isel / merge / etc?

I'm very happy to help, I'm especially happy to help on the "PRQL infra" side, if we needed any changes to PRQL to make something like this work.

[^1]: This doesn't feel like a complete explanation of why text-based queries seem so attractive. Very open to other ideas!

Dec 31 '22 23:12 max-sixty

While we probably could also do multidimensional queries with PRQL, I think we should stick to relational model, at least for now.

Also, I don't see how multidimensional is different from having a relation, whose primary key is compound of all dimensions (but that is probably my limited view on the matter).

Jan 02 '23 14:01 aljazerzen

While we probably could also do multidimensional queries with PRQL, I think we should stick to relational model, at least for now.

I agree it's not a project priority — we should keep focused, especially now.

But I'm def fine to spend some time on the idea given my experience with xarray (though I won't be the primary person driving it unfortunately, given PRQL's focus). And I don't think it would need to impact the standard relational query.

(and I think it's v speculative at this stage)

Also, I don't see how multidimensional is different from having a relation, whose primary key is compound of all dimensions (but that is probably my limited view on the matter).

Yes, good question. Briefly:

In a very abstract way they're equivalent, but in the same way that "all turing-complete languages are equivalent".
Having a multi-dimensional array is great for navigating those dimensions; i.e. you can ask for the next date or the next latitude, and those are the same for all latitudes & dates respectively
Storing dense multidimensional arrays in memory allows for extremely high performance across any dimension. e.g. "next date" is just moving a pointer; no need to work out what the next date is / query an index / etc. Summing over any number of dimensions, or calculating a moving average is similarly fast.
We can write functions that are abstract over number of dimensions without explicit map or loop logic. Have a function that calculates the returns to a portfolio of securities — e.g. (ds.weight * ds.constituent_return).sum('constituent')? Then pass an array in with multiple portfolios on a portfolio dimension, and you get back a result that has the returns for multiple portfolios. Want to do the same for multiple portfolios in multiple countries — add another country dimension!
The Xarray data model allows for multiple aligned variables — you can see in the logo at https://docs.xarray.dev. So we can have a Temperature array dimensioned by (date, lat, lon) and a Height array dimensioned by (lat, lon), aligned which each other, which makes operations between those again convenient and performant. In a relational model, those would need to be multiple tables with a foreign-key relationship, or nested arrays (whose nested column is secondary to the primary column). That means that with xarray you can have all your working data in a single aligned object you pass around, it's beautiful.

(ok not as brief as I'd hope but regardless...)

Jan 02 '23 18:01 max-sixty

prql prql copied to clipboard

PRQL for Multidimensional data

prql
prql copied to clipboard