prql icon indicating copy to clipboard operation
prql copied to clipboard

Feature Request: control what pipelines are enabled

Open armanm opened this issue 2 years ago • 4 comments

I have a use case in mind that probably make sense as a feature of PRQL. In my project PRQL helps users access data using a language resembling SQL without exposing the underlying DB to customers. Therefore, PRQL acts as a sandbox that prevents users from executing DDL and DML statements in Postgres and this is very helpful ❤️.

Now I want to offer support for joins, windowing and a few other pipelines to pro users and I've been thinking about how that can be done. One place to enforce these restrictions could be the PRQL compiler if it can be configured to restrict available pipelines – i.e. enable/disable things like joins, s-strings, windows, functions, etc ...

Personally I see two main use cases for why one would add support for a query language to their product:

  1. idiomatic access to data
  2. fine-grained control of what's possible through the query language

All that to say, maybe the compiler options can be extended to precisely define what features of PRQL are allowed – otherwise every user of PRQL who needs that control has to implement it for themselves.

armanm avatar Dec 26 '22 07:12 armanm

Interesting! You're correct that PRQL is designed for the query use-case, though because that's where we perceive the greatest need.

How do you tradeoff having these constraints at the DB level — i.e. permissions attached to users — vs at the language level?

What would be the logic behind allowing / disallowing language features? At the moment anything is possible™ with s-strings, so I could see some logic there. What about others?

To set expectations: my sense is that we shouldn't have this on the current roadmap, since it would add complication to both the compiler and the interface, and permissions are best managed at the data layer. But open minded, and it's a useful signal that you're not the first person to suggest this.

max-sixty avatar Dec 26 '22 17:12 max-sixty

@max-sixty thanks for responding so quickly. This reply may get a little philosophical to further explain the thinking here but please bear with me 😄

I think we can agree that there are going to be a number of different scenarios under which PRQL is going to be a great choice. The obvious ones that I can think of are:

  • one finds the PRQL syntax more ergonomic over SQL
  • no DML and DDL aspect is super handy ... this is being discussed in #1092

There is a third category of users (me) who view PRQL in a less obvious way:

PRQL language is pretty sweet but it will be sold short by offering itself as only a transpiler for SQL (or at least I don't perceive the PRQL website claiming otherwise). OTOH almost all relational db systems with SQL language support a greater number of features than PRQL will support for a really long time. For the majority, accessing existing database using SQL is no issue.

Therefore the question that this project should attempt to answer is what is the key value of PRQL besides its cool syntax but yet limiting language? At this point it's worth drawing a parallel to CoffeeScript which found small adoption but eventually was rejected by the community in favour of JavaScript.

CoffeeScript is a JavaScript-like language that compiles into JavaScript. Looking at the syntax, it's kind of a combination of Ruby and Python. Technically, it's very, very nice. And pretty, especially compared to JavaScript, which isn't pretty. The problem is, CoffeeScript solves a problem that really isn't a problem. https://www.quora.com/What-are-disadvantages-of-using-CoffeeScript/answer/Mattias-Petter-Johansson

This may sound like I'm criticising PRQL but in fact I have chosen to adopt it and want to see it thrive. I choose PRQL because hopefully it can offer control over aspects of the language that will make it versatile and extensible.

How many other developer friendly SQL parsers are out there that can help a SaaS product expose data using a SQL like language? (I'm speaking of things like RQL for Rollbar, Dimensions Search Language from Uber research or Streaming SQL from Materialize.com - all are examples of SaaS products that expose data using SQL). The answer to that question is very few – which means developers have to build the language, its parser and a compiler from scratch if they can't find a jumping point and the thesis of this reply is that PRQL should be primed to fill that gap and be a jumping point for developers.

There really is no need to assume that PRQL output will always be fed into a conventional SQL query engine. For developers, building a command line utility for querying an excel sheet or listing files from disks using the PRQL language should totally be possible. In other words PRQL language should be geared towards, flexibility, easy grammar to adapt and overall domain agnostic.

In my case I have some data in Postgres but in fact the rest of it are files on disks to be consumed by Apache Arrow Datafusion which has no concept of access control. Things like DuckDB and Sqlite have no concept of access per table. Suddenly if you want to limit joins and various functions, things are very limiting. This is where something like PRQL can help to create governance for your domain through the language.

armanm avatar Dec 27 '22 07:12 armanm

@armanm I very much agree with your thoughts! I think they're very perceptive. Joining you for a moment in zooming out:

  • PRQL could be Coffeescript — a cool idea but not different enough to JS to succeed (and arguably the growth of frameworks like React, despite solving a different problem, reduced some of the pain of JS). I second your explanation of why it wasn't successful.
  • PRQL could also be Kotlin — compatible with JVM, but different and improved enough that it's captured a large share of the Java market, made developers' lives better, and allowed more people to write software.

And so our assignment is to figure out the most important dimensions, such that users are willing to bear the start-up cost of a new language, and tool-builders are incentivized to build on top of PRQL.

And then zooming back in: I agree that governance is a very salient dimension for some folks. OTOH I don't have a great visualization of how we'd add lots of value there at the moment. We could ban s-strings, but that hardly adds fine-grained access management. My guess is that we'd need some sort of semantic layer with a map of tables and how they can be accessed — possible but not that close to our current state. So we'd need to wait for more demand signal before embarking on it. Another possibility is that the compile offers enough transparency into what's being requested that another tool could sit between the compiler and databases, and do this excellently.

Does that make sense to you? What do you think I'm missing?

max-sixty avatar Dec 27 '22 20:12 max-sixty

Oh awesome, I see that we are on the same page ✨

  • PRQL could be Coffeescript — a cool idea but not different enough to JS to succeed (and arguably the growth of frameworks like React, despite solving a different problem, reduced some of the pain of JS). I second your explanation of why it wasn't successful.

  • PRQL could also be Kotlin — compatible with JVM, but different and improved enough that it's captured a large share of the Java market, made developers' lives better, and allowed more people to write software.

Now that we are going down this thought exercise another good example to throw into the mix is TypeScript which objectively is a success story through its tight integration with VSCode but really the key value was a working intelliesense if you ask me and also why there is no space left for Facebook Flow.

The vision to replicate a Kotlin like success holds water and observing the community's use of PRQL is one way to validate that vision.

We could ban s-strings, but that hardly adds fine-grained access management

For use cases that I've presented s-strings end up being a showstopper. Even if not having it is no fine-grained access management, adding support to turn it off continues to makes sense because it will lock users down to what's currently supported in PRQL.

I 100% think this should be done. Such an option is comparable to Duckdb's access_mode or enable_external_access options.

My guess is that we'd need some sort of semantic layer with a map of tables and how they can be accessed — possible but not that close to our current state. So we'd need to wait for more demand signal before embarking on it.

Agreed – there isn't enough information for a blueprint of what that could look like. Let's kick the can down the road on this approache.

Another possibility is that the compile offers enough transparency into what's being requested that another tool could sit between the compiler and databases, and do this excellently.

This feels like the right balance to me too. I simply need enough capabilities within the PRQL compiler to let me experiment with what may right now feel like a fringe use cases.

If the compiler can expose the following information, this will open the door for many possibilities that shouldn't be baked into PRQL now or maybe never:

  • what tables are being used and if there are any joins
  • what functions are being called
  • when s-string is being used

The final piece is support for additional standard library functions – i.e. ability to expand these functions through configuration. This paragraph from the PRQL roadmap literally makes the case for it because the standard library is related to the domain. It's unlikely that users will require the stdard library for MS SQL, Oracle, Postgres, SQLite, etc ... all together at once. However if they could be user defined then the entire problem could go away from PRQL's point of view.

Currently, the standard library is quite limited . It contains only basic arithmetic functions (AVERAGE, SUM) and lacks functions for string manipulation, date handling and many math functions. One challenge here is the variety of functionalities and syntax of target DBMSs; e.g. there’s no standard regex function. Improving our testing framework to include integration tests will help give us confidence here.

armanm avatar Dec 28 '22 01:12 armanm

If the compiler can expose the following information, this will open the door for many possibilities that shouldn't be baked into PRQL now or maybe never:

* what tables are being used and if there are any joins

* what functions are being called

* when s-string is being used

Perfect, I think this would be excellent — it would enable this specific case and be very general. I've also added this to the latest roadmap update at https://github.com/PRQL/prql/pull/1374

If it's OK, let me rename this issue to reflect the more general scope.

max-sixty avatar Jan 03 '23 19:01 max-sixty

@armanm Have you checkout our RQ output?

Here is the rustdoc (which is a bit outdated, maybe you want run cd prql-compiler && cargo doc --open from the repo).

For this query:

from albums
derive my_col = s"some_sql_func({title})"
join artists [==artist_id]

... cargo run resolve _a.prql produces:

{
  "def": { "version": null, "other": {} },
  "tables": [
    {
      "id": 0,
      "name": "albums",
      "relation": {
        "kind": {
          "ExternRef": {
            "LocalTable": "albums"
          }
        },
        "columns": [
        	{ "Single": "title" },
        	{ "Single": "artist_id" },
        	"Wildcard"
        ]
      }
    },
    {
      "id": 1,
      "name": "artists",
      "relation": {
        "kind": {
          "ExternRef": {
            "LocalTable": "artists"
          }
        },
        "columns": [
        	{ "Single": "artist_id" },
        	"Wildcard"
        ]
      }
    }
  ],
  "relation": {
    "kind": {
      "Pipeline": [
        {
          "From": {
            "source": 0,
            "columns": [
              [{ "Single": "title" }, 0],
              [{ "Single": "artist_id" }, 1],
              ["Wildcard", 2]
            ],
            "name": "albums"
          }
        },
        {
          "Compute": {
            "id": 3,
            "expr": {
              "kind": {
                "SString": [
                  { "String": "some_sql_func(" },
                  {
                    "Expr": {
                      "kind": {
                        "ColumnRef": 0
                      },
                      "span": { "start": 1157, "end": 1162 }
                    }
                  },
                  { "String": ")" }
                ]
              },
              "span": { "start": 1131, "end": 1165 }
            }
          }
        },
        {
          "Join": {
            "side": "Inner",
            "with": {
              "source": 1,
              "columns": [
                [{ "Single": "artist_id" }, 4],
                ["Wildcard", 5]
              ],
              "name": "artists"
            },
            "filter": {
              "kind": {
                "Binary": {
                  "left": {
                    "kind": {
                      "ColumnRef": 1
                    },
                    "span": {
                      "start": 1180,
                      "end": 1191
                    }
                  },
                  "op": "Eq",
                  "right": {
                    "kind": {
                      "ColumnRef": 4
                    },
                    "span": { "start": 1180, "end": 1191 }
                  }
                }
              },
              "span": {
                "start": 1180,
                "end": 1191
              }
            }
          }
        },
        {
          "Select": [0, 1, 2, 3, 4, 5]
        }
      ]
    },
    "columns": [
      {"Single": "title"},
      {"Single": "artist_id"},
      "Wildcard",
      {"Single": "my_col"},
      {"Single": "artist_id"},
      "Wildcard"
    ]
  }
}

(I haven't read the whole discussion, so I apologize in advance if come across as ignorant.)

aljazerzen avatar Jan 04 '23 11:01 aljazerzen

I actually didn't realize we had come so far on this, that's very impressive. Maybe I add some docs on this?

I also think we could add things like the lineage of columns or a list of features without too much trouble, if folks like @armanm would like it

max-sixty avatar Jan 04 '23 20:01 max-sixty

It is documented as the API of the prql-compiler crate and I wouldn't promote it too much, because I often find something I want to change.

aljazerzen avatar Jan 05 '23 09:01 aljazerzen

OK. I'm happy to help document & solidify it if someone starts consuming it.

So I think we're at a decent place:

  • Observability of the query is useful, both in general, and specifically so folks like @armanm could control what features end-users have access to
  • The compiler currently produces a decent amount of data to allow that, like the tables & columns
  • If folks needs more info (e.g. a list of functions), that feature can be expanded
  • Probably the "filter what features are permitted" shouldn't be in the prql-compiler crate; instead it's specific enough that someone could build something on top of PRQL. Though if we got lots of demand signals for this features, we could reconsider this as something primary.

So I'll close this given we're in that place, but feel free to comment here / open another issue with any further threads. Thanks

max-sixty avatar Jan 05 '23 18:01 max-sixty

It is documented as the API of the prql-compiler crate and I wouldn't promote it too much, because I often find something I want to change.

BTW on reflection I think this is impressive enough that we would benefit from having something demonstrating this, even if it has a big warning that it'll change.

I've also been reflecting on why we haven't been successful at getting folks to integrate PRQL into their tools so far — I wonder whether giving access to metadata like this (+ extensions, like a DAG for each column) would let folks build some really cool things...

max-sixty avatar Jan 07 '23 06:01 max-sixty

@max-sixty sorry for a super late reply. I had to step away from this problem for a little while. I now see this issue has received more attention which I'm very grateful for 🙏

I also think we could add things like the lineage of columns or a list of features without too much trouble, if folks like @armanm would like it

Absolutely ❤️

support for additional standard library functions – i.e. ability to expand these functions through configuration.

What did you think of the idea to let users expand the list of supported functions? I strongly wish to avoid allowing users to use s-strings in order to inject arbitrary text into the query but I suspect in most cases a user would do that to call a function which PRQL doesn't support out of the box.

I've also been reflecting on why we haven't been successful at getting folks to integrate PRQL into their tools so far — I wonder whether giving access to metadata like this (+ extensions, like a DAG for each column) would let folks build some really cool things...

I'm one of those folks looking at PRQL and wondering if it's the right thing to use and build an analytics tool on top of DataFusion. Arguably PRQL syntax offers a nice alternative to SQL but that isn't what's interesting about it to me. PRQL is interesting because it nicely separates query from where the query will finally run and where I may be able to build governance specific to my product. What I want from PRQL is maximum observability and control and I suspect anyone looking at PRQL for their product will probably have a similar view.

armanm avatar Jul 31 '23 09:07 armanm

What did you think of the idea to let users expand the list of supported functions?

Yes! We have early support for modules & multi-file projects (here's an example, so this is possible and will become better supported.


PRQL is interesting because it nicely separates query from where the query will finally run and where I may be able to build governance specific to my product. What I want from PRQL is maximum observability and control and I suspect anyone looking at PRQL for their product will probably have a similar view.

Yes, I very much agree. I think our "strategy" is to have PRQL be useful now by compiling to SQL — solving the chicken & egg problem — and then becoming a better language for expressing these over time.

So to the extent there are features that you think would immediately unlock the longer-term vision, we'd be v interested in those. My expectation is that we're not quite there yet, and so we should continue predominantly focusing on the SQL use-case, with an eye to features that unlock the longer-term vision...

max-sixty avatar Jul 31 '23 18:07 max-sixty