Ability to generate queries with distinct/group by
PostgREST should have a way to perform a SELECT DISTINCT query.
I suggest a new reserved query word 'distinct'.
- With no value it would just do a SELECT DISTINCT, e.g. /mytable?distinct
- Otherwise it is a comma-separated list of columns (or computed columns) which would be used inside of SELECT DISTINCT ON (...), e.g. /mytable?distinct=name would be SELECT DISTINCT ON (name) * FROM mytable
Note that the columns that you are distinct on will also need to have order applied.
This has certainly been a frequently requested feature. My only hesitation is making "distinct" another reserved word, meaning it would block any filters on a column that happened to be called "distinct." However, like "select," this one is a sql reserved word too so it makes sense that we reserve it as a special url parameter.
I hesitate to try messing with the url parsing and query generation part of the code myself, that's more @steve-chavez's domain. Perhaps he can estimate the difficulty of this feature.
Not sure if this should be implemented; by allowing distinct to be applied to any column unrestricted, clients could potentially DDoS a database.
I bumped into a slow distinct query in PostgreSQL a while ago and solved it by using a group by instead of distinct. I remember distinct generating a more expensive seq scan; I don't have the details anymore, but a quick googling suggests the problem could still persist:
https://dba.stackexchange.com/questions/93158/how-to-speed-up-select-distinct https://dba.stackexchange.com/questions/63268/postgresql-being-slow-on-count-distinct-for-dates
As mentioned in one of the answers, you can mostly avoid doing distinct with better database-design or better queries.
So in general I think distinct is an expensive operation (similar to group by) and it is better exposed behind a view or a stored procedure.
@steve-chavez this is a good point against this. PostgREST should expose only safe/fast operations.
For what it's worth, DISTINCT is just syntactic sugar for GROUP BY and I think in modern versions of postgres it's even implemented using the same code. (And DISTINCT ON is a postgres extension from way back that's a bit of a performance hack). So @steve-chavez's conclusion is correct.
As mentioned in one of the answers, you can mostly avoid doing distinct with better database-design or better queries.
I have many situations where I've wished for postgrest to have a distinct operator.
e.g. a table that contains events (timestamp, device, event_type, data): I want the latest event for each device, subject to filters such as timestamp/event_type. So how would you allow a query to be written that accomplishes this? With my proposal it would be http://myapp.com/api/events?timestamp=lt.2017-08-08&event_type=eq.alarm&order=timestamp.desc&distinct=device
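For reference, under that proposal the URL would roughly map to SQL like the following (a hypothetical translation, not something PostgREST generates today; note that the DISTINCT ON column has to lead the ORDER BY):
select distinct on (device) *
from events
where "timestamp" < '2017-08-08'
  and event_type = 'alarm'
order by device, "timestamp" desc;  -- latest event per device among the filtered rows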
The query would have to be behind a stored procedure/view, it's definitely necessary to use distinct for some cases, but as I've mentioned before exposing distinct unrestricted poses a threat for the database, a similar argument has been made for group by support in https://github.com/begriffs/postgrest/issues/158#issuecomment-223765083.
The query would have to be behind a stored procedure/view,
When it's behind a view/stored procedure then you can no longer support the arbitrary filters that I love about postgrest.
it's definitely necessary to use distinct for some cases,
In that case I think it's a requirement that we include the functionality.
but as I've mentioned before exposing distinct unrestricted poses a threat for the database
If you are afraid of performance issues, then perhaps we can only allow distinct when the jwt claim (or other setting?) permits it? However I think there are already performance questions around postgrest: for a large table, the default route with no filters will be a huge response that may slow the database down to build.
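As a sketch of the claim-gated idea (the allow_distinct claim and the view/table names are made up; request.jwt.claim.* settings are what PostgREST derives from the JWT):
create view api.latest_events as
select distinct on (device) *
from private.events
-- only expose the expensive path when the caller's JWT carries the claim
where current_setting('request.jwt.claim.allow_distinct', true) = 'true'
order by device, "timestamp" desc;
This silently returns no rows when the claim is missing or false; raising an exception instead would make the restriction explicit.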
You can create a stored procedure that returns setof events and apply filters on it like on a table (see the tests). The same goes for a view: you can create it as in create view distinct_devices as select distinct device, timestamp.. and do regular filters on that.
You can create a stored procedure that returns setof events and apply filters on it like on a table (see the tests). The same goes for a view: you can create it as in create view distinct_devices as select distinct device, timestamp.. and do regular filters on that.
But the filtering needs to happen before the distinct operation
There will always be a fight between the two opposing needs: expose more sql features vs protecting the database; striking the balance is hard. In this particular thread I am with @steve-chavez for now.
Now, if features like this http://www.databasesoup.com/2015/02/why-you-might-need-statementcostlimit.html make it into the core, we might relax a bit and expose more of the sql. In that situation I would go with implementing the group by (while at the same time having the default setting for the cost limit at a low value).
Should we close this for now? I don't see it going further in the immediate future.
This is probably my most wanted feature; and it certainly gets requested a lot in the chat room.
Perhaps we could have a toggle in the config file allow_distinct for the performance concerned?
@ruslantalpa I also think that the idea of the cost limit is a good one; though it's not that reliable, a max limit could be worked out. That is also a must for other features like multiple requests in a transaction. I made a related comment on "Customize query timeouts" #249; maybe we should continue the discussion there and see if we can implement that feature independently.
@daurnimator That toggle could be easily abused by the users, I don't think it should be added to the config, this phrase summarizes my reasoning:
A well-designed system makes it easy to do the right things and annoying (but not impossible) to do the wrong things.
Users should be thinking twice before exposing expensive operations like distinct; as it is now, PostgREST should not make it easy for users to make that kind of mistake.
@daurnimator Have you had a look at pREST? It exposes more SQL directly, at the cost of potential performance issues, so may satisfy your DISTINCT needs via its GROUP BY support.
Just chiming in to request this enhancement be implemented. Distinct and Group-by will make quite a few views unnecessary in our case.
If one's really careful and uses the format function, this can be done with a proc and dynamic sql.
Example for DISTINCT:
create function projects(dist name) returns setof projects as $_$
begin
  return query execute format($$
    select distinct on(%I) * from projects
  $$, dist);
end; $_$ language plpgsql;
Then you can use all of pgrst filters normally:
-- filtering
curl "localhost:3000/rpc/projects?id=eq.1&dist=name&"
-- selecting
curl "localhost:3000/rpc/projects?select=id,name&dist=name"
Still, exposing DISTINCT to clients should be thought through carefully.
I've thought about this and come up with a solution.
The idea is this: group by and function calling in select operate on the "virtual table" returned by the where filters. In the context of public apis, you never want to return more than a couple of thousand rows, so allowing group by on a few thousand rows poses no risk. The problem then becomes how do you make views that do not allow returning a lot of rows at the same time. And it's done like this.
First you have a function like this one:
create or replace function validate(
  valid boolean,
  err text,
  details text default '',
  hint text default '',
  errcode text default 'P0001'
) returns boolean as $$
begin
  if valid then
    return true;
  else
    RAISE EXCEPTION '%', err USING
      DETAIL = details,
      HINT = hint,
      ERRCODE = errcode;
  end if;
end
$$ stable language plpgsql;
and you define the view like so
create view test.protected_books as
select id, title, publication_year, author_id
from private.books
where
  validate(
    (current_setting('request.path', true) != '/protected_books') or -- do not check for filters when the view is embedded
    ( -- check at least one filter is set
      -- this branch is activated only when request.path = /protected_books
      (current_setting('request.get.id', true) is not null) or
      (current_setting('request.get.publication_year', true) is not null) or
      (current_setting('request.get.author_id', true) is not null)
    ),
    'Filter parameters not provided',
    'Please provide at least one of id, publication_year, author_id filters'
  );
This is the way :grin:. The first parameter of validate can be any boolean expression and you can make arbitrarily complex checks on the incoming requests; for example, on a time-series table one could check that both start/end dates are provided and that the diff is no more than a month. For this to work, the request.get needs to be made available in postgrest :). I've only implemented it in my commercial version, along with support for group by, function calling in select like max/sum ... and window functions; it's not released yet, only on my computer, and it seems to work great.
So the general idea would be to protect an endpoint against unwanted, unlimited requests, by throwing an error for those.
I guess you could achieve that in a function called through pre-request (http://postgrest.org/en/latest/configuration.html#pre-request). If you throw an error there, that should be good. You would of course need the query data there as well - but maybe those could be passed in as an argument (e.g. of type json) to the pre-request function. I think this approach would be a bit more flexible regarding the "checking of query string filters", because you can look for filters where you don't know the name beforehand as well. I imagine this is not as easy with the request.get.... approach.
the request.get needs to be made available in postgrest
If that one refers to the url query string, then it would be better as request.query_string.<key> or maybe request.qs.<key>. I can see support in postgrest for that, though I'm not sure if the value (eq.some, lt.123) would be useful for anything other than checking whether the value is set. Unless we somehow change the pgrst prefix to the pg operator, lt.123 to < 123 (probably not worth it).
support for group by, function calling in select like max/sum
Support for aggregates looks cool. Personally I would prefer the pre-request approach that Wolfgang mentioned, that way the VIEWs are not tangled with pgrst logic. Looks like that could be done by conditionally checking the path in the pre-request and then checking if the filters are applied.
Come to think of it, the required filter approach looks like an extended version of pg-safeupdate.
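Roughly, such a pre-request check could look like this (a sketch only: it assumes the proposed request.qs.* variables existed, which they don't today; the function would be wired up via the pre-request config option):
create or replace function check_required_filters() returns void as $$
begin
  if current_setting('request.path', true) = '/protected_books'
     and current_setting('request.qs.id', true) is null
     and current_setting('request.qs.publication_year', true) is null
     and current_setting('request.qs.author_id', true) is null then
    raise exception 'Filter parameters not provided'
      using hint = 'Provide at least one of id, publication_year, author_id';
  end if;
end;
$$ language plpgsql;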
pre-request was a design mistake imo. There's rarely a valid use-case for it that can't be implemented better in the proxy.
Having the conditions checked in the view definition, with static boolean logic (and not in an unrelated imperative function) is exactly the point, a self contained "smart view".
Query string as opposed to separate parameters can work (the query can be split up by a db function), but I don't know which way is better/easier to use; haven't thought about it.
About the values: you are talking as if the programming env in pg is some crippled thing. It's trivial to write a function to split the lt.123 value and let the consuming function decide what to do with it (since it knows the context).
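For example, a tiny helper along these lines would do (names are illustrative):
create or replace function split_filter(raw text, out op text, out val text) as $$
  select split_part(raw, '.', 1),                      -- operator prefix, e.g. 'lt'
         substring(raw from position('.' in raw) + 1); -- everything after the first dot, e.g. '123'
$$ language sql immutable;
-- select * from split_filter('lt.123');  --> op = 'lt', val = '123'
The consuming code can then cast val and interpret op however its context requires.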
pre-request was a design mistake imo. There's rarely a valid use-case for it that can't be implemented better in the proxy.
How about checking a custom header for validity based on data in the DB? That's exactly what I am doing for a subdomain-based multi-tenancy setup. Proxy sets the subdomain as a header and redirects the request. If that client/subdomain doesn't exist, I need to throw. I wouldn't know how to do that without pre-request and only proxy-based. So definitely not a design mistake, but very much needed.
Having the conditions checked in the view definition, with static boolean logic (and not in an unrelated imperative function) is exactly the point, a self contained "smart view".
That's for sure a positive, I agree. Although you can turn that around as well: If you need multiple endpoints to be limited in a very similar way, you can reduce code duplication a lot by putting that in a pre-request function instead of the VIEW.
But anyway: When setting the query string parameters as variables, both approaches can be taken.
Query string as opposed to separate parameters can work (the query can be split up by a db function), but I don't know which way is better/easier to use; haven't thought about it.
I think it was just a different name instead of the .get. part, that Steve was suggesting, but the idea was still to split the query string up into parameters already.
About the values: you are talking as if the programming env in pg is some crippled thing. It's trivial to write a function to split the lt.123 value and let the consuming function decide what to do with it (since it knows the context).
It will not be as trivial, though, once you start adding nested logic, like or=(a.eq.A,b.eq.B,...) and more complex stuff.
I thought about adding the already-parsed filters for a second, but I doubt that would be any less complex, because this would mean we had to pass the whole logic tree in. I'm not sure whether that would be worth it, tbh - but if we wanted to have the logic tree available, passing it as a JSON argument to pre-request seems much more likely to solve that.
Given the complexity of implementing something that really works well not only for a couple of cases, but also in general, I wonder, whether we should maybe take a completely different route to solve the original problem:
What do you guys think about adding some kind of limit on the expected cost of the query? So if we were to do something roughly along the lines of:
BEGIN;
SET request.x=... -- all kinds of variables, like we currently do (no need for .get. / .qs.)
EXPLAIN read_query;
SET pgrst.estimated_cost=<from the explain>;
SELECT pre_request_function(); -- can use the estimated cost here to throw if you'd like
read_query; -- or use the estimated cost in the query implicitly: VIEW definition throwing like Ruslan suggested.
We could even provide a sane default for pgrst.max_estimated_cost and throw immediately if we wanted to. Of course with a nice error message explaining how to increase/disable the allowed cost if needed. ;)
@wolfgangwalther Yeah, I believe that approach would also be worth exploring. Check this related issue: https://github.com/PostgREST/postgrest/issues/249#issuecomment-334629208 (some ideas around the pg_plan_filter module).
Oh, great - thanks for the link. Having different limits imposed for different roles is a great idea. Doing something like that in a pre-request function with the suggested pgrst.estimated_cost would also be easily possible, I guess. I think we're better off implementing this ourselves (by just providing that variable) than using pg_plan_filter. The problems with the per-user settings you're describing in the other thread are one thing - but it's also a lot less flexible, because we would now expect users to install that extension. If installing this extension were not expected, we would be back at where we are right now: we can't make sure to execute "safe" queries only per default - so we would, again, be stopped from adding anything like DISTINCT / GROUP BY to the query syntax.
I agree about the extension shortcoming. Some other ideas:
EXPLAIN read_query;
SET pgrst.estimated_cost=<from the explain>;
AFAICT, the EXPLAIN result cannot be used as an expression in SQL. Maybe it's possible to execute an anonymous plpgsql DO BLOCK, obtain the result inside and then set it to our GUC?
Ideally we would enforce that the cost doesn't exceed a threshold. This could be an additional config option. If this config is enabled then we can allow group by/distinct.
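A rough sketch of that DO block idea (the target query is a placeholder and the pgrst.* variable names are just suggestions):
do $$
declare
  plan json;
  cost numeric;
begin
  -- run the read query through the planner only
  execute 'explain (format json) select * from projects' into plan;
  cost := (plan -> 0 -> 'Plan' ->> 'Total Cost')::numeric;
  -- expose the estimate to pre-request / view code for the rest of the transaction
  perform set_config('pgrst.estimated_cost', cost::text, true);
  -- optionally enforce a configurable threshold right here
  if cost > coalesce(nullif(current_setting('pgrst.max_estimated_cost', true), ''), '1000000')::numeric then
    raise exception 'query exceeds the allowed estimated cost (%)', cost;
  end if;
end $$;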
I wouldn't know how to do that without pre-request
I just showed how, a few comments above, with that validate function; it takes a bool as a parameter ... which can be anything, any function call or even a subquery. You just reached for the familiar "check it in a function" and didn't consider that the logic can be embedded in a view definition.
you can reduce code duplication a lot by putting that
Again, same thing: put that logic in a function and call the same function in the definition of each view, so the "condition", even though in a function, is colocated with the view.
we had to pass the whole logic tree in... I'm not sure whether that would be worth it
the parameters are in an array in apiRequest var which is available in the middleware ...
It will not be as trivial, though, once you start adding nested logic, like or=(a.eq.A,b.eq.B,...) and more complex stuff.
While that is true, the thing to remember here is we are trying to make sure a view is not called without a "main" filter column, the one that filters out most of the rows, and that column/condition will not usually be behind an or, i.e. you don't need to parse the or get parameter. Just because a solution does not solve every possible use case does not mean it's not worth implementing (especially since it solves most of the use cases).
query cost
The main point is to not allow long-running queries; you are complicating things with explain. Most of the time you just want to "kill" the request after a certain threshold, sort of like max_execution_time in php - that is how it's implemented in every stack. You don't need to estimate (since estimates can be wrong and you cannot estimate the /rpc path). This is already possible by setting it per user or even providing it as a parameter in the connection string, i.e. every request that goes through postgrest can be set to terminate after, say, 1s (and it returns a custom error), and the proxy can check that error and ban an ip if it detects more than, say, 3 in a row ...
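For example, a sketch of the per-role variant (the role name is illustrative; ALTER ROLE ... SET applies when that role logs in, so putting it on PostgREST's login role covers every request):
-- every statement run by sessions logged in as this role is killed after 1 second
alter role authenticator set statement_timeout = '1s';
The same setting can also be passed in the connection string via libpq's options parameter.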
Given the complexity of implementing something that really works well not only for a couple of cases
It's not complex, it's 1 line. It does not need to work in all cases; it needs to work in most cases (and it does - only the or parameter is a bit harder to parse, not trivial but not rocket science, and it's not postgrest that needs to parse it but user code in the db).
The big picture is being missed here: the "get" parameters need to be in the context not to solve the "protect view" problem; they just need to be there so that the database code can make decisions based on them. It's the same reasoning as to why headers, path, method are in the transaction context.
AFAICT, the EXPLAIN result cannot be used as an expression in SQL. Maybe it's possible to execute an anonymous plpgsql DO BLOCK, obtain the result inside and then set it to our GUC?
I think we have two choices:
- parse the explain "offline" in haskell and then run the SET from there. Needs another round-trip to the server.
- put that thing in a DO BLOCK and do it at once. You can't return results from the DO BLOCK, so you might have to run the EXPLAIN twice (because we are already running it for the estimated count, aren't we?).
Once we have a pg_postgrest extension to be able to define FUNCTIONs, we could put the EXPLAIN in a function that returns the result as JSON and sets the estimation variable as a side-effect.
I wouldn't know how to do that without pre-request
I just showed how, a few comments above, with that validate function; it takes a bool as a parameter ... which can be anything, any function call or even a subquery. You just reached for the familiar "check it in a function" and didn't consider that the logic can be embedded in a view definition.
You cited me out of context. You said it could be better done in the proxy. I said, I have no idea how to do that in the proxy alone.
The "check-function-in-view-definition" solution, obviously does not work for tables that are exposed directly. Following your suggestion I would need to create a VIEW for every table in my schema. Now, because VIEWs don't support row level security and are still only available as "security definer" in postgres, I would need to replicate all the row level securities in the VIEW definition as well.
It's just a couple of lines in pre-request.
query cost
The main point is to not allow long-running queries; you are complicating things with explain. Most of the time you just want to "kill" the request after a certain threshold, sort of like max_execution_time in php - that is how it's implemented in every stack. You don't need to estimate (since estimates can be wrong and you cannot estimate the /rpc path). This is already possible by setting it per user or even providing it as a parameter in the connection string, i.e. every request that goes through postgrest can be set to terminate after, say, 1s (and it returns a custom error), and the proxy can check that error and ban an ip if it detects more than, say, 3 in a row ...
You're right about the /rpc - estimating won't work there. But neither will "filtering the main columns", because you only filter on the result of the function call and that long running query has already been executed. If you do the RPC way, you will most likely pass in those values to restrict the resultset by function arguments, so you already have everything in place right now.
Estimates can be wrong, yes. But your approach of allowing queries with certain filters set, while disallowing queries without those filters, is nothing other than a very crude estimation of the expected resultset. There is no difference conceptually, except that the postgres estimates are way more sophisticated and a lot more dynamic than the static check of filters. That seems like a big plus to me.
It makes a lot of sense to have a time limit for requests in place as well, as another safeguard - but if I understood correctly, that is already possible with postgres tools, yes? I remember reading in the other thread (it might have actually been you writing this) that this would just kill the connection to the server, but the query would still be running on the server to completion? Maybe I missed something here or confused different concepts.
[...] the "get" parameters need to be in the context, not to solve the "protect view" problem, they just need to be there so that it enables the database code to make decisions based on them, it's the same reasoning as to why headers, path, method are in the transaction context.
I sort of agree with that in general. Still, I think it makes sense to try to provide this metadata about the request in a way that makes it easy to use. And parsing the whole query string is not "easy to use". We are not just providing the plain HTTP request as well to let the user parse it; we are splitting it up in headers, path, method, etc. already. So that's basically the same thing.
You cited me out of context. You said it could be better done in the proxy.
Yes, that's true. I was just showing an easy solution for your specific problem that can be implemented directly in the db and does not require an additional roundtrip like the pre_request. It can also be done in a robust way in the proxy: on proxy startup you read all the subdomains into a map (this is lua ... you can connect to the db and run queries) and then for every request you check the map in memory without going to the db (+ some logic to update the map when a subdomain is added/removed). This way, you not only eliminate the pre_request roundtrip that is executed for EVERY http request, you also eliminate the additional query being executed against the subdomains table (though this complexity in the proxy is better to avoid at the start, while the "condition in view" is fast enough).
I would need to create a VIEW for every table in my schema
That is the right way to use postgrest: the api schema should be just views and functions, to have the separation between the model and the api and to have the freedom and power to implement custom logic in the db that is related to the api (but not the data model), as in smart views, instead-of triggers, computed columns ...
Now, because VIEWs don't support row level security
That is not true; RLS works when going through views, it's just that the current_user changes, and it's always been that way.
just kill the connection to the server, but the query would still be running
What you read is that this happens if the http client drops the connection; it has nothing to do with what I said above about the query time limit.
but we are splitting it up in headers, path, method,
Isn't that what I said originally? Have every get parameter in a separate var?
But neither will "filtering the main columns", because you only filter on the result of the function call
You are conflating two things here; in that context I was debating the query-estimate vs query-time-limit approach, not the "smart views that require filters". As you said, rpc already does the main filtering based on its arguments.
To reiterate, there are 2 distinct capabilities being talked about here, both of which enable some safety: a) provide get parameters (which in turn enable views to become smart and protect themselves against selects with no filters); b) the max_execution (works now) vs max_cost (hard to implement, requires a roundtrip and does not work on rpc) approach, which is a safeguard against expensive queries. If max_cost were a PG setting just like max_execution then great, but it's not; it's just a lot of complexity and a 25% performance drop for all requests (because of yet another roundtrip, just like pre_request). Security/resilience is a team job (proxy/postgrest/db) and trying to do everything in postgrest is the wrong direction.