postgrest icon indicating copy to clipboard operation
postgrest copied to clipboard

Improvements to query language

Open wolfgangwalther opened this issue 4 years ago • 15 comments

We have discussed a few shortcomings of our current query language (expressions in the query string). Taking those suggestions together - and adding a few more points, that seem to be inconsistent to me in the current implementation.

I wonder whether we should try to implement all those improvements as non-breaking separated changes - or just invent a query language v2, where we can chose via config option which language to use. Maybe #1804 can lead us to have a cleanly separated QueryLanguage module that we can replace with a new module.

Here we go:

  • Change operator syntax from col=op.val to col.op=val: #1970. This will allow to use swagger-ui to send requests. It will also allow plain-html requests without javascript like this:

    <form action="/table?select=html">
       <input type="text" name="title.eq" />
     </form>
    

    This was not possible before, because the value would have to be constructed dynamically.

  • Separating filter logic from values: #1970. This will allow doing more complex queries in swagger-ui involving logic conditions, too. It will also simplify a lot of our qouting requirements already. Let's put all logic into a logic=... parameter. Maybe filters= is even better.

  • Replace current in with col.eq.any=<array> as suggested in #1943. I'd go one step further and remove in completely. We are using ANY under the hood anyway and together with logic=, we would remove all manual quoting we'd need to do. We'd rely only on PG array quoting.

  • Move select and order to parameters of the mimetype in the Accept header. Both parameters only shape the response - but the underlying entity is the same. As such, I think they don't really fit in the query string, which should describe which entity to receive. This would also remove the strange interaction we have between the select= and the Accept header already: When you request a custom mimetype, e.g. application/octet-stream or text/plain, the whole select= doesn't make any sense. So the availability of select is already dependent on the mediatype. Example: Both of the following requests will return the same person. Returning fewer or more columns is not much different to just returning the same entity in a different format.

    GET /people?uid.eq=123&select=*
    GET /people?uid.eq=123&select=lastname,firstname
    

    would become

    GET /people?uid.eq=123
    Accept: application/json;select=*
    
    GET /people?uid.eq=123
    Accept: application/json;select=lastname,firstname
    

    I'm not sure whether we can still use the mimetype application/json here. The registration mentions no parameters at all: https://www.iana.org/assignments/media-types/application/json. It specifically hints that charset is not a valid parameter, but we are returning it anyway. At the same time RFC6838 says:

    New parameters SHOULD NOT be defined as a way to introduce new functionality in types registered in the standards tree, although new parameters MAY be added to convey additional information that does not otherwise change existing functionality.

    As we would still be returning valid json, I think we would not change existing functionality, so adding parameters should be fine. If we end up not wanting to do this, we could still do so with our own custom media type, though.

  • Remove limit and offset parameters entirely - this can be done via the respective range headers.

  • Replace !inner on the embedding with an exists filter as suggested in https://github.com/PostgREST/postgrest/issues/2009#issuecomment-981202986. Note, this does not really play well with moving the embedding to the Accept header, yet. I don't think referencing a definition that is sent in the header from the query string is a wise idea. But given how the SQL for an exists query actually looks like (EXISTS (... sub query...)), maybe we can do something like exists=<embedding-spec>:

    GET /people?exists=jobs!job_fk
    

    This means that in some cases the embedding + hint needs to be repeated between query string and header, but this makes thing much more explicit. Later on, we could even support true filtering via EXISTS (...), which could then be combined with other filters in the logic= part. And only if it's part of a top-level AND and requested in the select= part, then we would make it a INNER JOIN.

  • Move columns to a parameter of the mimetype in the Content-Type header, which describes the request body. Same arguments apply as above.

  • Move on_conflict to a parameter of the Prefer: resolution header. POSTing the same request once with on_conflict and once without really does not change the returned entity. Parameters on prefer headers can be given like this, according to the RFC:

    Prefer: resolution=merge-duplicates;col1;col2;col3
    
  • Change the Accept: application/vnd.pgrst.object+json way to request a single resource to a path parameter something like this:

    GET /people;single?uid.eq=123
    GET /people;one?uid.eq=123
    

    This is because, the ability to query for a single row instead of multiple rows is really not inherent to the media type requested. Right now, it's not possible to query for a single row in text/csv, application/octet-stream, text/plain and other formats, which is a severe limitation. Changing it like this would also allow to people to rewrite those PK-equality requests easily via nginx from something like /people/123. And since there is no interaction with the mimetype, this will not conflict with custom accept headers that might be sent.

This would leave only filters (identified by at least one . before the =), arguments (without .) and logic= query parameters, reducing the amount of "reserved keywords" by a fair bit.

Using parameters to all kinds of things etc. is also much more consistent with what the RFCs allow for those, instead of cramping everything into the query string.


Not sure whether everything in here can really be done. I haven't found a good solution for filters / limits / offsets on embedded resources, yet. It doesn't make much sense to move those to the accept header - but keeping them in the query string would create a strange interaction between query string and accept header once again.

One idea, that I haven't fully thought through, yet, could be to split the embeddings from the select part. If the embeddings were defined in the query string, but the select= in the accept header would reference those, that would be fine, I think. Having an interaction in this direction between header and query string would make sense, imho.

Embedding other resources really changes the entity we are requesting, too - so they should not be moved to the header. What about using the flexibility the path specification of a URI gives us for embedding?

GET /people,jobs

GET /people,jobs;jobs_fk?age.eq=15&jobs.title.eq=Boss
Accept: application/json;select=name,age,jobs.title

This would also play nicely with the ;single or ;one parameter for requesting a single entity, which is kind of similar to a hint of relationship type to embedd !m2m, !o2m, etc.:

GET /people;one,jobs;o2m?...

They both do kind of the same thing in returning an object instead of an array.

wolfgangwalther avatar Dec 03 '21 08:12 wolfgangwalther

I like many of the ideas from an HTTP standpoint, but they'd be such a big breaking change(consider all client libraries we have now) that we might as well change our name if we did them. Many users will wonder why they should not just try to migrate to a GraphQL/OData(which are behind standards) solution if we do such a change.

or just invent a query language v2

Yes, this is why it's important to document our REST syntax(ref), maybe we could do a RESTQL/RSQL/RESQL V2 if we'd ever adopt such a change. But even if we were to support both our v1 and v2 at the same time, I think that would still create doubts about our stability.

So I think before doing any major breaking changes to our REST syntax, we should offer extensibility(https://github.com/PostgREST/postgrest/issues/1909#issuecomment-952375549) to provide a way to be compatible with other standards(GraphQL/Odata).


Some comments:

GET /people?uid.eq=123 Accept: application/json;select=*

This looks better from a REST/HTTP standpoint, but there are many crippled http clients out there that can't send headers, we would leave those out or require even more use cases for proxies.

GET /people,jobs;jobs_fk?age.eq=15&jobs.title.eq=Boss

Ah, isn't ; required to be urlencoded? I'm not sure, but it seems too likely that we'd be creating impedance with http clients out there, causing similar problems to the ones our current or/and operators generated. Also, IMHO ; somehow looks bad in the URL path.

GET /people?exists=jobs!job_fk

This one looks good and is actionable, will take it to a new issue.

steve-chavez avatar Dec 03 '21 18:12 steve-chavez

Ah, isn't ; required to be urlencoded? I'm not sure, but it seems too likely that we'd be creating impedance with http clients out there, causing similar problems to the ones our current or/and operators generated. Also, IMHO ; somehow looks bad in the URL path.

No, they are specifically allowed for that purpose: https://datatracker.ietf.org/doc/html/rfc3986#section-3.3

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.

wolfgangwalther avatar Dec 03 '21 19:12 wolfgangwalther

I like many of the ideas from an HTTP standpoint, but they'd be such a big breaking change(consider all client libraries we have now) that we might as well change our name if we did them. Many users will wonder why they should not just try to migrate to a GraphQL/OData(which are behind standards) solution if we do such a change.

I certainly don't want to propose a general breaking change in the API here. I think there is value in "providing the most HTTP-standard compliant interface" in itself, but also many of those proposals solve actual problems we have.

Yes, this is why it's important to document our REST syntax(ref), maybe we could do a RESTQL/RSQL/RESQL V2 if we'd ever adopt such a change. But even if we were to support both our v1 and v2 at the same time, I think that would still create doubts about our stability.

So I think before doing any major breaking changes to our REST syntax, we should offer extensibility(#1909 (comment)) to provide a way to be compatible with other standards(GraphQL/Odata).

If we offer an extensible interface for the query language and have both our V1 and V2 to make use of that interface / prove that it's working well - I don't see us creating doubts about stability. It's the other way around, I think. It increases confidence in a stable non-breaking interface, because new features, that would imply breaking changes can much easier be developed while keeping backwards compatibility in other versions of our language.

wolfgangwalther avatar Dec 03 '21 19:12 wolfgangwalther

Accept: application/json;select=lastname,firstname

Now that we're closer to having XML support for tables, the select above seems redundant for each mimetype.

I was wondering if there's a more standard way for embedded resources. I've been looking at HAL, but its verbosity makes me think it's not adequate for big result sets. Also users would need HAL-specific clients to really use the data.

Following the line of thought of changing the path syntax and that embedding resources have a different type, I thought of representing the result of an embed as a different resource - GET /usersEmbedClients - but that of course wouldn't work for many embeds and tree-structured data.

An embed query parameter looks more "standard", if we'd adopt it I guess we could do:

GET /projects?select=id,name,clients(name,tasks(name))

As

GET /projects?embed=clients&clients.embed=tasks&select=id,name&clients.select=name&clients.tasks.select=name

Which has the disavantage of being more verbose and more importantly it would lose the ability of having an order for the fields in the resulting body(since proxies don't guarantee the order of the query parameters IIRC).

With the http QUERY method, IMO the http working group likely has given up on representing embedded data cleanly within the URL syntax restrictions(there's even SQLish examples in the spec). With that, it feels futile trying to improve our embedded resource syntax.

I do agree that changing the operator syntax would be a win though.

steve-chavez avatar Jun 13 '22 02:06 steve-chavez

Once https://github.com/PostgREST/postgrest/issues/1970#issuecomment-1382613004 is implemented, I think we could support both the old and new filter syntax without a breaking change.

Change operator syntax from col=op.val to col.op=val

Looking at the parsers I think it'd be easier to do op.col=val than col.op=val.

col=val would be eq.col=val implicitly.

steve-chavez avatar Feb 04 '23 01:02 steve-chavez

Some ideas..

Embedding other resources really changes the entity we are requesting, too - so they should not be moved to the header. What about using the flexibility the path specification of a URI gives us for embedding? GET /people,jobs GET /people,jobs;jobs_fk?age.eq=15&jobs.title.eq=Boss Accept: application/json;select=name,age,jobs.title

Doesn't embedding just add attributes to the entity? You still do a FROM base_entity JOIN other on SQL after all. We could just say an embedding is a different representation of a resource(table).

Edit: Totally misread the RFC. It was related to media type parameters not query parameters. Still the new media type below might be worth considering.

~~Also RFC 2046 says:~~

~~Parameters are modifiers of the media subtype, and as such do not fundamentally affect the nature of the content. The set of meaningful parameters depends on the media type and subtype.~~

~~So the query parameters depend on the media type, we don't need to put the select on the header. Also, omitting columns from select doesn't fundamentally change the nature of the content - it's still json, we're just leaving some fields as 'undefined'.~~

Considering the above, it sounds possible to document the embedding possibilities for a resource. In OpenAPI we can list the Accept media types. Using films in resource embedding, we could say it accepts:

paths:
  /films:
    get:
      ...
          content:
            application/vnd.pgrst.related+json; rel=films>-directors
            application/vnd.pgrst.related+json; rel=films-<nominations
            application/vnd.pgrst.related+json; rel=films-<roles
            application/vnd.pgrst.related+json; rel=films>-<actors
            application/vnd.pgrst.related+json; rel=films>-<competitions

>- = many-to-one. -< = one-to-many. >-< many-to-many. one-to-one could be --.

So say we get the following request(with nesting):

GET /films?select=*,roles(*, actors(*)),directors(*)

We would respond with our usual json, but with the added header:

Content-Type: application/vnd.pgrst.related+json; rel=films-<roles,films>-directors,roles>-actors

The gain here is that we could document the relationships a resource has. Which was previously impossible in OpenAPI.

steve-chavez avatar May 19 '23 04:05 steve-chavez

Considering the above, it sounds possible to document the embedding possibilities for a resource. In OpenAPI we can list the Accept media types. Using films in resource embedding, we could say it accepts:

This would only make sense if the embedding were to take place via accept header / mediatype. I don't think that would be correct. Adding an embedding will return a new entity - it will not merely change the return format. That means it does not belong in the header.

The gain here is that we could document the relationships a resource has. Which was previously impossible in OpenAPI.

Really what I suggested above about embeddings and the path component would be the way forward. Just requesting people is different from just requesting jobs is different from requesting a combination of both. If we had something like GET /people,jobs, then it would also be possible to add those new paths to the OpenAPI output.

wolfgangwalther avatar Jun 10 '23 14:06 wolfgangwalther

Change operator syntax from col=op.val to col.op=val

As another datapoint, this also helps when user input is passed to a stored procedure via GET.

Currently, query=foobar is processed correctly for CREATE FUNCTION search(query TEXT). But with query=in.foobar, Postgrest no longer interprets query as a function param, but as a filter.

There doesn't seem to be a simple way to force the function param interpretation (quoting required preprocessing in the stored procedure to remove the quotes).

erickedji avatar Jun 27 '23 09:06 erickedji

Adding an embedding will return a new entity - it will not merely change the return format. That means it does not belong in the header.

Just noticed that even the Prefer: exact=count hints that the above is true. The count is different when embedding is done (see https://github.com/PostgREST/postgrest/issues/2009 and https://github.com/PostgREST/postgrest/issues/2846). So indeed it's a new entity and doesn't belong in the header, makes sense because people,jobs can be represented in json, csv or other media types.

GET /people,jobs;jobs_fk?age.eq=15&jobs.title.eq=Boss

This would also play nicely with the ;single or ;one parameter for requesting a single entity, which is kind of similar to a hint of relationship type to embedd !m2m, !o2m, etc.: GET /people;one,jobs;o2m?...

Now that we're moving on from doing disambiguation on the URL(https://github.com/PostgREST/postgrest/issues/2863), the syntax can be much simpler so we don't need the FK or the cardinality in the path. So we could just go with GET /people,jobs.


There's a problem in expressing nested embedding on the path, since it needs to shape the JSON output. For this I believe we should keep the select parameter.

GET /actors,roles,films?select=roles(character,films(title,year))

Or later as a media type parameter too:

GET /actors,roles,films

Accept: application/json;select=roles(character,films(title,year))

If we had something like GET /people,jobs, then it would also be possible to add those new paths to the OpenAPI output.

Yes, and we could do this now without a breaking change.


Another syntax:

GET /users(projects,tasks(subtasks))

That would allow us to express joins on resources.

steve-chavez avatar Jul 18 '23 20:07 steve-chavez

Adding an embedding will return a new entity - it will not merely change the return format.

If doing an embedding (a foreign key join) creates a new entity, what about an aggregate like SUM, AVG, etc? I mean it has the potential to reduce a REST collection to a single resource.

I'm mostly trying to see if we could do away without adding GROUP BY.

Been checking project m36 aggregates and they say:

SQL supports aggregate queries using aggregate functions (SUM(), COUNT(), AVERAGE(), etc.) and GROUP BY while the aggregate functions in the relational algebra simply accept relation values as arguments.

But I haven't had the chance to play it with yet (their nix-shell took ages), so not sure if they don't really need GROUP BY.

steve-chavez avatar Aug 30 '23 19:08 steve-chavez

PRQL has a good approach for this (ref): aggregates don't change the number of rows by default but only add a new column using window functions:

https://prql-lang.org/book/reference/stdlib/transforms/aggregate.html#aggregate-is-required

This will prevent failure from clients whenever they forget to add GROUP BY.


So I think the above answers the previous question:

If doing an embedding (a foreign key join) creates a new entity, what about an aggregate like SUM, AVG, etc? I mean it has the potential to reduce a REST collection to a single resource.

An aggregate doesn't have to change the amount of rows. We can say that GROUP BY is a filter in a way.

steve-chavez avatar Aug 30 '23 19:08 steve-chavez

Another improvement for the query language: use ~ for not. It's shorter/clearer (likely easier for the parsers) and it doesn't require urlencoding: https://perishablepress.com/stop-using-unsafe-characters-in-urls/

/a?col.~eq=3

steve-chavez avatar Aug 30 '23 20:08 steve-chavez