postgrest icon indicating copy to clipboard operation
postgrest copied to clipboard

`exists` filter - When embedding with top-level filtering (inner join), empty parentheses should be allowed

Open Pigeo opened this issue 3 years ago • 7 comments

Environment

  • PostgreSQL version: 12.10 (Debian 12.10-1.pgdg100+1)
  • PostgREST version: 9.0.1 (linux-static-x64)
  • Operating system: Debian Buster (10.12)

Description of issue

Trying to do :

http://127.0.0.1:3000/film?select=*,film_category!inner()

I get an error message column film.film_category does not exist.

Firstly, the error message is out of touch, since we are performing an INNER JOIN and the columns / relation does exist.

Secondly, it should not trigger an error: this query is perfectly legitimate and feasible ("I want all the films that do have a category, but I'm not interested in knowing what are their precise categories, therefore I would like to have a reduced JSON, not containing the film_category data, i.e. just filtering the top-level on "has a film_category", but not embedding the data)

What I have to do as a work-around: put any dummy column inside the parentheses, so that I don't get an error message anymore:

http://127.0.0.1:3000/film?select=*,film_category!inner(film_id)

And then I get a fat JSON body response, where I'm trashing all this embedded data... (waste of bandwidth + slow web app because of that)

Steps to reproduce

Let's take a sample database from https://www.postgresqltutorial.com/postgresql-sample-database/. The relevant schema is as follows:

CREATE TABLE public.film (
    film_id integer DEFAULT nextval('public.film_film_id_seq'::regclass) NOT NULL,
    title character varying(255) NOT NULL,
    description text,
    release_year public.year,
    language_id smallint NOT NULL,
    rental_duration smallint DEFAULT 3 NOT NULL,
    rental_rate numeric(4,2) DEFAULT 4.99 NOT NULL,
    length smallint,
    replacement_cost numeric(5,2) DEFAULT 19.99 NOT NULL,
    rating public.mpaa_rating DEFAULT 'G'::public.mpaa_rating,
    last_update timestamp without time zone DEFAULT now() NOT NULL,
    special_features text[],
    fulltext tsvector NOT NULL
);
CREATE TABLE public.category (
    category_id integer DEFAULT nextval('public.category_category_id_seq'::regclass) NOT NULL,
    name character varying(25) NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);
CREATE TABLE public.film_category (
    film_id smallint NOT NULL,
    category_id smallint NOT NULL,
    last_update timestamp without time zone DEFAULT now() NOT NULL
);
ALTER TABLE ONLY public.film_category
    ADD CONSTRAINT film_category_category_id_fkey FOREIGN KEY (category_id) REFERENCES public.category(category_id) ON UPDATE CASCADE ON DELETE RESTRICT;
ALTER TABLE ONLY public.film_category
    ADD CONSTRAINT film_category_film_id_fkey FOREIGN KEY (film_id) REFERENCES public.film(film_id) ON UPDATE CASCADE ON DELETE RESTRICT;

Populate those 3 tables as provided on this page, but then remove the category of some films, for example with:

DELETE FROM film_category WHERE category_id > 1; -- only 64 films will keep their category

Now suppose you're interested to know what are those 64 films which still has a category, but you don't want to embed their categories.

Suggestions / Discussion

If ever it's only a problem of syntax / parsing of the query string, not allowing any empty parentheses, then may I suggest to use a special keyword that would mean "no columns at all". It could be some reserved keyword from the SQL syntax, such as "NULL", so that we're pretty sure nobody would ever have a column with such a name in his/her database:

http://127.0.0.1:3000/film?select=*,film_category!inner(NULL)

(Currently says "column film_category.NULL does not exist", which is true)

Pigeo avatar Jun 20 '22 14:06 Pigeo

And then I get a fat JSON body response, where I'm trashing all this embedded data... (waste of bandwidth + slow web app because of that)

I see your point.

http://127.0.0.1:3000/film?select=*,film_category!inner()

We were planning to change the !inner syntax to exists, this would allow omitting film_category in the select. So how about:

GET /film?exists=film_category

steve-chavez avatar Jun 20 '22 22:06 steve-chavez

I also need to do things like that:

http://127.0.0.1:3000/film?select=*,film_category!inner()&film_category.category_id=not.eq.1

(Actually, that's the vast majority of my use cases. The former one, without any additional filter criteria, is rarer, and was cited here for simplification purposes)

As far as I understand how PostgREST works, currently I need to include film_category!inner() in the select=…, otherwise my filter criteria film_category.category_id=not.eq.1 is simply ignored and discarded without any notification or any error message.

How would you do that with your proposed syntax ?

(It would be great if PostgREST could automatically infer the necessity to perform an INNER JOIN from the bare presence of a filter criteria starting with my_other_table_name.column_name=…)

Pigeo avatar Jun 21 '22 09:06 Pigeo

To give some figures, in my production database, the equivalent of the film table in the above example contains ~1,000 rows, and the equivalent of the film_category table above contains ~1,000,000 rows (because it's not a N-to-N table).

Therefore, when using this work-around:

GET http://127.0.0.1:3000/film?select=*,film_category!inner(film_id) HTTP 1/1

I get a 12 MB JSON response, containing thousands of useless lines...

Whereas with a theoretical:

GET http://127.0.0.1:3000/film?select=*,film_category!inner() HTTP 1/1

The JSON response would be only 80 KB...

So, that makes a great difference!

Pigeo avatar Jun 21 '22 09:06 Pigeo

And then I get a fat JSON body response, where I'm trashing all this embedded data... (waste of bandwidth + slow web app because of that)

I see your point.

http://127.0.0.1:3000/film?select=*,film_category!inner()

We were planning to change the !inner syntax to exists, this would allow omitting film_category in the select. So how about:

GET /film?exists=film_category

Thinking about it once again: if you try to make an analogy with the SQL syntax, then I understand that the select=… part of the query should not change which rows are retrieved or not, and it should only change which columns are presents or not in the response body, and as such, the INNER JOIN should be located elsewhere than in the select=…. But the same argument can made about the WHERE clause (or what resembles it) : I'm not sure it's a good idea to rely on the content of the WHERE clause to determine if an INNER JOIN should be performed or not (to me, something like exists=film_category clearly is something similar to what can be found in the WHERE clause of an SQL query).

Maybe it is not for nothing that there is also a FROM clause in the SQL syntax... Perhaps you should consider adding some equivalent of the FROM clause for specifying which INNER JOINs should be performed?

Pigeo avatar Jun 21 '22 12:06 Pigeo

I also need to do things like that: http://127.0.0.1:3000/film?select=*,film_category!inner()&film_category.category_id=not.eq.1 How would you do that with your proposed syntax ?

Yes, that would also be possible with exists:

GET /film?exists=film_category&film_category.category_id=not.eq.1

As far as I understand how PostgREST works, currently I need to include film_category!inner() in the select=…, otherwise my filter criteria film_category.category_id=not.eq.1 is simply ignored and discarded without any notification or any error message.

That one was fixed in https://github.com/PostgREST/postgrest/pull/2133, v10 will include it.

I'm not sure it's a good idea to rely on the content of the WHERE clause to determine if an INNER JOIN should be performed or not (to me, something like exists=film_category clearly is something similar to what can be found in the WHERE clause of an SQL query).

Check the samples of the generated queries in https://github.com/PostgREST/postgrest/pull/1978#issue-1026866522, it's actually a LATERAL JOIN plus json_agg. For now the best syntax we could come up is the exists one, but feel free to suggest another one that's clearer.


One other thing we could try to support with exists is doing OR across different tables: https://github.com/PostgREST/postgrest/discussions/2014#discussioncomment-1598918

steve-chavez avatar Jun 21 '22 17:06 steve-chavez

Just noted that the anti-join mentioned on https://github.com/PostgREST/postgrest/pull/1949#issuecomment-922875327 could be supported with a not.exists filter.

We could also add an /tbl?select=*,embed!anti(*) option similar to the !inner but I don't think it would make sense because the embed would always be empty.

steve-chavez avatar Aug 15 '22 20:08 steve-chavez

Also considering #1949 (comment), that also suggests to just go with the non-breaking changes for now - and then open another issue where we can discuss a "new query syntax" that could get a few things straight:

  • I understand you're thinking about a new query syntax, for the long term, that would introduce a breaking change (for ex. by adding a new exists filter)
  • on the other hand, my initial request was just to relax a little bit the constraints on the current query syntax, so that empty parentheses should be allowed for embeddings. This does not introduce any breaking change, and thus could be done as of now, without having to wait for a new query syntax with breaking changes.

Since both things are independent of each other, would you please consider allowing empty parenthesis in the actual query syntax? (is it difficult to implement? Does it breaks anything?) Or do you expect the breaking new query syntax to be shipped in production so quickly that it is not worth correcting the current syntax?

Thank you for considering this possibility.

Pigeo avatar Aug 17 '22 09:08 Pigeo

Since both things are independent of each other, would you please consider allowing empty parenthesis in the actual query syntax?

Yes, using the ideas on https://github.com/PostgREST/postgrest/issues/1414#issuecomment-1256546536 now I think an empty parenthesis grants us a simpler/better syntax than exists. We could even replace the !inner with:

GET /projects?select=*,clients()&clients=not.is.null

Does it breaks anything?

This would not be a breaking change since empty parenthesis right now gives a syntax error

curl 'localhost:3000/projects?select=*,clients()'

{"code":"42703","details":null,"hint":"Perhaps you meant to reference the column \"projects.client_id\".","message":"column projects.clients does not exist"}

steve-chavez avatar Sep 23 '22 18:09 steve-chavez

Great. Then we can go on this.

Pigeo avatar Sep 24 '22 13:09 Pigeo