pygeoapi icon indicating copy to clipboard operation
pygeoapi copied to clipboard

PostgreSQL DISTINCT slows API down to 20 seconds per request

Open geekdenz opened this issue 3 years ago • 11 comments

Description Since a few version it seems like the API is super slow. A request for 10 items takes about 20 seconds to load. I checked the geometries and they are not complex.

Steps to Reproduce Steps to reproduce the behavior: Create a view with many columns off a table that has geometries in another projection in the original table.

Expected behavior The page to load in less than 5 seconds.

Screenshots/Tracebacks I found the issue. It is here: https://github.com/geopython/pygeoapi/blob/master/pygeoapi/provider/postgresql.py#L256 The DISTINCT in the query slows it down by a factor of about 20. It takes 1 second to render a remote page when I remove the DISTINCT. I don't think the DISTINCT is necessary for it being correct as the primary key should imply uniqueness.

Environment

  • OS: Linux
  • Python version: 3 (what is in current Docker container)
  • pygeoapi version: latest

Additional context Add any other context about the problem here.

geekdenz avatar Sep 19 '20 05:09 geekdenz

@francbartoli any ideas?

tomkralidis avatar Sep 19 '20 18:09 tomkralidis

@tomkralidis it had fixed duplicates on query with bbox in query parameter

francbartoli avatar Sep 19 '20 20:09 francbartoli

Is DISTINCT necessary to fix the duplicates in the extent query? Seems like duplicates should not be caused by an extent query. The reason DISTINCT causes this might be because I have a view and it computes the geom column (reprojects it). We are using a DB cursor which I don't think is ideal either and it probably has to query the whole table to satisfy the DISTINCT and therefore computes the projected geom data for every limit of even just 10.

geekdenz avatar Sep 21 '20 00:09 geekdenz

Having said that, I tried with a materialized view and still had the problem. The solution was to remove DISTINCT.

geekdenz avatar Sep 21 '20 00:09 geekdenz

I'm also finding the same issue: the DISTINCT on this line (using the currently latest commit: eb891010) is causing all my calls to this endpoint to time out. I'm also using a materialized view.

I'm happy to make a PR to remove it, but I'm not enough of a SQL expert to know whether there are sensible reasons to leave it in. Any thoughts @francbartoli ? Are you able to explain the problem that DISTINCT was solving?

jemrobinson avatar Feb 10 '22 16:02 jemrobinson

Maybe a more complicated query would fix this and the thing that distinct solves. I don't have access to a computer right now but would be interested to see. A query that uses an index that is.

geekdenz avatar Feb 10 '22 19:02 geekdenz

@geekdenz @jemrobinson I can't recall exactly what was the dataset that made emerging the problem with duplicates but it was , for sure, related to a bbox query. We can revert that commit https://github.com/geopython/pygeoapi/commit/0ae20848f3ae52dd8b8ae73bd4a57b44fbd8671d and then come up with a test to reproduce and fix the problem

francbartoli avatar Feb 10 '22 21:02 francbartoli

@francbartoli @jemrobinson @geekdenz @jorgejesus I have been using pygeoapi for a short time and am facing slowness problems on large tables. After research, I realized that it is the DISTINCT of the query that is in question and came across this post. Adding the DISTINCT causes slowness issues for the reasons discussed earlier: postgres cursors can't work optimally with a DISTINCT because the query needs to be globally evaluated to be able to deduplicate records. In the commit history, on 02/24 3caf4a9 there was the addition of the distinct. It was removed in the next commit on 2/25, the next day d007eb8, then added again by @francbartoli on 3/3 for GDAL3 support 72b9a64 Without mastering the concerns related to the bbox parameter, I don't think that adding the DISTINCT is the right solution and I suggest going back to the query without DISTINCT. Is it possible ? Thank you Nicholas.

nicolasvri avatar Sep 13 '22 13:09 nicolasvri

@nicolasvri let's wait for the merge of #964 that will refactor the backend provider by using SQLAlchemy

francbartoli avatar Sep 29 '22 18:09 francbartoli

If #964 did not resolve it, maybe putting a sorted index on the selected columns will speed it up.

geekdenz avatar Oct 02 '22 07:10 geekdenz

I just ran into this problem with a database table with 3.2 million rows - just loading the items view takes more than 20 seconds.

I did a very quick test by creating a local custom postgresql.py provider - essentially replacing DISTINCT with DISTINCT ON. I ran pygeoapi in docker with this change and now the items view loads immediately. I also tested the sortby functionality with other fields after this change and these queries also seemed to work. These other fields also have simple indexes.

Below is a summary of the changes I made while testing.

def _make_distinct(self, sortby):
    ret = [self.id_field] if self.id_field not in sortby else []
    for _ in sortby:
        ret.append(_["property"])
    return SQL(",".join(ret))

Call function before sortby is modified and add value to query.

distinct = self._make_distinct(sortby)
orderby = self._make_orderby(sortby) if sortby else SQL("")

sql_query = SQL(
    'DECLARE "geo_cursor" CURSOR FOR \
    SELECT DISTINCT ON({}) {} {} FROM {} {} {}'
).format(
    distinct,
    props,
    geom,
    Identifier(self.table),
    where_clause,
    orderby,
)

Refs: https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT

solismaam avatar Oct 15 '22 15:10 solismaam

As per RFC4, this Issue has been inactive for 90 days. In order to manage maintenance burden, it will be automatically closed in 7 days.

github-actions[bot] avatar Mar 10 '24 21:03 github-actions[bot]

As per RFC4, this Issue has been closed due to there being no activity for more than 90 days.

github-actions[bot] avatar Mar 24 '24 03:03 github-actions[bot]