pgstac
Occasional syntax errors from well-formed search queries
This is quite a tricky one to reproduce. We are running pgstac in production and very occasionally get "pq: syntax error at end of input" errors from pgstac search queries. The queries look well formed when we log them.
Looking at the database logs I can see:
STATEMENT: SELECT search($1::jsonb);
ERROR: syntax error at end of input at character 49
QUERY: EXPLAIN (format json) SELECT 1 FROM items WHERE
CONTEXT: PL/pgSQL function where_stats(text,boolean,jsonb) line 103 at EXECUTE
PL/pgSQL function search(jsonb) line 31 at assignment
We are running version 0.8.1, so that line is here - https://github.com/stac-utils/pgstac/blob/v0.8.1/src/pgstac/sql/004_search.sql#L497
I stared at this for a while and couldn't figure out how that where clause could be an empty string, so I started trying to reproduce it locally.
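For context on the error shape: the failing statement in where_stats is a dynamically built EXPLAIN, so if the cached where clause ever arrives as an empty string the statement ends right after WHERE, which is exactly what "syntax error at end of input" means here. A minimal sketch of that failure mode (the DO block and the inwhere variable are just illustrative, not the actual pgstac code):
-- Fails at parse time with "syntax error at end of input", before "items" is even looked up
DO $$
DECLARE
    inwhere text := '';  -- imagine the cached where clause came back empty
BEGIN
    EXECUTE format('EXPLAIN (format json) SELECT 1 FROM items WHERE %s', inwhere);
END $$;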
I ended up running 0.8.2 in a container with:
docker run \
-e POSTGRES_DB=stac \
-e POSTGRES_PASSWORD=admin \
-e POSTGRES_USER=admin \
-e PGUSER=admin \
-e PGPASSWORD=admin \
-e PGDATABASE=stac \
-p 5432:5432 \
ghcr.io/stac-utils/pgstac:v0.8.2 \
postgres -c 'log_statement=all'
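Once the container is up, you can sanity-check the connection and the installed pgstac version before moving on (the connection string just mirrors the docker run flags above, and get_version() is, as far as I know, the pgstac helper for this):
-- psql "postgresql://admin:admin@localhost:5432/stac"
SET search_path TO pgstac, public;
SELECT get_version();  -- should report 0.8.2 for the image above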
Then:
- Install Golang if you don't have it already
- Create a new directory
- Drop the following file into it
main.go
package main

import (
    "database/sql"
    "encoding/json"
    "fmt"
    "log"
    "sync"

    _ "github.com/lib/pq"
)

const (
    host     = "localhost"
    port     = 5432
    user     = "admin"
    password = "admin"
    dbname   = "stac"
)

// docker run -e POSTGRES_DB=stac -e POSTGRES_PASSWORD=admin -e POSTGRES_USER=admin -e PGUSER=admin -e PGPASSWORD=admin -e PGDATABASE=stac -p 5432:5432 ghcr.io/stac-utils/pgstac:v0.8.2 postgres -c 'log_statement=all'

func main() {
    psqlInfo := fmt.Sprintf("host=%s port=%d user=%s "+
        "password=%s dbname=%s sslmode=disable",
        host, port, user, password, dbname)
    db, err := sql.Open("postgres", psqlInfo)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    log.Println("Successfully connected!")

    query := `SELECT search('{"limit": 1}'::jsonb);`
    log.Printf("Search query: %#v", query)

    // Run the same search concurrently from two goroutines.
    wg := sync.WaitGroup{}
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(threadID int) {
            defer wg.Done()
            row := db.QueryRow(query)
            if err := row.Err(); err != nil {
                log.Fatalf("[%d] Error: %v", threadID, err)
            }
            var item json.RawMessage
            if err := row.Scan(&item); err != nil {
                log.Fatalf("[%d] Error: %v", threadID, err)
            }
            log.Printf("[%d] Result: %s", threadID, string(item))
        }(i)
    }
    wg.Wait()
}
go mod init pgstac_repro
go get github.com/lib/pq
go run main.go
This will spin up two goroutines and run the same search simultaneously from both. For me this gives the following pretty reliably:
2024/01/18 22:34:44 Successfully connected!
2024/01/18 22:34:44 Search query: "SELECT search('{\"limit\": 1}'::jsonb);"
2024/01/18 22:34:44 [1] Result: {"next": null, "prev": null, "type": "FeatureCollection", "context": {"limit": 1, "returned": 0}, "features": []}
2024/01/18 22:34:44 [0] Error: pq: syntax error at or near "LIMIT"
exit status 1
One of them succeeds, the other gives a syntax error with the same query.
The postgres logs show
2024-01-18 22:34:44.440 UTC [96] STATEMENT: SELECT search('{"limit": 1}'::jsonb);
2024-01-18 22:34:44.442 UTC [96] LOG: QUERY:
SELECT * FROM items
WHERE TRUE
ORDER BY
LIMIT '2'
2024-01-18 22:34:44.442 UTC [96] CONTEXT: PL/pgSQL function search_rows(text,text,text[],integer) line 82 at RAISE
SQL statement "SELECT jsonb_agg(format_item(i, _fields, hydrate)) FROM search_rows(
full_where,
orderby,
search_where.partitions,
_querylimit
) as i"
PL/pgSQL function search(jsonb) line 69 at SQL statement
2024-01-18 22:34:44.442 UTC [96] STATEMENT: SELECT search('{"limit": 1}'::jsonb);
2024-01-18 22:34:44.443 UTC [96] ERROR: syntax error at or near "LIMIT" at character 77
2024-01-18 22:34:44.443 UTC [96] QUERY:
SELECT * FROM items
WHERE TRUE
ORDER BY
LIMIT '2'
So, a different error but similar behaviour.
Again, I can't see exactly where in the code this could be happening. Rerunning the same program will not trigger it again; you need to either change the query or restart the database to trigger the bug again. I guess this means the problem is related to the query cache/statistics tables.
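Side note for anyone else reproducing this: instead of restarting the database, truncating pgstac's cache tables between runs should also bring the bug back (assuming you don't mind throwing away the cached searches):
-- Reset the query cache so the next search takes the "first time seen" path again
TRUNCATE pgstac.searches;
TRUNCATE pgstac.search_wheres;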
Any help would be appreciated. Let me know if there are any other details I can provide.
I bisected this. On 0.7.7 running the above code succeeds, and looking at the searches table I can see that the usecount for the search is 2 as expected. On 0.7.9 it fails and usecount is only 1. I think this means it was introduced in https://github.com/stac-utils/pgstac/pull/183/commits/194b08b9094830734def3902fc0b8caa21686117#diff-cd06c27f5f0a39c747525cce5814f41f9bc0245fe8317c7c8f163984bd8dcdea
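For reference, the usecount check during the bisect was just a direct look at the cache table:
SELECT hash, search, usecount, lastused
FROM pgstac.searches
WHERE search = '{"limit": 1}'::jsonb;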
Just for the record, we're experiencing this issue as well.
@mrob95 So far I have not been able to replicate this using your script (or just running in parallel using the following bash script) even cranking the number of parallel processes up to 12.
#!/bin/bash
q="SELECT count(*) from search('{\"limit\":1}');"
psql -c "$q" &
psql -c "$q" &
psql -c "$q" &
psql -c "$q" &
psql -c "$q" &
psql -c "$q" &
If anything, I would guess that the bit of code you found is the culprit. Can you see if you can still replicate it if you swap the search_query function out for the following, which should ensure that the search variable doesn't get overwritten?
CREATE OR REPLACE FUNCTION search_query(
    _search jsonb = '{}'::jsonb,
    updatestats boolean = false,
    _metadata jsonb = '{}'::jsonb
) RETURNS searches AS $$
DECLARE
    search searches%ROWTYPE;
    cached_search searches%ROWTYPE;
    pexplain jsonb;
    t timestamptz;
    i interval;
    _hash text := search_hash(_search, _metadata);
    doupdate boolean := FALSE;
    insertfound boolean := FALSE;
    ro boolean := pgstac.readonly();
BEGIN
    IF ro THEN
        updatestats := FALSE;
    END IF;

    SELECT * INTO search FROM searches
    WHERE hash=_hash;

    search.hash := _hash;

    -- Calculate the where clause if not already calculated
    IF search._where IS NULL THEN
        search._where := stac_search_to_where(_search);
    ELSE
        doupdate := TRUE;
    END IF;

    -- Calculate the order by clause if not already calculated
    IF search.orderby IS NULL THEN
        search.orderby := sort_sqlorderby(_search);
    ELSE
        doupdate := TRUE;
    END IF;

    PERFORM where_stats(search._where, updatestats, _search->'conf');

    IF NOT ro THEN
        IF NOT doupdate THEN
            INSERT INTO searches (search, _where, orderby, lastused, usecount, metadata)
            VALUES (_search, search._where, search.orderby, clock_timestamp(), 1, _metadata)
            ON CONFLICT (hash) DO NOTHING RETURNING * INTO cached_search;
            IF FOUND THEN
                RETURN cached_search;
            END IF;
        END IF;

        UPDATE searches
        SET
            lastused=clock_timestamp(),
            usecount=usecount+1
        WHERE hash=(
            SELECT hash FROM searches
            WHERE hash=_hash
            FOR UPDATE SKIP LOCKED
        );
        IF NOT FOUND THEN
            RAISE NOTICE 'Did not update stats for % due to lock. (This is generally OK)', _search;
        END IF;
    END IF;

    RETURN search;
END;
$$ LANGUAGE PLPGSQL SECURITY DEFINER;
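For what it's worth, the PL/pgSQL behaviour this rewrite guards against is easy to demonstrate in isolation: when an INSERT ... ON CONFLICT DO NOTHING hits a conflict, RETURNING produces no row, and an INTO target on that statement is overwritten with NULLs rather than keeping whatever it held before. A self-contained sketch (the demo table is purely illustrative):
CREATE TEMP TABLE demo (id int PRIMARY KEY, val text);
INSERT INTO demo VALUES (1, 'first');

DO $$
DECLARE
    r demo%ROWTYPE;
BEGIN
    r.val := 'previously populated';
    -- The row already exists, so the insert conflicts, RETURNING yields no row,
    -- and r is replaced with NULLs instead of the existing row's values.
    INSERT INTO demo VALUES (1, 'second')
    ON CONFLICT (id) DO NOTHING
    RETURNING * INTO r;
    RAISE NOTICE 'found=% r.val=%', FOUND, coalesce(r.val, '<null>');
END $$;

DROP TABLE demo;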
cc @drnextgis :point_up:
Thanks, @bitner. I'll give it a look. I can replicate the issue easily following @mrob95's suggestion:
$ go run main.go
2024/04/17 10:53:21 Successfully connected!
2024/04/17 10:53:21 Search query: "SELECT search('{\"limit\": 1}'::jsonb);"
2024/04/17 10:53:21 [0] Result: {"next": null, "prev": null, "type": "FeatureCollection", "context": {"limit": 1, "returned": 0}, "features": []}
2024/04/17 10:53:21 [1] Error: pq: syntax error at or near "LIMIT"
exit status 1
This occurs right after the initial execution. To reproduce it again, I need to clear the searches table first.
@bitner, with the changes you suggested, I can't seem to reproduce the issue, so it looks like it did the trick. However, I have a question. After the initial execution, the value of usecount is 2:
stac=# select * from searches;
hash | search | _where | orderby | lastused | usecount | metadata
----------------------------------+--------------+--------+------------------------+------------------------------+----------+----------
c53f4ebe9b2a50bc2b52fd88a5d503e1 | {"limit": 1} | TRUE | datetime DESC, id DESC | 2024-04-17 08:59:49.10575+00 | 2 | {}
(1 row)
But after the subsequent execution, it increases by 1. Shouldn't it increase by 2?
stac=# select * from searches;
hash | search | _where | orderby | lastused | usecount | metadata
----------------------------------+--------------+--------+------------------------+-------------------------------+----------+----------
c53f4ebe9b2a50bc2b52fd88a5d503e1 | {"limit": 1} | TRUE | datetime DESC, id DESC | 2024-04-17 09:00:04.022792+00 | 3 | {}
(1 row)
Ideally, it should start at 1 and increase by 1 each time the same query comes in. I do have a "skip locked" in the update, so if there is contention, it will just skip updating the usecount. The usecount is really a "nice to have" for getting a general sense of which queries are used the most, and at least for my use cases I'd rather skip updating that column than wait for locks to clear when there are concurrent identical queries.
Thanks for tracking this down!
@bitner, is there a way to prevent the searches table from growing infinitely, or should it be periodically cleaned up manually?
@drnextgis I think that depends on the use case. If you are using with titiler-pgstac or anything else that is using the query hash to look up a query, clearing out the searches table would prevent being able to look those up. If you know that you are generally setting the metadata field when registering a search, you can likely safely get rid of anything that has metadata = {} or is null. The search_wheres table (if you are using context) can safely be cleared out on a cron using something like DELETE FROM search_wheres WHERE statslastupdated < now() - context_stats_ttl;
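For example, something along these lines on a cron would keep both tables bounded (the 30-day retention and the one-day stats interval are placeholders, not pgstac defaults, so adjust them to your context_stats_ttl setting and usage):
-- Drop cached searches that were never registered with metadata and have not
-- been used recently; anything with metadata is kept so hash lookups
-- (e.g. from titiler-pgstac) keep working.
DELETE FROM pgstac.searches
WHERE (metadata IS NULL OR metadata = '{}'::jsonb)
  AND lastused < now() - interval '30 days';

-- Expire stale per-where statistics; they get recalculated on demand.
DELETE FROM pgstac.search_wheres
WHERE statslastupdated < now() - interval '1 day';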