pgstac
Occasional syntax errors from well-formed search queries
This is quite a tricky one to reproduce. We are running pgstac in production and very occasionally get "pq: syntax error at end of input" errors from pgstac search queries. The queries look well formed when we log them.
Looking at the database logs I can see:
STATEMENT: SELECT search($1::jsonb);
ERROR: syntax error at end of input at character 49
QUERY: EXPLAIN (format json) SELECT 1 FROM items WHERE
CONTEXT: PL/pgSQL function where_stats(text,boolean,jsonb) line 103 at EXECUTE
PL/pgSQL function search(jsonb) line 31 at assignment
We are running version 0.8.1, so that line is here - https://github.com/stac-utils/pgstac/blob/v0.8.1/src/pgstac/sql/004_search.sql#L497
I stared at this for a while and couldn't figure out how that where clause could be an empty string, so I started trying to reproduce it locally.
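For context on the error shape: the failing statement in where_stats is a dynamically built EXPLAIN, so if the cached where clause ever arrives as an empty string the statement ends right after WHERE, which is exactly what "syntax error at end of input" means here. A minimal sketch of that failure mode (the DO block and the inwhere variable are just illustrative, not the actual pgstac code):
-- Fails at parse time with "syntax error at end of input", before "items" is even looked up
DO $$
DECLARE
    inwhere text := '';  -- imagine the cached where clause came back empty
BEGIN
    EXECUTE format('EXPLAIN (format json) SELECT 1 FROM items WHERE %s', inwhere);
END $$;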
I ended up running 0.8.2 in a container with:
docker run \
-e POSTGRES_DB=stac \
-e POSTGRES_PASSWORD=admin \
-e POSTGRES_USER=admin \
-e PGUSER=admin \
-e PGPASSWORD=admin \
-e PGDATABASE=stac \
-p 5432:5432 \
ghcr.io/stac-utils/pgstac:v0.8.2 \
postgres -c 'log_statement=all'
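Once the container is up, you can sanity-check the connection and the installed pgstac version before moving on (the connection string just mirrors the docker run flags above, and get_version() is, as far as I know, the pgstac helper for this):
-- psql "postgresql://admin:admin@localhost:5432/stac"
SET search_path TO pgstac, public;
SELECT get_version();  -- should report 0.8.2 for the image above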
Then:
- Install Golang if you don't have it already
- Create a new directory
- Drop the following file into it
main.go
package main

import (
    "database/sql"
    "encoding/json"
    "fmt"
    "log"
    "sync"

    _ "github.com/lib/pq"
)

const (
    host     = "localhost"
    port     = 5432
    user     = "admin"
    password = "admin"
    dbname   = "stac"
)

// docker run -e POSTGRES_DB=stac -e POSTGRES_PASSWORD=admin -e POSTGRES_USER=admin -e PGUSER=admin -e PGPASSWORD=admin -e PGDATABASE=stac -p 5432:5432 ghcr.io/stac-utils/pgstac:v0.8.2 postgres -c 'log_statement=all'

func main() {
    psqlInfo := fmt.Sprintf("host=%s port=%d user=%s "+
        "password=%s dbname=%s sslmode=disable",
        host, port, user, password, dbname)
    db, err := sql.Open("postgres", psqlInfo)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    log.Println("Successfully connected!")

    query := `SELECT search('{"limit": 1}'::jsonb);`
    log.Printf("Search query: %#v", query)

    // Run the same search concurrently from two goroutines.
    wg := sync.WaitGroup{}
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(threadID int) {
            defer wg.Done()
            row := db.QueryRow(query)
            if err := row.Err(); err != nil {
                log.Fatalf("[%d] Error: %v", threadID, err)
            }
            var item json.RawMessage
            if err := row.Scan(&item); err != nil {
                log.Fatalf("[%d] Error: %v", threadID, err)
            }
            log.Printf("[%d] Result: %s", threadID, string(item))
        }(i)
    }
    wg.Wait()
}
go mod init pgstac_repro
go get github.com/lib/pq
go run main.go
This will spin up two goroutines and run the same search simultaneously from both. For me this gives the following pretty reliably:
2024/01/18 22:34:44 Successfully connected!
2024/01/18 22:34:44 Search query: "SELECT search('{\"limit\": 1}'::jsonb);"
2024/01/18 22:34:44 [1] Result: {"next": null, "prev": null, "type": "FeatureCollection", "context": {"limit": 1, "returned": 0}, "features": []}
2024/01/18 22:34:44 [0] Error: pq: syntax error at or near "LIMIT"
exit status 1
One of them succeeds, the other gives a syntax error with the same query.
The postgres logs show
2024-01-18 22:34:44.440 UTC [96] STATEMENT: SELECT search('{"limit": 1}'::jsonb);
2024-01-18 22:34:44.442 UTC [96] LOG: QUERY:
SELECT * FROM items
WHERE TRUE
ORDER BY
LIMIT '2'
2024-01-18 22:34:44.442 UTC [96] CONTEXT: PL/pgSQL function search_rows(text,text,text[],integer) line 82 at RAISE
SQL statement "SELECT jsonb_agg(format_item(i, _fields, hydrate)) FROM search_rows(
full_where,
orderby,
search_where.partitions,
_querylimit
) as i"
PL/pgSQL function search(jsonb) line 69 at SQL statement
2024-01-18 22:34:44.442 UTC [96] STATEMENT: SELECT search('{"limit": 1}'::jsonb);
2024-01-18 22:34:44.443 UTC [96] ERROR: syntax error at or near "LIMIT" at character 77
2024-01-18 22:34:44.443 UTC [96] QUERY:
SELECT * FROM items
WHERE TRUE
ORDER BY
LIMIT '2'
So, a different error but similar behaviour.
Again, I can't see exactly where in the code this could be happening. Rerunning the same program will not trigger it again; you need to either change the query or restart the database to trigger the bug again. I guess this means the problem is related to the query cache/statistics tables.
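Side note for anyone else reproducing this: instead of restarting the database, truncating pgstac's cache tables between runs should also bring the bug back (assuming you don't mind throwing away the cached searches):
-- Reset the query cache so the next search takes the "first time seen" path again
TRUNCATE pgstac.searches;
TRUNCATE pgstac.search_wheres;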
Any help would be appreciated. Let me know if there are any other details I can provide.
I bisected this. On 0.7.7 running the above code succeeds, and looking at the searches table I can see that the usecount for the search is 2 as expected. On 0.7.9 it fails and usecount is only 1. I think this means it was introduced in https://github.com/stac-utils/pgstac/pull/183/commits/194b08b9094830734def3902fc0b8caa21686117#diff-cd06c27f5f0a39c747525cce5814f41f9bc0245fe8317c7c8f163984bd8dcdea
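For reference, the usecount check during the bisect was just a direct look at the cache table:
SELECT hash, search, usecount, lastused
FROM pgstac.searches
WHERE search = '{"limit": 1}'::jsonb;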
Just for the record, we're experiencing this issue as well.
@mrob95 So far I have not been able to replicate this using your script (or just running in parallel using the following bash script) even cranking the number of parallel processes up to 12.
#!/bin/bash
q="SELECT count(*) from search('{\"limit\":1}');"
psql -c "$q" &
psql -c "$q" &
psql -c "$q" &
psql -c "$q" &
psql -c "$q" &
psql -c "$q" &
If anything, I would guess that the bit of code you found is the culprit. Can you see if you can still replicate it if you swap the search_query function out for the following, which should ensure that the search variable doesn't get overwritten?
CREATE OR REPLACE FUNCTION search_query(
    _search jsonb = '{}'::jsonb,
    updatestats boolean = false,
    _metadata jsonb = '{}'::jsonb
) RETURNS searches AS $$
DECLARE
    search searches%ROWTYPE;
    cached_search searches%ROWTYPE;
    pexplain jsonb;
    t timestamptz;
    i interval;
    _hash text := search_hash(_search, _metadata);
    doupdate boolean := FALSE;
    insertfound boolean := FALSE;
    ro boolean := pgstac.readonly();
BEGIN
    IF ro THEN
        updatestats := FALSE;
    END IF;

    SELECT * INTO search FROM searches
    WHERE hash=_hash;

    search.hash := _hash;

    -- Calculate the where clause if not already calculated
    IF search._where IS NULL THEN
        search._where := stac_search_to_where(_search);
    ELSE
        doupdate := TRUE;
    END IF;

    -- Calculate the order by clause if not already calculated
    IF search.orderby IS NULL THEN
        search.orderby := sort_sqlorderby(_search);
    ELSE
        doupdate := TRUE;
    END IF;

    PERFORM where_stats(search._where, updatestats, _search->'conf');

    IF NOT ro THEN
        IF NOT doupdate THEN
            INSERT INTO searches (search, _where, orderby, lastused, usecount, metadata)
            VALUES (_search, search._where, search.orderby, clock_timestamp(), 1, _metadata)
            ON CONFLICT (hash) DO NOTHING RETURNING * INTO cached_search;
            IF FOUND THEN
                RETURN cached_search;
            END IF;
        END IF;

        UPDATE searches
        SET
            lastused=clock_timestamp(),
            usecount=usecount+1
        WHERE hash=(
            SELECT hash FROM searches
            WHERE hash=_hash
            FOR UPDATE SKIP LOCKED
        );
        IF NOT FOUND THEN
            RAISE NOTICE 'Did not update stats for % due to lock. (This is generally OK)', _search;
        END IF;
    END IF;

    RETURN search;
END;
$$ LANGUAGE PLPGSQL SECURITY DEFINER;
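For what it's worth, the PL/pgSQL behaviour this rewrite guards against is easy to demonstrate in isolation: when an INSERT ... ON CONFLICT DO NOTHING hits a conflict, RETURNING produces no row, and an INTO target on that statement is overwritten with NULLs rather than keeping whatever it held before. A self-contained sketch (the demo table is purely illustrative):
CREATE TEMP TABLE demo (id int PRIMARY KEY, val text);
INSERT INTO demo VALUES (1, 'first');

DO $$
DECLARE
    r demo%ROWTYPE;
BEGIN
    r.val := 'previously populated';
    -- The row already exists, so the insert conflicts, RETURNING yields no row,
    -- and r is replaced with NULLs instead of the existing row's values.
    INSERT INTO demo VALUES (1, 'second')
    ON CONFLICT (id) DO NOTHING
    RETURNING * INTO r;
    RAISE NOTICE 'found=% r.val=%', FOUND, coalesce(r.val, '<null>');
END $$;

DROP TABLE demo;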
cc @drnextgis :point_up:
Thanks, @bitner. I'll give it a look. I can replicate the issue easily following @mrob95's suggestion:
$ go run main.go
2024/04/17 10:53:21 Successfully connected!
2024/04/17 10:53:21 Search query: "SELECT search('{\"limit\": 1}'::jsonb);"
2024/04/17 10:53:21 [0] Result: {"next": null, "prev": null, "type": "FeatureCollection", "context": {"limit": 1, "returned": 0}, "features": []}
2024/04/17 10:53:21 [1] Error: pq: syntax error at or near "LIMIT"
exit status 1
This occurs right after the initial execution. To reproduce it again, I need to clear the searches table first.
@bitner, with the changes you suggested, I can't seem to reproduce the issue, so it looks like it did the trick. However, I have a question. After the initial execution, the value of usecount is 2:
stac=# select * from searches;
hash | search | _where | orderby | lastused | usecount | metadata
----------------------------------+--------------+--------+------------------------+------------------------------+----------+----------
c53f4ebe9b2a50bc2b52fd88a5d503e1 | {"limit": 1} | TRUE | datetime DESC, id DESC | 2024-04-17 08:59:49.10575+00 | 2 | {}
(1 row)
But after the subsequent execution, it increases by 1. Shouldn't it increase by 2?
stac=# select * from searches;
hash | search | _where | orderby | lastused | usecount | metadata
----------------------------------+--------------+--------+------------------------+-------------------------------+----------+----------
c53f4ebe9b2a50bc2b52fd88a5d503e1 | {"limit": 1} | TRUE | datetime DESC, id DESC | 2024-04-17 09:00:04.022792+00 | 3 | {}
(1 row)
Ideally, it should start at 1 and increase by 1 each time the same query comes in. I do have a "skip locked" in the update, so if there is contention, it will just skip updating the usecount. The usecount is really a "nice to have" for getting a general sense of which queries are used the most, and at least for my use cases I'd rather skip updating that column than wait for locks to clear when there are concurrent identical queries.
Thanks for tracking this down!
@bitner, is there a way to prevent the searches table from growing infinitely, or should it be periodically cleaned up manually?
@drnextgis I think that depends on the use case. If you are using with titiler-pgstac or anything else that is using the query hash to look up a query, clearing out the searches table would prevent being able to look those up. If you know that you are generally setting the metadata field when registering a search, you can likely safely get rid of anything that has metadata = {} or is null. The search_wheres table (if you are using context) can safely be cleared out on a cron using something like DELETE FROM search_wheres WHERE statslastupdated < now() - context_stats_ttl;
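For example, something along these lines on a cron would keep both tables bounded (the 30-day retention and the one-day stats interval are placeholders, not pgstac defaults, so adjust them to your context_stats_ttl setting and usage):
-- Drop cached searches that were never registered with metadata and have not
-- been used recently; anything with metadata is kept so hash lookups
-- (e.g. from titiler-pgstac) keep working.
DELETE FROM pgstac.searches
WHERE (metadata IS NULL OR metadata = '{}'::jsonb)
  AND lastused < now() - interval '30 days';

-- Expire stale per-where statistics; they get recalculated on demand.
DELETE FROM pgstac.search_wheres
WHERE statslastupdated < now() - interval '1 day';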