Disk IOPS growing out of control as event tables grow
We run our database on RDS (AWS) using their SSD storage option. As described here, I/O operations are a budgeted resource with a slowly-replenishing credit balance. If you run out of IOPS credits, your performance tanks.
Recently our Postgres read IOPS started growing out of control. I wasn't sure how to investigate this (tips are welcome!), but we eventually truncated hdb_catalog.event_invocation_logs and deleted all delivered or errored rows from hdb_catalog.event_log. Read IOPS dropped drastically.
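For reference, the cleanup we did was roughly the following (a sketch from memory; the exact predicates depend on your hdb_catalog schema version, so take a backup and run this in a maintenance window):

-- Rough sketch of the manual cleanup described above.
-- Truncate the invocation logs first, since they reference event_log rows.
BEGIN;
TRUNCATE hdb_catalog.event_invocation_logs;
DELETE FROM hdb_catalog.event_log
 WHERE delivered = 't' OR error = 't';
COMMIT;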
I think this may be a documentation bug; there's no expectation that Hasura's event triggers perform well once a core table reaches >1M rows, but it would be nice for the documentation to discuss this specific risk somewhere.
I'm also wondering what specific mechanism is causing this huge volume of read I/O. In RDS Performance Insights we saw that the following query was spending a lot of time waiting on DataFileRead, which suggests it may have been responsible for much of it:
UPDATE hdb_catalog.event_log
SET locked = ?
WHERE id IN ( SELECT l.id
FROM hdb_catalog.event_log l
WHERE l.delivered = ? and l.error = ? and l.locked = ?
and (l.next_retry_at is NULL or l.next_retry_at <= now())
and l.archived = ?
ORDER BY created_at
LIMIT $1
FOR UPDATE SKIP LOCKED )
RETURNING id, schema_name, table_name,
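We found this via Performance Insights; on a plain Postgres instance a similar picture can usually be obtained from pg_stat_statements, assuming that extension is enabled. A rough sketch:

-- Sketch: surface the statements reading the most blocks from disk/cache.
-- Requires the pg_stat_statements extension to be enabled on the instance.
SELECT query,
       calls,
       shared_blks_read
  FROM pg_stat_statements
 ORDER BY shared_blks_read DESC
 LIMIT 10;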
I'm thinking maybe the problem is lock contention?
I think inefficient indexes on the hdb_catalog.event_log table are causing more disk IOPS than necessary. We have fixed this in main recently and it should be out in the next release.
That's very exciting. Out of curiosity, could you link the specific changeset? I took a quick look at the indexes and didn't see anything alarming.
Here it is: https://github.com/hasura/graphql-engine/commit/01d8a377063ae97c09f4668c71474d99fed6b3ca
I wouldn't recommend running EXPLAIN ANALYZE on the update query against live data, as it would mutate rows, but you could try it on a test machine. With the older index you would see a lot of rows read and then filtered out, because the index isn't very selective. With the new index, the query reads a bounded number of rows even as the table grows without bound.
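If someone does want to look at the plan without persisting the write, one option is to wrap the statement in a transaction and roll it back. This is only a sketch: the values substituted for the ?/$1 placeholders are guesses, and it still takes row locks while it runs, so use a disposable copy of the data.

-- Sketch: inspect the plan and block reads of the locking UPDATE, then discard its effects.
-- Run against a test copy only; the parameter values below are illustrative stand-ins.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE hdb_catalog.event_log
   SET locked = 't'
 WHERE id IN (SELECT l.id
                FROM hdb_catalog.event_log l
               WHERE l.delivered = 'f' AND l.error = 'f' AND l.locked = 'f'
                 AND (l.next_retry_at IS NULL OR l.next_retry_at <= now())
                 AND l.archived = 'f'
               ORDER BY created_at
               LIMIT 100
                 FOR UPDATE SKIP LOCKED);
ROLLBACK;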
Has this been resolved? Are there any new updates on this?
@iangabrielsanchez Yeah, new indexes were added in v2.0, so this issue can be considered resolved.
If you have any performance problems with event triggers or table size, please give some more details: how many rows are in your hdb_catalog.event_log table, how many concurrent events you are generating, etc. Also, it is recommended that you regularly clean up your event tables, either manually or automatically: https://hasura.io/docs/latest/event-triggers/clean-up/
@tirumaraiselvan we have a relatively large events table of roughly 20M records. We enabled the automatic cleanup, but because there is no index on the created_at column, the query takes a long time and eventually times out. I understand why the partial index was added in https://github.com/hasura/graphql-engine/commit/01d8a377063ae97c09f4668c71474d99fed6b3ca (to allow Hasura to efficiently query for unprocessed events), but shouldn't there also be an index on created_at so the automatic cleanup can run more efficiently? For reference, this is the query that Hasura runs to perform the auto cleanup:
SELECT id FROM hdb_catalog.event_log
WHERE ((delivered = true OR error = true) AND trigger_name = $1)
AND created_at < now() - interval '720 hours'
AND locked IS NULL
LIMIT $2
There is no usable index for that query (besides the one on trigger_name), which results in extremely slow performance. Likewise, when trying to view the latest events in the Hasura console, loading the results is too slow to be usable.
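For what it's worth, an index shaped like the one below would cover that cleanup query. The name is made up, and as far as I know hand-adding indexes to hdb_catalog tables isn't officially supported and they may be dropped by future Hasura catalog migrations, so treat this as a sketch of what we're asking for rather than a fix:

-- Hypothetical partial index matching the auto-cleanup query's predicates.
-- Name is illustrative; hand-modifying hdb_catalog may be unsupported and
-- could be undone by future Hasura catalog migrations.
CREATE INDEX CONCURRENTLY event_log_cleanup_idx
    ON hdb_catalog.event_log (trigger_name, created_at)
 WHERE (delivered = 't' OR error = 't') AND locked IS NULL;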