pgaudit icon indicating copy to clipboard operation
pgaudit copied to clipboard

question/idea: query_id and exec_time

Open ardentperf opened this issue 9 months ago • 8 comments

I'd be interested in having the query_id and exec_time available for inclusion in pgaudit records. Before I invest effort in putting together a patch, just wanted to check if there would be any strong pushback on the idea?

ardentperf avatar Feb 27 '25 22:02 ardentperf

Query ID might be interesting but I don't think exec time is appropriate for auditing.

But even for query ID I'd want to get more feedback from people who would find this useful. This is the first time it has been proposed and I'm not sure it would be widely applicable.

dwsteele avatar Apr 11 '25 16:04 dwsteele

I'm not sure if this is possible, there isn't a query id set yet at the point the statement logging hook executes; see https://www.postgresql.org/message-id/CAOBaU_YjMKang71GkxFfZeWDgHPmyUV%3DKtRGGiwDeyNkkJfsPQ%40mail.gmail.com for example.

gregs1104 avatar Apr 11 '25 16:04 gregs1104

I've been interested in query_id (insofar as it's stable) to compare performance, etc, when it comes to pg_stat_statements and be able to track metrics over time related to this. For pg_audit I guess I could see some sort of potential benefit in having some sort of identifier available, but in some ways this just seems like desired logging related to pg_stat_statements output, perhaps. Not certain I'd consider it the purview of pg_audit, but I get the idea.

pgguru avatar Apr 11 '25 16:04 pgguru

The reason query_id is important for auditing is because there are many cases where we can't guarantee the query text doesn't have sensitive data. Logging query_id gives us a little more information (alongside command tag and impacted objects) about the activity we're recording in the audit log - and maybe later we can even come back and get the query text. My thought process was that I might keep periodic snapshots of pg_stat_statements in a table inside the DB (safe place for it from a governance perspective), and while it's not guaranteed I could probably come back and get the query later in many cases if needed. And I can push query_id itself into external audit systems with 100% confidence that no sensitive data will be sent.

The exec_time falls into a similar category to row count, which is already available in pgaudit. My feelings aren't quite as strong on needing exec_time. But I think it might be useful in two cases: first, it's an extra data point that can be sent to a SIEM for anomaly detection (sometimes even ML based). second, even if not directly needed for auditing, in cases where we effectively log every query it would be nice to avoid needing to double the log output volume (with log_min_duration_statement) just to get this info.

The combination of both of these can make a nice setup for cases where everything needs to get logged and there are concerns about sensitive data. The audit log is guaranteed to have nothing sensitive and we don't need to double log volume to get duration.

ardentperf avatar Apr 11 '25 23:04 ardentperf

The reason query_id is important for auditing is because there are many cases where we can't guarantee the query text doesn't have sensitive data.

That makes sense to me.

My thought process was that I might keep periodic snapshots of pg_stat_statements in a table inside the DB (safe place for it from a governance perspective), and while it's not guaranteed I could probably come back and get the query later in many cases if needed.

And this is where it falls apart. Given the additional complex logic that would need to surround this feature I feel like it would be seldom if ever used -- but it would need to be maintained and tested.

Couldn't query_id be calculated from the audit log and then exported to external systems as needed? That way you would be guaranteed to have a mapping to the query text.

The exec_time falls into a similar category to row count, which is already available in pgaudit.

Row count serves a valid audit purpose -- exec_time does not that I can see. I think adding performance metrics to pgAudit sets a bad precedent.

dwsteele avatar Apr 14 '25 18:04 dwsteele

Couldn't query_id be calculated from the audit log

I'm pretty sure this is actually not possible - I think the only way to have query_id in the audit log is to make it available as a field in pgaudit.

ardentperf avatar Apr 26 '25 22:04 ardentperf

query jumbling code is at https://github.com/postgres/postgres/blob/master/src/backend/nodes/queryjumblefuncs.c

one example is that i'm pretty sure it uses OIDs rather than object names, and without knowing if search_path has changed within a session you can't be sure which OIDs to use

ardentperf avatar Apr 26 '25 22:04 ardentperf

I understand that query_id could be useful but the fact that you need to build a dictionary separately to make it work (without also logging the query) is going to make it very niche in my opinion (as in maybe just you). I think I'd need to see more general interest in the feature before I would consider adding it.

I'm definitely a no on exec_time, though. I don't think that has any business in auditing and while I get that it would be more efficient if one tool could to everything it is just not a good direction for us to go.

dwsteele avatar Apr 29 '25 13:04 dwsteele