
Bad seqscan time reported

Open ioguix opened this issue 3 months ago • 6 comments

Hi,

Here is an artificial plan (built upon a real private one) showing a wrong time for the seqscan on the relation called "victor_zulu seven_foxtrot":

https://explain.dalibo.com/plan/846ddcg83572da2d#raw

The real seqscan time is Actual Stop Time - Actual Startup Time, so less than a millisecond here, not the 1992 ms reported, which looks like the "Actual Stop Time".

Regards,

ioguix avatar Sep 23 '25 13:09 ioguix

Postgres EXPLAIN times are not start and stop times; they are the time taken to produce the first row and the time taken to produce all rows, averaged across loops. In this case, 1992 ms is correct.
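
To make that reading concrete, here is a minimal sketch of how those two numbers combine; the field names and values are purely illustrative, not pev2's data model or the exact figures from the linked plan:

```ts
// Minimal sketch of reading EXPLAIN ANALYZE's per-node timings.
// Field names and values are illustrative, not pev2's data model.
interface NodeTiming {
  actualStartupTime: number // ms until the node emitted its first row (average per loop)
  actualTotalTime: number   // ms until the node emitted its last row (average per loop)
  actualLoops: number       // number of times the node was executed
}

// Total wall-clock time spent in this node and its children, over all loops.
const totalTimeMs = (n: NodeTiming): number => n.actualTotalTime * n.actualLoops

// A made-up scan that waits a long time before its first row comes out.
const scan: NodeTiming = { actualStartupTime: 1990, actualTotalTime: 2000, actualLoops: 1 }
console.log(totalTimeMs(scan))      // 2000: total time per loop times loops, not a "stop timestamp"
console.log(scan.actualStartupTime) // 1990: time to the first row, not a "start timestamp"
```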

ants avatar Nov 25 '25 22:11 ants

Postgres EXPLAIN times are not start and stop times; they are the time taken to produce the first row and the time taken to produce all rows, averaged across loops. In this case, 1992 ms is correct.

Indeed. But this is a seqscan with no filter. There's no startup/initialization cost for a seqscan: you open the relation, you emit the first row. So where do these 2 s come from?

Maybe the Hash Join was stuck for some reason after hashing the inner relation? Or maybe there was some high IO latency on this very first block?

ioguix avatar Nov 26 '25 22:11 ioguix

Hash Join delays would not get accounted to the seqscan, but the seq scan could be delayed for a variety of reasons: IO, as you mentioned, or bloat where the first visible row is at the end of the table. Also, I'm not so sure this is without a filter, so it might just be filtering out everything at the beginning of the table. There's no point in speculating without the information. But the original premise is just a misunderstanding of how the time measurements work.

ants avatar Nov 26 '25 22:11 ants

Related to #421 ?

pgiraud avatar Nov 27 '25 07:11 pgiraud

Related to #421 ?

Related in that in both cases I don't see PEV doing anything wrong. The raw explain says what it says. The information needed to diagnose the actual problem is not there.

ants avatar Nov 27 '25 07:11 ants

I think we agree that the information needed to diagnose the actual problem is not here. This plan was not artificial; it comes from a real, in-production EXPLAIN ANALYZE. Re-executing the same query didn't reproduce this timing on this seqscan. Several weeks or months later, we won't find out what happened for this specific execution.

But I disagree with the claim that pev2 is right here. I know the given times are first-row/last-row times; we discussed that offline with @pgiraud in the past, so this was not a misunderstanding in my first message. My wording might have been misleading, sorry.

I also know EXPLAIN lacks some information needed to make the time computation exact. But I think we can probably do a better job if we consider the node kind.

How frequent is it to have a huge table that is 99% bloated? How frequent is it to have a seqscan actively running for 2 s, emitting nothing, then suddenly producing 213 rows in 66 ms? I think focusing the developer's attention on this seqscan is a waste of time, and the experience with this query agrees. At the very least, we could flag any seqscan showing this kind of numbers as a potential anomaly or as needing urgent maintenance.
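
A hypothetical heuristic along those lines could look like the sketch below; nothing here is part of pev2, and the names, threshold values, and sample numbers are made up for illustration:

```ts
// Hypothetical flag (not part of pev2): a Seq Scan whose time-to-first-row dominates
// its total time usually points at bloat, a selective filter, or an IO stall rather
// than a uniformly slow scan, so it may deserve the reader's attention.
interface PlanNode {
  nodeType: string
  actualStartupTime: number // ms to first row, per loop
  actualTotalTime: number   // ms to last row, per loop
}

function looksLikeSeqScanAnomaly(node: PlanNode, ratioThreshold = 0.9, minTotalMs = 1000): boolean {
  if (node.nodeType !== 'Seq Scan') return false
  if (node.actualTotalTime < minTotalMs) return false
  return node.actualStartupTime / node.actualTotalTime >= ratioThreshold
}

// Illustrative values only: roughly 2 s before the first row, the rest shortly after.
console.log(looksLikeSeqScanAnomaly({ nodeType: 'Seq Scan', actualStartupTime: 1990, actualTotalTime: 2056 })) // true
```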

But the way pev2 currently computes node timing is wrong. As an example, the Hash Join timing is computed from the last rows fetched from the underlying node (@pgiraud, correct me if I'm wrong). But we know it actually builds a hash table from the inner relation while pulling those inner rows, and that time is probably accounted to the wrong node in pev2. The same probably applies to the Sort node, which starts working before it has pulled every row.
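
For reference, a simplified sketch of the usual exclusive-time scheme (a node's total time minus its children's total times) is below; I'm not claiming this is literally pev2's code, only showing where a hash build ends up under such a scheme:

```ts
// Simplified sketch of a common "exclusive time" computation: the time charged to a
// node is its own total time minus the total time of its direct children.
// This is not pev2's actual code; field names are illustrative.
interface Node {
  nodeType: string
  actualTotalTime: number // ms to last row, per loop
  actualLoops: number
  children: Node[]
}

function exclusiveTimeMs(node: Node): number {
  const selfTotal = node.actualTotalTime * node.actualLoops
  const childrenTotal = node.children.reduce((sum, c) => sum + c.actualTotalTime * c.actualLoops, 0)
  return selfTotal - childrenTotal
}

// Under this scheme, the time spent building the hash table while pulling inner rows
// sits inside the inner Hash node's total time, so it is subtracted from the Hash Join
// and charged to the Hash node instead, which is the attribution question raised above.
```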

Maybe this discussion will end up with a patch for core PostgreSQL to expose the actual, real start time of each node in the plan? …in fact, I played with this idea a few years ago; I should probably get back to the workbench.

ioguix avatar Nov 28 '25 13:11 ioguix