splink icon indicating copy to clipboard operation
splink copied to clipboard

[FEAT] Diagnostics mode to get summary breakdowns of SQL execution

Open ThomasHepworth opened this issue 1 year ago • 1 comments

Is your proposal related to a problem?

See the following feature request as another point of reference - https://github.com/moj-analytical-services/splink/issues/1472.

Describe the solution you'd like

It would be beneficial to generate summary statistics that encompass the following aspects:

  • Duration of execution for each Common Table Expression (CTE): Currently, this information is embedded within the debug_mode function, but it could be expanded and enhanced to measure both the total execution time and the time taken for sub-steps.
  • Amount of data examined during each step: This functionality is primarily feasible within dialects that provide EXPLAIN ANALYZE capabilities, such as DuckDB, Athena, and Spark. It's worth noting that we have the option to estimate the cardinality of our queries using the query planner. In other words, we can employ our database backends to predict the number of rows that a particular query will generate, all without executing the query itself.

Furthermore, it would be beneficial to leverage some of this work to highlight the following aspects:

  • The impact of different approaches on the overall volume of data generated at each query step, pinpointing potential bottlenecks for users.
  • The duration of execution for various queries: This aspect also provides us with the capability to scrutinize the computational time required for different comparisons, which has been a topic of interest for some time.

Initially, we can implement this by incorporating the capability to access EXPLAIN ANALYZE for SQL execution as a class attribute. Subsequently, we can refine this functionality to generate more informative summaries for users. This will help them to isolate specific bottlenecks and areas of concern, without needing to read the EXPLAIN tables.

I will continue to update this issue as additional use cases for these features become apparent.

ThomasHepworth avatar Oct 24 '23 13:10 ThomasHepworth

Robin's initial thoughts on this:

Robin Linacre - [14:39]

Yeah, so I’ve been wanted to implement something in this area for a while.

I think before implementing anything we would want to write down what it looks like from the uesrs point of view

Personally I think we might want to start with something very basic e.g. that the linker has a sql_history.

To begin with this literally could store a list of the SQL that’s been executed. But then we could begin to add to this - things like statistics etc.

Each entry in the sql history could be something like:

{"raw_sql:: sql,
"physical_name":
"template_name":
}

ThomasHepworth avatar Oct 24 '23 19:10 ThomasHepworth