dbt-core [CT-3361] Improve Docs Parsing Performance

Open peterallenwebb opened this issue 8 months ago • 2 comments

We've received a complaint that dbt-core's parsing performance is surprisingly slow for large docs files. On an M1 Mac, files of around 500K can take over a minute to parse, and parse time appears to increase super-linearly with file size. The critically slow step is the call to extract_toplevel_blocks() on the file contents. The extraction of top-level jinja blocks could likely be made much faster, but this is extremely critical code and we need to preserve existing behavior.
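One way to check the super-linear claim is to time the parser on inputs of increasing size and estimate the exponent k in time ~ size**k. The sketch below is a generic harness, not part of dbt-core: `time_call`, `measure`, and `growth_exponent` are hypothetical helper names, and the function under test is passed in as a plain callable.

```python
import math
import time
from typing import Callable, List, Sequence, Tuple


def time_call(func: Callable[[str], object], text: str) -> float:
    """Return wall-clock seconds for a single call of func(text)."""
    start = time.perf_counter()
    func(text)
    return time.perf_counter() - start


def measure(
    func: Callable[[str], object],
    unit: str,
    repeats: Sequence[int] = (250, 500, 1000, 2000),
) -> List[Tuple[int, float]]:
    """Time func on `unit` repeated n times; return (input_length, seconds) pairs."""
    samples = []
    for n in repeats:
        text = unit * n
        samples.append((len(text), time_call(func, text)))
    return samples


def growth_exponent(samples: List[Tuple[int, float]]) -> float:
    """Estimate k in time ~ size**k from the first and last samples.

    k near 1 suggests linear scaling, k near 2 quadratic scaling.
    """
    (n0, t0), (n1, t1) = samples[0], samples[-1]
    # Guard against a zero reading from a very fast call.
    t0 = max(t0, 1e-9)
    t1 = max(t1, 1e-9)
    return math.log(t1 / t0) / math.log(n1 / n0)


# Assumed usage with dbt-core installed. The import path below is an
# assumption and may differ between dbt versions:
#
#   from dbt.clients.jinja import extract_toplevel_blocks
#   samples = measure(extract_toplevel_blocks, snippet_text)
#   print(growth_exponent(samples))
```

A single timing per size is noisy, so this only gives a rough estimate. Repeating each measurement and taking the minimum would give a steadier number.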

This does not appear to be a regression, but current performance is embarrassingly bad.

To generate a file which reproduces the performance problem, repeat the following snippet a few thousand times in a text file with the .md (markdown) extension, then either add it to a dbt project or call extract_toplevel_blocks() on it directly.

{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}
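For convenience, the repetition described above can be scripted. This is a minimal sketch: `build_repro_text` and `write_repro_file` are hypothetical helper names, and the default of 2000 copies is just one reading of "a few thousand".

```python
from pathlib import Path

# The docs block from this issue, followed by a newline so copies do not run together.
DOCS_SNIPPET = """{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}
"""


def build_repro_text(copies: int = 2000) -> str:
    """Return the snippet repeated `copies` times."""
    return DOCS_SNIPPET * copies


def write_repro_file(path: str, copies: int = 2000) -> int:
    """Write the repeated snippet to `path` and return the number of characters written."""
    text = build_repro_text(copies)
    Path(path).write_text(text, encoding="utf-8")
    return len(text)
```

Calling `write_repro_file("big_docs.md")` (the file name is arbitrary) produces a markdown file that can be dropped into a dbt project, and `build_repro_text()` returns the same content as a string for passing to extract_toplevel_blocks() directly.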

Impact on other teams

None

Needs backport?

Unsure

Nov 08 '23 16:11 peterallenwebb
