dbt-databricks lazy load agate

TLDR: lazy-loading agate speeds up the load time of dbt by about 3.5%. Most instances of agate are unnecessary as dbt only uses it in a select few situations, and most instances of agate can be placed behind TYPE_CHECKING.

Already implemented in dbt-adapters and dbt-core.

https://github.com/dbt-labs/dbt-adapters/pull/126
https://github.com/dbt-labs/dbt-core/pull/9744

Note: usually I perform the following check with each adapter. The following code should return an empty list:

import sys
import dbt.adapters.databricks.impl
print([i for i in sys.modules if "agate" in i])

However, because dbt-databricks has not yet been migrated to the dbt-core 1.8 ecosystem, this check is not successful:

>>> print([i for i in sys.modules if "agate" in i])
['agate.exceptions', 'agate.csv_py3', 'agate.aggregations.base', 'agate.data_types.base', 'agate.data_types.boolean', 'agate.data_types.date', 'agate.data_types.date_time', 'agate.data_types.number', 'agate.data_types.text', 'agate.data_types.time_delta', 'agate.data_types', 'agate.aggregations.all', 'agate.aggregations.any', 'agate.warns', 'agate.utils', 'agate.aggregations.count', 'agate.aggregations.has_nulls', 'agate.aggregations.percentiles', 'agate.aggregations.deciles', 'agate.aggregations.first', 'agate.aggregations.iqr', 'agate.aggregations.median', 'agate.aggregations.mad', 'agate.aggregations.max', 'agate.aggregations.max_length', 'agate.aggregations.max_precision', 'agate.aggregations.sum', 'agate.aggregations.mean', 'agate.aggregations.min', 'agate.aggregations.mode', 'agate.aggregations.quartiles', 'agate.aggregations.quintiles', 'agate.aggregations.variance', 'agate.aggregations.stdev', 'agate.aggregations.summary', 'agate.aggregations', 'agate.mapped_sequence', 'agate.columns', 'agate.computations.base', 'agate.computations.change', 'agate.computations.formula', 'agate.computations.percent', 'agate.computations.percent_change', 'agate.computations.rank', 'agate.computations.percentile_rank', 'agate.computations.slug', 'agate.computations', 'agate.config', 'agate.rows', 'agate.type_tester', 'agate.table.aggregate', 'agate.table.bar_chart', 'agate.table.bins', 'agate.table.column_chart', 'agate.table.compute', 'agate.table.denormalize', 'agate.table.distinct', 'agate.table.exclude', 'agate.table.find', 'agate.table.from_csv', 'agate.fixed', 'agate.table.from_fixed', 'agate.table.from_json', 'agate.table.from_object', 'agate.tableset.aggregate', 'agate.tableset.bar_chart', 'agate.tableset.column_chart', 'agate.tableset.from_csv', 'agate.tableset.from_json', 'agate.tableset.having', 'agate.tableset.line_chart', 'agate.tableset.merge', 'agate.tableset.print_structure', 'agate.tableset.proxy_methods', 'agate.tableset.scatterplot', 'agate.tableset.to_csv', 'agate.tableset.to_json', 'agate.tableset', 'agate.table.group_by', 'agate.table.homogenize', 'agate.table.join', 'agate.table.limit', 'agate.table.line_chart', 'agate.table.merge', 'agate.table.normalize', 'agate.table.order_by', 'agate.table.pivot', 'agate.table.print_bars', 'agate.table.print_html', 'agate.table.print_structure', 'agate.table.print_table', 'agate.table.rename', 'agate.table.scatterplot', 'agate.table.select', 'agate.table.to_csv', 'agate.table.to_json', 'agate.table.where', 'agate.table', 'agate.testcase', 'agate', 'dbt.clients.agate_helper']

So, the actual code change is not blocked by #619, but it won't "work" until #619 is resolved.

Checklist

[x] I have run this code in development and it appears to resolve the stated issue
[x] This PR includes tests, or tests are not required/relevant for this PR
[x] I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

Mar 31 '24 15:03 dwreeves

We have upped to 1.8.0 in 1.8.latest (and 2.0.latest) branches. Is this change needed there as well?

Apr 01 '24 15:04 benc-db

@benc-db This change doesn't do anything in 1.7, only in 1.8. So yes, it should be rebased to that branch (didn't notice it!). I can rebase to that when I get a moment.

Apr 01 '24 16:04 dwreeves

@benc-db This change doesn't do anything in 1.7, only in 1.8. So yes, it should be rebased to that branch (didn't notice it!). I can rebase to that when I get a moment.

Yes, please rebase on 2.0.latest and target it with the PR. Our 1.8.0 compatibility is going to be released as 2.0.0.

Apr 02 '24 17:04 benc-db

Swapped to 2.0.latest.

Apr 09 '24 01:04 dwreeves

Awesome, let me just verify the integration tests pass. Appreciate the work.

Apr 09 '24 16:04 benc-db

I see that the utils tests are failing. I'm looking into it.

Apr 09 '24 22:04 benc-db

So, I'm digging into this and not understand why this suddenly breaks type comparison tests, but I'm wondering, what's the point? In pretty much every execution of dbt, we end up calling get_relations_without_caching which loads agate anyway.

Apr 09 '24 23:04 benc-db

That's interesting, I did not notice that. In the dbt-adapters library for 1.8 beta, the list_relations_without_caching doesn't appear to use agate. I'm not aware of if this was different in a previous version or if this is specific to dbt-databricks. https://github.com/dbt-labs/dbt-adapters/blob/7392debf7ff1aa5d58c0fe26eb76ac8ade4fe15b/dbt/adapters/sql/impl.py#L183

I see why the tests are failing! I accidentally used dbt.clients.agate_helper instead of dbt_common.agate_helper.

Do note: I believe dbt_common.agate_helper is deprecating in 1.8.0, and it is switching over to dbt.clients.agate_helper. So this isn't really an error per se of my code for dbt 1.8, as the dbt.clients import will work on the 1.8.0b1 release of dbt-core. You can see dbt.client.agate_helper in other dbt adapters packages (that's actually where I copy+pasted this from).

The why:

Essentially that dbt-core takes a while to spin up, and consumes a lot of resources when it does. This is part of a broader effort to improve dbt's efficiency with some free wins.

Dagster has a pretty convoluted way around this (essentially, they run a single dbt instance in a DAG and emit events on creation of each node to emulate atomicity of tasks).
Airflow's main implementation of dbt, i.e. Cosmos, does not have anything like that. So Airflow + dbt users suffer the most under the current situation of high resource utilization. (I'll admit, I am part of this camp, hence my motivation for this. dbt-core on small sized Airflow Celery workers tend to SIGKILL a lot.)

dbt 1.8 trends in the direction of consuming more resources than predecessor minor versions. Here is memory utilization of dbt --help for 1.7 and 1.8.0b1:

Script is here: https://gist.github.com/dwreeves/f945a846f1d0e07b792fef407294e387

Agate is only really needed in a few cases, is my understanding:

dbt docs generation
dbt seeds
dbt "unit tests" (new feature in 1.8)

And Agate, on the slower of my 2 computers, takes half a second to load. And it consumes a decent chunk of the total memory that dbt uses:

(note: I'm aware this is not a perfect test since agate will share many sys.modules with dbt-core, but a majority of that is just agate.)

Lazy-loading agate, as well as another small change I've already made to dbt-core 1.8.0 (should be released in b2), in combination will counteract all of the memory utilization increase that we see in 1.8.0b1 from other sources, and should bring it back to the levels we see in 1.7, and should slightly speed up dbt on a cold-start on slower computers.

This normally isn't much of a challenge to implement, all told, hence why I am implementing it. Resolving the core of the matter would require significant rewrites to dbt that are outside the scope of what I can and am willing to contribute. Basically, I'm just putting a bandaid on some issues...

Apr 09 '24 23:04 dwreeves

I'm looking at 2.0.and I don't think we need those in list_relations_without_caching...I think this was a rebase issue:

        return [
            self.Relation.create(
                database=database,
                schema=schema,
                identifier=name,
                type=self.Relation.get_relation_type(kind),
            )
            for database, schema, name, kind in results.select(  # type: ignore[attr-defined]
                ["database_name", "schema_name", "name", "kind"]
            )
        ]

Apr 09 '24 23:04 benc-db

However, even after fixing the import, my functional tests were failing for '>= cannot be used to compare a tuple with an int' or something like that, for the date macro tests.

Apr 09 '24 23:04 benc-db

 _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <dbt.clients.jinja.MacroGenerator object at 0x1077a7e50>

    @contextmanager
    def exception_handler(self) -> Iterator[None]:
        try:
            yield
        except (TypeError, jinja2.exceptions.TemplateRuntimeError) as e:
>           raise CaughtMacroErrorWithNodeError(exc=e, node=self.macro)
E           dbt_common.exceptions.macros.CaughtMacroErrorWithNodeError: Compilation Error in model test_date_spine (models/test_date_spine.sql)
E             '>=' not supported between instances of 'tuple' and 'int'
E             
E             > in macro dateadd (macros/utils/dateadd.sql)
E             > called by macro default__date_spine (macros/utils/date_spine.sql)
E             > called by macro date_spine (macros/utils/date_spine.sql)
E             > called by model test_date_spine (models/test_date_spine.sql)

Apr 10 '24 00:04 benc-db

Thank you for that, and thank you for your patience on this relatively small change that I managed to bork. I see what the issue is, I accidentally did this:

Sorry to waste your time on this. Let me get to this in a few minutes.

Apr 10 '24 00:04 dwreeves

Alright, let me pull latest and see if the tests pass

Apr 10 '24 15:04 benc-db

Please take a look at the lint failure; fix that and we're good to merge.

Apr 10 '24 16:04 benc-db

🤦 Ahhh.

Apr 16 '24 02:04 dwreeves

Hm, I guess we need to type: ignore the dbt.clients.agate_helper library? 🤨

Apr 16 '24 23:04 dwreeves

@dwreeves there have been a lot of changes (in fact, 2.0.latest is no longer being developed after my discussion with dbt labs today), so I'm going to try to replicate this against the 1.8.latest branch and get it working.

May 10 '24 21:05 benc-db

@dwreeves closing in favor of https://github.com/databricks/dbt-databricks/pull/661

May 10 '24 22:05 benc-db