dbt-databricks

Cannot run dbt docs generate with JSON logs

Open robertf-b opened this issue 2 years ago • 8 comments

Describe the bug

When I run dbt docs generate with JSON logs enabled, I receive an error: Encountered an error while generating catalog: Object of type DatabricksRelation is not JSON serializable. This occurs with dbt-databricks 1.1.0 in all three environments I tried: locally (Windows), in Docker (Linux), and in the preview Databricks dbt task type. It does not occur in earlier versions, and it does not occur with the default log format.

Steps To Reproduce

  1. Run dbt --log-format json docs generate

Expected behavior

Doc site generates correctly, with JSON logs.

Screenshots and log output

{"code": "E044", "data": {}, "invocation_id": "4240ec3d-ef2f-4772-8540-d4d522a8c717", "level": "info", "log_version": 2, "msg": "Building catalog", "pid": 1628, "thread_name": "MainThread", "ts": "2022-05-25T14:22:49.779873Z", "type": "log_line"}
{"code": "Z046", "data": {"log_fmt": null, "msg": "Encountered an error while generating catalog: Object of type DatabricksRelation is not JSON serializable"}, "invocation_id": "4240ec3d-ef2f-4772-8540-d4d522a8c717", "level": "warn", "log_version": 2, "msg": "Encountered an error while generating catalog: Object of type DatabricksRelation is not JSON serializable", "pid": 1628, "thread_name": "MainThread", "ts": "2022-05-25T14:22:49.783320Z", "type": "log_line"}
{"code": "E041", "data": {"num_exceptions": 1}, "invocation_id": "4240ec3d-ef2f-4772-8540-d4d522a8c717", "level": "error", "log_version": 2, "msg": "dbt encountered 1 failure while writing the catalog", "pid": 1628, "thread_name": "MainThread", "ts": "2022-05-25T14:22:49.792754Z", "type": "log_line"}

System information

Windows:

Core:
  - installed: 1.1.0
  - latest:    1.1.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.1.0 - Up to date!
  - spark:      1.1.0 - Up to date!

The operating system you're using: Windows/Linux/Databricks

The output of python --version: Windows: Python 3.9.0

Additional context

robertf-b, Jun 17 '22 09:06

@ueshin @allisonwang-db I can repro this on my laptop, too.

bilalaslamseattle, Jun 17 '22 09:06

@jtcohen6 Could you help take a look at this issue?

It seems that neither SparkRelation nor even BaseRelation is JSON serializable.

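>>> import json
>>> from dbt.adapters.base import BaseRelation
>>> from dbt.adapters.spark.relation import SparkRelation
>>> json.dumps(SparkRelation.create(schema='a', identifier='b'))
Traceback (most recent call last):
...
TypeError: Object of type SparkRelation is not JSON serializable

>>> json.dumps(BaseRelation.create(schema='a', identifier='b'))
Traceback (most recent call last):
...
TypeError: Object of type BaseRelation is not JSON serializable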

Thanks.

ueshin, Jun 17 '22 18:06

It seems that 1.1 tries to log more than 1.0 did, and that is what breaks the command.

ueshin, Jun 17 '22 18:06

I'm able to reproduce this locally with the latest dbt-databricks + dbt-core. I'm trying to figure out where in the methods called by docs generate a DatabricksRelation is being passed directly into a log message, and then formatted into JSON, without any intermediate serialization steps.
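
To illustrate why only the JSON log format is affected, here is a minimal sketch that does not use dbt at all. FakeRelation, format_text, and format_json are hypothetical names made up for this example; they are not dbt's classes or logger functions. The sketch assumes a log event whose data carries a raw object: the text path only needs a string representation, while the JSON path needs every value to be a JSON-native type.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeRelation:
    """Hypothetical stand-in for DatabricksRelation; not the real class."""
    schema: str
    identifier: str


def format_text(msg: str, data: dict) -> str:
    # Text formatting only needs a string representation of each value.
    return f"{msg} {data}"


def format_json(msg: str, data: dict) -> str:
    # JSON formatting requires every value in data to be JSON-native.
    return json.dumps({"msg": msg, "data": data})


# A raw object placed in the event data with no intermediate serialization.
event_data = {"relation": FakeRelation(schema="a", identifier="b")}

# The text path succeeds.
text_line = format_text("Building catalog", event_data)

# The JSON path raises TypeError, as in the report above.
try:
    format_json("Building catalog", event_data)
    json_error = None
except TypeError as exc:
    json_error = str(exc)

print(text_line)
print(json_error)
```

If the real cause is of this shape, the object has to be converted somewhere between the call site and the JSON formatter.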

BaseRelation (and therefore SparkRelation and DatabricksRelation) is not directly JSON serializable, but it can be converted to a dictionary that is, via the to_dict() method inherited from the parent class dbtClassMixin:

>>> import json
>>> from dbt.adapters.base import BaseRelation
>>> json.dumps(BaseRelation.create(schema='a', identifier='b').to_dict())
'{"path": {"database": null, "schema": "a", "identifier": "b"}, "type": null, "quote_character": "\\"", "include_policy": {"database": true, "schema": true, "identifier": true}, "quote_policy": {"database": true, "schema": true, "identifier": true}, "dbt_created": false}'
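
One generic way to apply that conversion at serialization time is the default hook of json.dumps, which is called only for objects the encoder cannot handle. The sketch below is an illustration under assumptions, not the fix that went into dbt-core: FakeRelation and to_jsonable are hypothetical names, and the only thing borrowed from the real classes is the existence of a to_dict() method.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class FakeRelation:
    """Hypothetical stand-in for a relation class that exposes to_dict()."""
    schema: str
    identifier: str

    def to_dict(self) -> dict:
        return asdict(self)


def to_jsonable(obj: Any) -> Any:
    """Fallback for json.dumps, invoked only for values it cannot encode."""
    if hasattr(obj, "to_dict"):
        return obj.to_dict()
    # Last resort: fall back to the string form instead of raising.
    return str(obj)


event = {"msg": "Building catalog", "data": {"relation": FakeRelation("a", "b")}}

# Without default=..., this call would raise TypeError.
line = json.dumps(event, default=to_jsonable)
print(line)
```

A hook like this keeps call sites unchanged, at the cost of silently stringifying anything that lacks to_dict().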

jtcohen6, Jun 17 '22 19:06

I found a simple fix for this, but I'd be curious to get @nathaniel-may's take on it before merging.

jtcohen6, Jun 17 '22 19:06

@jtcohen6 any update on this? I'd love to close this out in the next point release.

bilalaslamseattle, Jul 21 '22 09:07

@bilalaslamseattle Thanks for flagging this again.

We've seen multiple issues in this category, across multiple adapters, and we think there exists a general-purpose solution that will be the right move longer-term: https://github.com/dbt-labs/dbt-core/issues/5436

The work for that is definitely on our radar. If it appears that the general-purpose resolution will be too complex, we can put a one-off patch for this in dbt-spark, to unblock the user here.

(cc @nathaniel-may)

jtcohen6, Jul 22 '22 12:07

Hiya people on the thread.

Per this core PR, this JSON serialization bug should be solved across all adapters. There are many layers of indirection in the logger call stack, so finding the root cause of this error took us some concerted effort. I also added our backlog tags to the PR, so in theory you should be able to "seamlessly" integrate the fix into your env on the new release. (It's also live in main.)

I'd love to close this if (🤞) people report things working here.

VersusFacit, Oct 20 '22 06:10