dbt_ad_reporting icon indicating copy to clipboard operation
dbt_ad_reporting copied to clipboard

[Bug] Unable to Find Source Schemas on Database that is Different From the Target Database (Multi-Database Setup)

Open KurtDrew opened this issue 1 year ago β€’ 18 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Describe the issue

Selecting the source database using relevant <connector>_database (reddit_ads_database etc.) variables in the dbt_project.yml file is not working as expected when the target database is not the same as the database where the data is stored. In this instance many "Please be aware" messages about being unable to find source tables in the database.

After reviewing the debug level logs and running a few tests, I observed that when running the staging models, a query against the Redshift information_schema.tables table is run but is pointing to the information_schema schema on the target database and not the data source database. Which is why the tables cannot be found.

Relevant error log or model output

CLI Output:

Please be aware: The CAMPAIGN_REPORT table was not found in your REDDIT_ADS schema(s). The Fivetran dbt package will create a completely empty CAMPAIGN_REPORT staging model as to not break downstream transformations. To turn off these warnings, set the `fivetran__remove_empty_table_warnings` variable to TRUE (see https://github.com/fivetran/dbt_fivetran_utils/tree/releases/v0.4.latest#union_data-source for details).

Debugging Logs:

select
        table_catalog as database,
        table_name as name,
        table_schema as schema,
        'table' as type
    from information_schema.tables
    where table_schema ilike 'fivetran_reddit_ads'
    and table_type = 'BASE TABLE'
    union all
    select
      table_catalog as database,
      table_name as name,
      table_schema as schema,
      case
        when view_definition ilike '%create materialized view%'
          then 'materialized_view'
        else 'view'
      end as type
    from information_schema.views
    where table_schema ilike 'fivetran_reddit_ads'
[0m01:29:19.613576 [debug] [Thread-3 (]: SQL status: SUCCESS in 0.133 seconds
[0m01:29:19.614802 [debug] [Thread-3 (]: While listing relations in database=ingestion, schema=fivetran_reddit_ads, found: 

Expected behavior

Expected that the required source tables could be identified in the ingestion database. There is no issue if the target database is the same as the target database.

Possible solution

The redshift__list_relations_without_caching appears to be what is used to list the tables in a schema. Not sure what needs to happen to get it pointing at the correct database.

dbt Project configurations

vars:
  linkedin_ads_database: ingestion
  linkedin_ads_schema: fivetran_linkedin_ads
  pinterest_database: ingestion
  pinterest_schema: fivetran_pinterest_ads
  reddit_ads_database: ingestion
  reddit_ads_schema: fivetran_reddit_ads

Package versions

  • package: dbt-labs/redshift version: 0.9.0
  • package: fivetran/ad_reporting version: [">=1.10.0", "<1.11.0"]

What database are you using dbt with?

redshift

How are you running this dbt package?

dbt Coreβ„’

dbt Version

Core:

  • installed: 1.8.9
  • latest: 1.8.9 - Up to date!

Plugins:

  • redshift: 1.8.1 - Up to date!
  • postgres: 1.8.2 - Up to date!

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • [ ] Yes.
  • [X] Yes, but I will need assistance.
  • [ ] No.

KurtDrew avatar Dec 02 '24 21:12 KurtDrew

Hi @Kryt87 thanks for opening this issue, but sorry to hear you're running into this problem.

Quick question before diving into what could be happening here. What is the name of your source database you would expect the package to be reading from as opposed to the target? In your reddit_ads_database variable I see you have defined ingestion, and then I can see in the log that ingestion was used for the query where no campaign_report table was found.

While listing relations in database=ingestion, schema=fivetran_reddit_ads

Is the ingestion database the source or is there a different database you would like the package to read from?

fivetran-joemarkiewicz avatar Dec 02 '24 22:12 fivetran-joemarkiewicz

Hi Joe, thanks for jumping on this so fast. The source database is called ingestion, where Fivetran is writing to and we want to read from. Our target database, where we want the models to reside is called prod.

KurtDrew avatar Dec 02 '24 22:12 KurtDrew

Thanks for clarifying that! It does seem like from the logs that the package is reading from the ingestion database πŸ€”

Would you be able to confirm that in this case you do see the campaign_report table in the ingestion.fivetran_reddit_ads location in your Redshift destination?

fivetran-joemarkiewicz avatar Dec 02 '24 22:12 fivetran-joemarkiewicz

I can confirm that the campaign_report table exists in ingestion.fivetran_reddit_ads.

However, when I ran a few tests I was able to confirm that the packages was not reading from the ingestion database. In a dev target database I created the dummy table dbt_kurt_drew.fivetran_reddit_ads.test_reddit_ads. I then ran it again, where the debugging logs showed:

[0m03:13:30.546025 [debug] [Thread-1 (]: While listing relations in database=ingestion, schema=fivetran_reddit_ads, found: test_reddit_ads
[0m03:13:30.549303 [warn ] [Thread-1 (]: 

Please be aware: The CAMPAIGN_REPORT table was not found in your REDDIT_ADS schema(s).

This dummy table does not exist in the ingestion database.

KurtDrew avatar Dec 02 '24 22:12 KurtDrew

Thanks for sharing. One last question before diving deeper - do you have any other configs (outside of what you already shared in the issue description) in your dbt_project.yml vars or models sections that could be causing the package to read from the incorrect location?

fivetran-joemarkiewicz avatar Dec 02 '24 22:12 fivetran-joemarkiewicz

Nothing that stands out to me 😞

KurtDrew avatar Dec 02 '24 23:12 KurtDrew

Thanks for checking @Kryt87, I was able to dig into this deeper and believe I may have found the culprit here.

I was able to recreate this error on Redshift. I did notice that this worked as expected on all other destinations which was a bit odd to me. However, for Redshift I saw the exact same behavior as you mentioned. I then went under the hood and was inspecting these lines which are run during the first query step of the Ad Reporting models and found that the proper source database.schema.table was being used. However, the relation.value would always be None and therefore result in the empty table warning you're seeing.

So it seems the package is interpreting the variables properly, but the actual relation.value is not working as expected. I then decided to hardcode a select * from database.schema.value in one of my models and I immediately saw the below error.

Database Error in model stg_reddit_ads__ad_tmp (models/tmp/stg_reddit_ads__ad_tmp.sql)
  Datasharing is not enabled on this cluster

This would make sense why the variables are working as expected, but then the actual relation is failing since datasharing isn't enabled on my end. Do you by chance have datasharing enabled on your end to facilitate this? I imagine if you're able to enable this then you'll be able see success when reading from one database and write to another.

Let me know if this helps address your issue!

fivetran-joemarkiewicz avatar Dec 03 '24 19:12 fivetran-joemarkiewicz

I'm glad you are seeing the same error, I find Redshift can be very particular about this kind of thing. Regarding data sharing, it is enabled for the cluster. However, there are no data shares set up between these two databases and there shouldn't need to be as they are on the same cluster. Data sharing is primarily designed to facilitate sharing data across different Redshift clusters, which can be in different AWS accounts or regions.

When testing, I created both a staging model (view) that sourced the campaign_report and referenced it in an intermediate model. This all worked as expected, even though the source was on the ingestion database and the staging and intermediate models were created on a seperate dev databases.

KurtDrew avatar Dec 03 '24 20:12 KurtDrew

I also was confused why the data sharing was not enabled for the same cluster, but that's the error I was seeing.

It's also strange that you're able to query directly from the source in that staging model yet you don't see the error I saw. Could you share your test staging model view and the log result? I'm wondering if something may be causing issues at the adapter.get_relations level within dbt-redshift itself πŸ€”

fivetran-joemarkiewicz avatar Dec 03 '24 20:12 fivetran-joemarkiewicz

I think you are on the right track with dbt-redshift.

The view is rather simple:

with source as (

    select * from {{ source('fivetran_reddit_ads', 'campaign_report') }}

),

renamed as (

    select * from source

)

select * from renamed

With the source being:


sources:
  - name: fivetran_reddit_ads
    description: ""
    database: ingestion

    tables:
      - name: campaign_report
        description: ""

debug.log - slimmed down debugging logs

KurtDrew avatar Dec 03 '24 21:12 KurtDrew

Thanks so much for sharing @Kryt87. I've been looking into this more, but struggling on my end since I continue to run into the data sharing error. I want to narrow this down even further to the dbt-redshift adapter and see if the adapter.get_relation is the culprit. Would you be able to try and use the below version in your test model. I would be curious if this works as expected, or if it returns the relation "none" does not exist error I end up seeing.

{% set relation = namespace(value="") %}
{%- set relation.value=adapter.get_relation(
    database=source('fivetran_reddit_ads', 'campaign_report').database,
    schema=source('fivetran_reddit_ads', 'campaign_report').schema,
    identifier='campaign_report'
) -%}

with source as (

    select * from {{ relation.value }}

),

renamed as (

    select * from source

)

select * from renamed

Let me know if this ends up working for you. If it doesn't then we may need to open an issue on the dbt-redshift adapter to have those maintainers help address this issue. If not, then we can narrow this issue down to some component in the package not working as expected.

fivetran-joemarkiewicz avatar Dec 04 '24 21:12 fivetran-joemarkiewicz

Maybe I have something set up wrong, but I am getting the following error:

21:37:24    Database Error in model stg_reddit_ads_campaign_report (models/staging/reddit_ads/campaign_report/stg_reddit_ads_campaign_report.sql)
  All the relation names inside should be qualified when creating VIEW WITH NO SCHEMA BINDING.
  compiled code at target/run/data_products/models/staging/reddit_ads/campaign_report/stg_reddit_ads_campaign_report.sql

KurtDrew avatar Dec 04 '24 21:12 KurtDrew

The compiled code is:

  create view "dbt_kurt_drew"."public"."stg_reddit_ads_campaign_report__dbt_tmp" as (
    with source as (

    select * from None

),

renamed as (

    select * from source

)

select * from renamed
  ) with no schema binding;

KurtDrew avatar Dec 04 '24 21:12 KurtDrew

Thanks for sharing @Kryt87, this is promising in the sense that we see in your compiled code that relation.value is returning None as I'm seeing on my end as well. This makes me consider the adapter.get_relations is not working as expected.

However, it is interesting to see the with no schema binding component of the compiled code. Would you be able to share the reason for this addition and how you've added it to be included when materializing models? I'm not entirely sure this is an issue, but wanted to ask since the addition of this does produce a different error from what I'm seeing.

fivetran-joemarkiewicz avatar Dec 04 '24 21:12 fivetran-joemarkiewicz

FYI I have a draft bug report for the dbt-redshift repo that I can send over if we don't find anything obvious with the no schema binding config. I'll hold off submitting that issue until we confirm this is not possibly contributing to this issue.

fivetran-joemarkiewicz avatar Dec 04 '24 23:12 fivetran-joemarkiewicz

We have had some issues with table bindings in the past, so in dbt_project.yml we put bind: false under models:

KurtDrew avatar Dec 05 '24 00:12 KurtDrew

Thanks for sharing @Kryt87. I don't imagine that config is the cause of these issues. I've gone ahead and opened the above issue on the dbt-redshift repo. Let's move our conversation there for the meantime to see if the maintainers of that repo are able to help shed light on what may be going on. Thanks!

fivetran-joemarkiewicz avatar Dec 05 '24 14:12 fivetran-joemarkiewicz

@Kryt87, I don't know if you've been following the linked dbt-redshift issue above but dbt Labs has confirmed that Redshift does not support cross-db querying 😞. The issue has been consolidated within this feature request on the dbt-redshift repo.

As dbt-redshift does not support cross-db querying, we are blocked from allowing this to work as we would expect within the dbt package. Unfortunately we are limited now to the prioritization of the above mentioned feature request within dbt-redshift.

Since we're blocked, I've opted to create PR dbt-labs/dbt-redshift#128 to clarify this limitation in the README for users wanting to use this package across databases and that this operation is limited to the dbt adapter's capabilities. Additionally, I'll need to mark this ticket as won't fix since we unfortunately won't be able to address this issue at the dbt package level. I sincerely apologize this doesn't address your original issue, but I encourage you to post in the dbt-redshift feature request linked above to let the folks at dbt Labs know this is an important feature to you.

fivetran-joemarkiewicz avatar Dec 20 '24 15:12 fivetran-joemarkiewicz