snowflake-sqlalchemy
snowflake-sqlalchemy copied to clipboard
SNOW-948287: Best practice for doing cross database joins using CORE
I am literally copying this from the ORM question because it's virtually identical but the ORM approach doesn't work for CORE.
Question
Hi there, in Snowflake we often have multiple databases, with their own set of schemas, and tables inside those schemas. Ideally data shouldn't have to move between these databases nor schemas, but in practice they often do.
One use case is when there might be valuable data between two different databases. In my example, I'm going to reference database A
and database B
which has schemas ant
and baboons
, which has tables apples
and bananas
, respectively.
Through a direct query using Snowflake SQL, you could implement a join like
select
*
from A.ant.apples a
join B.baboons.bananas b
on a.id = b.id
How can you implement a cross database join like the above example using a single session through the CORE interface? My understanding is that a session can only point towards one database. I'm not creating an app and the tables already exist so ideally this would all happen through SQLAlchemy MetaData object reflection. Below are some example of what I've tried.
The ideal approach would look like the following:
import os
from sqlalchemy import create_engine, MetaData, text, Table
from snowflake.sqlalchemy import URL
from sqlalchemy.sql import select, and_, or_, case, func, insert
# Note: no database, schema and warehouse are being selected at this time. They're not needed yet.
engine = create_engine(URL(
account="my_awesome_account",
user="my_user",
password="my_password",
role="my_awesome_role",
),
)
metadata = MetaData()
metadata.reflect(engine, schema='a.ant', only=['apples',]) # doesn't work
metadata.reflect(engine, schema='a.ant', only=['a.ant.apples',]) # doesn't work
metadata.reflect(engine, only=['a.ant.apples',]) # doesn't work
Table("apples", metadata, schema='a.ant', autoload_with=engine) # doesn't work
Table("a.ant.apples", metadata, schema='a.ant', autoload_with=engine) # doesn't work
I've tried all types of combinations of cases with the the database name and schema names too. None of them work. What I find silly about the above is I often get StackTraces saying I have to provide a fully qualified name since I'm not declaring a database. But I AM providing a fully qualified name. And not explicitly choosing a database is the point. I need to pull data from tables in multiple databases so the reflected tables NEED TO BE FULLY QUALIFIED.
Can this be done through the CORE interface without dropping to SQL? If so, what's the best practice in doing this?
Please answer these questions before submitting your issue. Thanks!
-
What version of Python are you using?
Python 3.9.6
-
What operating system and processor architecture are you using?
macOS-14.0-arm64-arm-64bit
-
What are the component versions in the environment (
pip freeze
)?appnope==0.1.3 asn1crypto==1.5.1 asttokens==2.4.0 backcall==0.2.0 certifi==2023.7.22 cffi==1.16.0 charset-normalizer==3.3.0 comm==0.1.4 cryptography==41.0.4 debugpy==1.8.0 decorator==5.1.1 exceptiongroup==1.1.3 executing==2.0.0 filelock==3.12.4 idna==3.4 importlib-metadata==6.8.0 ipykernel==6.25.2 ipython==8.16.1 jaraco.classes==3.3.0 jedi==0.19.1 jupyter_client==8.4.0 jupyter_core==5.4.0 keyring==24.2.0 matplotlib-inline==0.1.6 more-itertools==10.1.0 nest-asyncio==1.5.8 numpy==1.26.1 oscrypto==1.3.0 packaging==23.2 pandas==2.0.3 parso==0.8.3 pexpect==4.8.0 pickleshare==0.7.5 platformdirs==3.11.0 prompt-toolkit==3.0.39 psutil==5.9.6 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==10.0.1 pycparser==2.21 pycryptodomex==3.19.0 Pygments==2.16.1 PyJWT==2.8.0 pyOpenSSL==23.2.0 python-dateutil==2.8.2 pytz==2023.3.post1 pyzmq==25.1.1 requests==2.31.0 six==1.16.0 snowflake-connector-python==3.3.1 snowflake-sqlalchemy==1.5.0 sortedcontainers==2.4.0 SQLAlchemy==1.4.49 sqlparse==0.4.4 stack-data==0.6.3 tomlkit==0.12.1 tornado==6.3.3 traitlets==5.11.2 typing_extensions==4.8.0 tzdata==2023.3 urllib3==1.26.18 wcwidth==0.2.8 zipp==3.17.0
-
What did you do?
The above example should be sufficient.
-
What did you expect to see?
Ideally, the fully qualified table objects, including the database, would have been instantiated as SQLAlchemy Table objects. In this way, I could then create a SQLAlchemy
select
statement such asselect(a).select_from(a.join(b, on=[a.c.col1 == b.c.col1])
and the corresponding SQL generated would beSELECT A.ant.apple.* FROM A.ant.apple JOIN B.baboons.bananas ON A.ant.apple.col1 = B.baboons.bananas.col1;
.
Instead, the reflection step fails forcing me to choose a database which then prevents me from joining tables in different databases.
-
Can you set logging to DEBUG and collect the logs?
import logging import os for logger_name in ['snowflake.sqlalchemy', 'snowflake.connector']: logger = logging.getLogger(logger_name) logger.setLevel(logging.DEBUG) ch = logging.StreamHandler() ch.setLevel(logging.DEBUG) ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s')) logger.addHandler(ch)
I'm happy to provide this info but in private. It's on a company computer with company backend stuff that's not for the public eye.
Has anyone had a chance to review?
Jim - Do you have access to this: https://snowflakecomputing.atlassian.net/wiki/spaces/SKE/pages/1065292364/JOIN+Across+Databases+SYNONYM
This article provides alternatives for How To get around this using session variables and optionally identifier()
Details The need is to join data from two databases (which are environment specific) such as:
SELECT cd.cust_id, ff.measure_nbr, ... FROM ent_dev_edw.bas.cust_d cd JOIN lcl_dev_wrk.aml.fact_f ff ON ff.cust_sk = cd.cust_sk WHERE ... Use Variables
set dbEntNm = 'ent_dev_edw.'; set dbLclNm = 'lcl_dev_wrk.'; set cdNm = $dbEntNm || 'bas.cust_d'; set ffNm = $dbLclNm || 'amt.fact_f';
... FROM $cdNm cd -- or use this syntax FROM TABLE($cdNm) or this syntax FROM IDENTIFIER($cdNm) JOIN $ffNm ff ON ff.cust_sk = cd.cust_sk ... Honestly, it's a pain when you want flexibility for a lot of tables.
We do have a Synonym jira ticket out there and I will add your company to it.
-Pete
I hacked my way around it but because of how our org is set up, I'm unable to run the test suite to verify it doesn't break anything. If someone could help out, that'd be wonderful. All edits were made to snowdialect.py
, attached here.
snowdialect.py.zip
Bottom line, there are two sections in this file around reflection. I added a bit of code to check to see if the schema has a .
in it. And if so, split the schema into database
and schema
. The information_schema
is also being fully qualified in the code by changing the subsequent query to an f-string and including the database. Just run git diff on the attached zip file to see all the changes.