data-diff
data-diff copied to clipboard
New database driver: Exasol
I'm happy to introduce a new db-driver for Exasol
Development & Testing
Local Dev Setup
- Bring up the Exasol Docker container:
docker-compose up -d exasol
- Download Sample Data
curl https://datafold-public.s3.us-west-2.amazonaws.com/1m.csv -o dev/ratings.csv
- Run python script to import CSV data into Exasol
poetry run python3 dev/exasol/_exasol_import_csv.py
- Run data-diff
poetry run python3 -m data_diff exasol://sys:exasol@localhost/DATADIFF rating exasol://sys:exasol@localhost/DATADIFF rating_del1 --verbose
Unittest
- Bring up mysql and postgres
docker-compose up -d mysql postgres
- Set env variable for Exasol URI
export DATADIFF_EXASOL_URI=exasol://sys:exasol@localhost/DATADIFF
- Run unittests
poetry run unittest-parallel -j 16
Note on Exasol Docker: There is currently an issue with Exasol and docker-compose which could lead to errors when restarting the existing container. If this happens you need to run the script in ./dev/exasol/reset_ip.sh
to which will reset the IP config for Exasol and fix the problem.
Looks like there's a version collision on numpy, with one of the other optional drivers. Is there a way to avoid this nested requirement?
Also, once it's stable, please rebase (git rebase -i
) the commits down to a few. (i.e. without cleanup and fix commits)
@erezsh I fixed the numpy version for Python 3.7.1 in poetry.lock and cleaned up my commits.
@erezsh Thanks for your indications! I fixed the issues and ran black on all files.
@nklsw The tests fail unless I import the csv, since the schema needs to be created first. But also, when using the test connection string, calling "create schema" fails with a 'schema not found' error. The only way forward I can see is to connect without a schema (like in the import) and create the schema, before doing connect_to_uri()
, but then we can't use the URI, which means the user will have to specify another env variable for a URI without a schema, which is a little awkward.
Can you think of an elegant solution to this problem?
@erezsh The problem only concerns testing, right?
At first glance, I also only see the possibility to create the schema before doing connect_to_uri()
. A semi-elegant way would be to place this in in tests/common.py
:
if CONN_STRINGS[db.Exasol]:
import dsnparse
import pyexasol
dsn = dsnparse.parse(CONN_STRINGS[db.Exasol])
exa = pyexasol.connect(dsn=dsn.host, user=dsn.user, password=dsn.password)
exa.execute(f"CREATE SCHEMA IF NOT EXISTS \"{dsn.paths[0]}\"")
exa.commit()
Although it is not optimal to use a different connection logic, in the tests the default connection via connect_to_uri()
would still be used normally and the user only has to create a single env variable for the URI. What do you think?
I ended up adding this:
if CONNS[db.Exasol]:
kw = dict(CONNS[db.Exasol].kwargs)
kw['driver'] = 'exasol'
kw['host'] = kw.pop('dsn')
schema = kw.pop('schema')
conn_without_schema = db.connect.connect(kw)
conn_without_schema.query(f'CREATE SCHEMA IF NOT EXISTS {schema}', None)
There were a few things I had to fix, see PR #203
The only problem left that I wasn't sure how to fix are these lines:
if target_db is db.Exasol and source_db is db.PostgreSQL and source_type == "timestamp with time zone":
values_in_source = ((a, b.replace(tzinfo=None)) for a, b in values_in_source)
This doesn't look right. Maybe you need to set the timezone to 0 in Exasol itself?
The problem is, that _insert_to_table()
would try to insert something like 2022-08-12 12:23:45+0000
but Exasol only accepts YYYY-MM-DD HH24:MI:SS.FF6
for inserting timestamps. I'm currently not sure, how to deal with it in a better way, but I agree with you, this doesn't look robust now.
@nklsw So the problem is just with the format of the date? Then you can modify _insert_to_table()
to reformat it according to the target db.