data-diff icon indicating copy to clipboard operation
data-diff copied to clipboard

New database driver: Exasol

Open nklsw opened this issue 1 year ago • 9 comments

I'm happy to introduce a new db-driver for Exasol

Development & Testing

Local Dev Setup

  1. Bring up the Exasol Docker container:
docker-compose up -d exasol
  1. Download Sample Data
curl https://datafold-public.s3.us-west-2.amazonaws.com/1m.csv -o dev/ratings.csv
  1. Run python script to import CSV data into Exasol
poetry run python3 dev/exasol/_exasol_import_csv.py
  1. Run data-diff
poetry run python3 -m data_diff exasol://sys:exasol@localhost/DATADIFF rating exasol://sys:exasol@localhost/DATADIFF rating_del1 --verbose

Unittest

  1. Bring up mysql and postgres
docker-compose up -d mysql postgres
  1. Set env variable for Exasol URI
export DATADIFF_EXASOL_URI=exasol://sys:exasol@localhost/DATADIFF
  1. Run unittests
poetry run unittest-parallel -j 16

Note on Exasol Docker: There is currently an issue with Exasol and docker-compose which could lead to errors when restarting the existing container. If this happens you need to run the script in ./dev/exasol/reset_ip.sh to which will reset the IP config for Exasol and fix the problem.

nklsw avatar Aug 10 '22 09:08 nklsw

Looks like there's a version collision on numpy, with one of the other optional drivers. Is there a way to avoid this nested requirement?

Also, once it's stable, please rebase (git rebase -i) the commits down to a few. (i.e. without cleanup and fix commits)

erezsh avatar Aug 10 '22 09:08 erezsh

@erezsh I fixed the numpy version for Python 3.7.1 in poetry.lock and cleaned up my commits.

nklsw avatar Aug 10 '22 15:08 nklsw

@erezsh Thanks for your indications! I fixed the issues and ran black on all files.

nklsw avatar Aug 11 '22 13:08 nklsw

@nklsw The tests fail unless I import the csv, since the schema needs to be created first. But also, when using the test connection string, calling "create schema" fails with a 'schema not found' error. The only way forward I can see is to connect without a schema (like in the import) and create the schema, before doing connect_to_uri(), but then we can't use the URI, which means the user will have to specify another env variable for a URI without a schema, which is a little awkward.

Can you think of an elegant solution to this problem?

erezsh avatar Aug 11 '22 14:08 erezsh

@erezsh The problem only concerns testing, right?

At first glance, I also only see the possibility to create the schema before doing connect_to_uri(). A semi-elegant way would be to place this in in tests/common.py:

if CONN_STRINGS[db.Exasol]:
    import dsnparse
    import pyexasol
    dsn = dsnparse.parse(CONN_STRINGS[db.Exasol])
    exa = pyexasol.connect(dsn=dsn.host, user=dsn.user, password=dsn.password)
    exa.execute(f"CREATE SCHEMA IF NOT EXISTS \"{dsn.paths[0]}\"")
    exa.commit()

Although it is not optimal to use a different connection logic, in the tests the default connection via connect_to_uri() would still be used normally and the user only has to create a single env variable for the URI. What do you think?

nklsw avatar Aug 11 '22 17:08 nklsw

I ended up adding this:

if CONNS[db.Exasol]:
    kw = dict(CONNS[db.Exasol].kwargs)
    kw['driver'] = 'exasol'
    kw['host'] = kw.pop('dsn')
    schema = kw.pop('schema')
    conn_without_schema = db.connect.connect(kw)
    conn_without_schema.query(f'CREATE SCHEMA IF NOT EXISTS {schema}', None)

erezsh avatar Aug 11 '22 17:08 erezsh

There were a few things I had to fix, see PR #203

The only problem left that I wasn't sure how to fix are these lines:

        if target_db is db.Exasol and source_db is db.PostgreSQL and source_type == "timestamp with time zone":
            values_in_source = ((a, b.replace(tzinfo=None)) for a, b in values_in_source)

This doesn't look right. Maybe you need to set the timezone to 0 in Exasol itself?

erezsh avatar Aug 11 '22 17:08 erezsh

The problem is, that _insert_to_table() would try to insert something like 2022-08-12 12:23:45+0000 but Exasol only accepts YYYY-MM-DD HH24:MI:SS.FF6 for inserting timestamps. I'm currently not sure, how to deal with it in a better way, but I agree with you, this doesn't look robust now.

nklsw avatar Aug 12 '22 05:08 nklsw

@nklsw So the problem is just with the format of the date? Then you can modify _insert_to_table() to reformat it according to the target db.

erezsh avatar Aug 12 '22 07:08 erezsh