
Feat/1492 extend timestamp config


Description

dlt does not support timestamps without timezones; to pass such timestamps from source to destination, we need to extend the core library.

The scope of this PR is to implement a proof of concept (POC) for DuckDB and parquet. This includes updating the type mapper and adding tests where the timezone is passed via columns.
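For context on the parquet side, Arrow encodes the timezone in the timestamp type itself, so a naive and a tz-aware column are genuinely different types. A minimal pyarrow illustration (not code from this PR):

import pyarrow as pa

# A timestamp without a timezone is "wall clock" time; one with a timezone
# is an absolute instant. Parquet inherits this distinction from Arrow.
naive = pa.field("created_at", pa.timestamp("us"))            # timestamp[us]
aware = pa.field("created_at", pa.timestamp("us", tz="UTC"))  # timestamp[us, tz=UTC]

print(naive.type)
print(aware.type)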

Additionally, a complementary notebook with various experiments is included in this PR.

https://github.com/dlt-hub/dlt/blob/feat/1492-extend-timestamp-config/notebooks/exp-timestamps-timezone.ipynb

@rudolfix's requirements to solve #1492:

  1. We'd need a new hint in TColumnType that says whether data_type carries a timezone or not (or define the time zone as a string, defaulting to "UTC"). See the precision and scale hints already there; the idea is the same. A usage sketch follows this list.
  2. Each of our destinations has a type mapper which needs to be able to interpret this new hint.
  3. pyarrow.py contains functions that convert the dlt schema into an arrow schema and vice versa. those functions need to be upgraded
  4. we'll need plenty of tests: loading non-tz-aware datetimes via JSON and parquet into all possible destinations
  5. let's make it work for duckdb + postgres + snowflake first
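A minimal sketch of how the new hint could be used on a resource, assuming the hint is named timezone and sits next to precision in the column schema (names are illustrative, not final):

import dlt
from datetime import datetime

# Hypothetical usage of the proposed hint: `timezone: False` requests a
# timestamp column without a timezone, analogous to the existing
# `precision` and `scale` hints.
@dlt.resource(
    columns={"event_ts": {"data_type": "timestamp", "precision": 6, "timezone": False}}
)
def events():
    yield {"id": 1, "event_ts": datetime(2024, 7, 30, 10, 0, 0)}  # naive datetime

pipeline = dlt.pipeline(destination="duckdb")
pipeline.run(events())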

Related Issues

  • Resolves #1492

donotpush · Aug 07 '24 10:08


LGTM! The implementation idea and code structure are good. duckdb was a hard case because we had technical debt: we silently created columns without timezones when the requested precision required it. now we can do it right.
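For reference on why duckdb was the hard case: its tz-aware type is fixed at microsecond precision, so higher precision is only available without a timezone. A quick check with the duckdb Python client (assumed setup, not PR code):

import duckdb

con = duckdb.connect()
# TIMESTAMPTZ in DuckDB is always microsecond precision; nanoseconds are
# only available as TIMESTAMP_NS, which carries no timezone. The old
# behavior silently fell back to TIMESTAMP_NS, dropping the timezone.
con.execute("CREATE TABLE t (tz_us TIMESTAMPTZ, naive_ns TIMESTAMP_NS)")
print(con.execute("DESCRIBE t").fetchall())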

my take: try to implement postgres and snowflake next. if you need any credentials ping us, but I think you have access to the CI vault.

thx for the tests

@rudolfix I've addressed your review comments and added more tests.

I'll review and fix the failing CI/CD tests.

donotpush · Aug 13 '24 10:08

Added the PostgreSQL implementation and tests.
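The mapping Postgres needs is straightforward, since it natively has both variants. A hypothetical helper sketching the rule the type mapper applies (not the actual mapper code):

def to_postgres_timestamp(precision: int | None, timezone: bool) -> str:
    # Postgres supports timestamp(p) with 0 <= p <= 6; the timezone flag
    # picks between "with time zone" and "without time zone".
    p = f"({precision})" if precision is not None else ""
    tz = "with time zone" if timezone else "without time zone"
    return f"timestamp{p} {tz}"

assert to_postgres_timestamp(6, True) == "timestamp(6) with time zone"
assert to_postgres_timestamp(None, False) == "timestamp without time zone"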

Links to my notebooks:

  • DuckDB Timestamps Timezone: 1492-duckdb-timestamps-timezone.ipynb
  • PostgreSQL Timestamps Timezone: 1492-postgres-timestamps-timezone.ipynb

Working on Snowflake.

donotpush · Aug 20 '24 13:08

@rudolfix - Comment updated.

Warning log added.

BigQuery - I attempted to implement the timezone flag for BigQuery using both the TIMESTAMP and DATETIME data types, but I ran into issues. BigQuery does not automatically handle timezones in incoming data for the DATETIME type, which leads to errors. Transforming (or stripping) JSON or other data types before insertion doesn't seem to be a viable solution.

I have documented my findings here:
BigQuery Exploration Notebook
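For reference, the two BigQuery types split along the same line: TIMESTAMP is an absolute instant normalized to UTC, while DATETIME is a wall-clock value with no timezone, and BigQuery will not coerce tz-aware input into DATETIME. A minimal schema sketch with the google-cloud-bigquery client (project/dataset/table names are placeholders):

from google.cloud import bigquery

# TIMESTAMP stores an absolute point in time (normalized to UTC);
# DATETIME stores a civil "wall clock" value with no timezone attached.
schema = [
    bigquery.SchemaField("ts_aware", "TIMESTAMP"),
    bigquery.SchemaField("ts_naive", "DATETIME"),
]
client = bigquery.Client()
table = bigquery.Table("my-project.my_dataset.events", schema=schema)  # placeholder name
client.create_table(table)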

Athena - Athena only supports a single TIMESTAMP type, which allows no timezone or precision configuration. The AWS documentation can be confusing: it lists DDL and DML columns, where "Not Available" indicates a lack of support for the other TIMESTAMP types. I ran a few tests on my AWS account with the following data:

File uploaded to s3://test-athena-julian/test/1.txt:

1 2024-07-30T10:00:00.12
2 2024-07-30T11:00:00.123456
3 2024-07-30T12:00:00.1234
4 2024-07-30T10:00:00.1234567
5 2024-07-30T11:00:00.123456789

I used the following DDL statement to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
  id INT,
  t TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 's3://test-athena-julian/test';
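To inspect how Athena parsed the fractional seconds, the table can be queried like this (sketch using pyathena; the staging bucket and region are placeholders for my account setup):

from pyathena import connect

# Placeholders: point these at your own staging bucket and region.
cursor = connect(
    s3_staging_dir="s3://test-athena-julian/staging/",
    region_name="eu-central-1",
).cursor()
cursor.execute("SELECT id, t FROM events ORDER BY id")
for row in cursor.fetchall():
    print(row)  # shows how many fractional digits survived parsing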

Amazon Athena Data Types Documentation

donotpush · Aug 23 '24 16:08