Feat/1492 extend timestamp config
Description
dlt does not support timestamps without timezones; to pass such timestamps from source to destination, the core library needs to be extended.
The scope of this PR is to implement a proof of concept (POC) for DuckDB and parquet. This includes updating the type mapper and adding tests where the timezone is passed via column hints.
Additionally, a complementary notebook with various experiments is included in this PR.
https://github.com/dlt-hub/dlt/blob/feat/1492-extend-timestamp-config/notebooks/exp-timestamps-timezone.ipynb
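Since the POC covers parquet, the relevant primitive is Arrow's timestamp type, which encodes the presence of a timezone as metadata on the type itself. A minimal sketch using plain pyarrow (nothing dlt-specific) of the distinction the dlt <-> arrow schema conversion has to round-trip:

```python
import pyarrow as pa

# Arrow distinguishes the two cases purely by the `tz` metadata on the
# timestamp type; tz=None means a naive (timezone-less) timestamp.
naive = pa.timestamp("us")            # timestamp without timezone
aware = pa.timestamp("us", tz="UTC")  # timestamp with timezone

schema = pa.schema([("created_at", naive), ("loaded_at", aware)])
```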
@rudolfix's requirements to solve #1492:
- We'd need a new hint in `TColumnType` that says whether the `data_type` contains a timezone (or define the time zone as a string, defaulting to "UTC"). See the `precision` and `scale` hints already there; the idea is the same. A sketch of how such a hint could be used follows this list.
- Each of our destinations has a type mapper, which needs to be able to interpret this new hint.
- `pyarrow.py` contains functions that convert a dlt schema into an arrow schema and vice versa; those functions need to be upgraded.
- We'll need plenty of tests: loading non-tz-aware datetimes via JSON and parquet into all possible destinations.
- Let's make it work for duckdb + postgres + snowflake first.
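A minimal sketch of what the hint could look like from the user's side, assuming it is named `timezone` and behaves like the existing `precision`/`scale` hints:

```python
import dlt

# Declare a naive timestamp column via column hints; the hint name
# `timezone` is the one proposed above and is assumed here.
@dlt.resource(
    columns={
        "created_at": {"data_type": "timestamp", "timezone": False, "precision": 6}
    }
)
def events():
    yield {"id": 1, "created_at": "2024-07-30T10:00:00.123456"}

# With the extended type mapper, `created_at` should land in DuckDB as a
# plain TIMESTAMP rather than TIMESTAMP WITH TIME ZONE.
pipeline = dlt.pipeline(pipeline_name="tz_poc", destination="duckdb")
pipeline.run(events())
```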
Related Issues
- Resolves #1492
Deploy Preview for dlt-hub-docs canceled.
| Name | Link |
|---|---|
| Latest commit | 6ef862aacc6247aa2cd0eb4f7f004ff2ff555bda |
| Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/66cf3ab0fab24a00086b8862 |
LGTM! The implementation idea and code structure are good. DuckDB was a hard case because we had technical debt: we silently created columns without timezones if the precision required it. Now we can do it right.
My take: try to implement Postgres and Snowflake next. If you need any credentials, ping us, but I think you have access to the CI vault.
Thanks for the tests.
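For context on the DuckDB case mentioned above: `TIMESTAMPTZ` is timezone-aware but fixed at microsecond precision, while the variants covering other precisions (e.g. `TIMESTAMP_NS`) are timezone-naive, which is why a precision hint used to silently force a naive column. A quick sketch:

```python
import duckdb

# Illustration of the DuckDB type constraint: TIMESTAMPTZ carries a
# timezone at microsecond precision; TIMESTAMP_NS offers nanosecond
# precision but no timezone.
con = duckdb.connect()
con.execute("CREATE TABLE ts_demo (t_tz TIMESTAMPTZ, t_ns TIMESTAMP_NS)")
print(con.execute("DESCRIBE ts_demo").fetchall())
```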
@rudolfix I've implemented your review comments and added more tests.
I'll review & fix the failing CI/CD tests.
Added the PostgreSQL implementation and tests.
Links to my notebooks:
| Notebook | Link |
|---|---|
| DuckDB Timestamps Timezone | 1492-duckdb-timestamps-timezone.ipynb |
| PostgreSQL Timestamps Timezone | 1492-postgres-timestamps-timezone.ipynb |
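For reference, a hypothetical end-to-end check against Postgres (credentials assumed to be configured via `secrets.toml` or environment variables): with `timezone=False`, the column should be created as `timestamp without time zone` rather than `timestamptz`.

```python
import dlt

# Load a naive datetime into Postgres; the assumed `timezone` hint should
# select "timestamp without time zone" for the `ts` column.
pipeline = dlt.pipeline(pipeline_name="tz_pg_poc", destination="postgres")
pipeline.run(
    [{"id": 1, "ts": "2024-07-30T10:00:00.123456"}],
    table_name="events",
    columns={"ts": {"data_type": "timestamp", "timezone": False}},
)
```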
Working on Snowflake.
@rudolfix - Comment updated.
Warning log added.
BigQuery - I attempted to implement the timezone flag for BigQuery using both the TIMESTAMP and DATETIME data types, but encountered issues. BigQuery does not automatically handle timezones in incoming data for the DATETIME type, leading to errors. Transforming (or stripping) JSON or other data types before insertion doesn't seem to be a viable solution.
I have documented my findings here:
BigQuery Exploration Notebook
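The underlying distinction, illustrated with a hedged sketch (assumes configured GCP credentials; the exact error text may vary): BigQuery's TIMESTAMP is an absolute point in time and accepts offsets, while DATETIME is timezone-naive and rejects them.

```python
from google.cloud import bigquery

client = bigquery.Client()

# TIMESTAMP accepts an explicit UTC offset in the literal.
client.query("SELECT TIMESTAMP '2024-07-30 10:00:00+02:00' AS t").result()

# The equivalent DATETIME literal fails, since DATETIME carries no offset:
# client.query("SELECT DATETIME '2024-07-30 10:00:00+02:00' AS t").result()
```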
Athena - Athena only supports a single TIMESTAMP type, which allows neither timezone nor precision configuration. The AWS documentation can be confusing: it lists DDL and DML columns, with "Not Available" indicating a lack of support for other TIMESTAMP types. I ran a few tests on my AWS account with the following data:
File uploaded to s3://test-athena-julian/test/1.txt:
1 2024-07-30T10:00:00.12
2 2024-07-30T11:00:00.123456
3 2024-07-30T12:00:00.1234
4 2024-07-30T10:00:00.1234567
5 2024-07-30T11:00:00.123456789
I used the following DDL statement to create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
id INT,
t TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 's3://test-athena-julian/test';