Data type not detected if first batch of records is null
dlt version
0.5.2
Describe the problem
I have a SQL database in SAP HANA that I want to move to Postgres using dlt. However, I am running into an issue with a few of the tables that have a lot of NULLs. The columns are of type TIMESTAMP, BOOLEAN, or INT64, but there are so many NULLs that the first chunk of loaded SQL contains only NULLs.
As a result, the data type is incorrectly inferred as STRING, which then leads to an error when loading the Parquet files:
ValueError: Table schema does not match schema used to create file:
table:
objectid: double
...
my_field: double vs.
file:
objectid: double
...
my_field: string
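To illustrate the inference problem outside of dlt, here is a minimal sketch with plain pandas and pyarrow (column name and values are made up): an all-NULL first chunk is typed differently than a later chunk that contains real values, so the two chunks cannot share a Parquet schema.

import pandas as pd
import pyarrow as pa

# First chunk: only NULLs -> pandas falls back to "object", and pyarrow
# cannot recover the real column type from it.
chunk1 = pd.DataFrame({"my_field": [None, None]})
# Later chunk: actual values -> a different, now conflicting type.
chunk2 = pd.DataFrame({"my_field": [1.5, 2.5]})

print(pa.Table.from_pandas(chunk1).schema)  # my_field: null
print(pa.Table.from_pandas(chunk2).schema)  # my_field: double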
Some notes
- I am using the pandas backend of the sql_database source.
- For date-specific columns, I have been able to work around the issue by adding the following to TableLoader._load_rows (a short parse_dates illustration follows this list):
# Collect the columns whose dlt data type is date/time-like so pandas parses them
# explicitly instead of inferring a type from (possibly all-NULL) chunk values.
date_cols = [
    col for col in self.columns
    if self.columns[col].get("data_type") in ("timestamp", "date", "time", "timestamptz")
]
df = _wrap_result(
    partition,
    columns,
    parse_dates=date_cols or None,
    **{"dtype_backend": "pyarrow", **backend_kwargs},
)
- Note that when we generate the column definition for dlt, the sqlalchemy metadata object actually gives the correct data type. Is there a reason we wouldn't just use that?
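To illustrate the parse_dates workaround above, here is a small sketch using a CSV stand-in for a SQL result set (the column names are hypothetical): passing parse_dates forces a datetime dtype even when every value in the chunk is NULL, instead of letting pandas fall back to a guessed type.

import io
import pandas as pd

data = "id,created_at\n1,\n2,\n"  # created_at is NULL in every row of this chunk

plain = pd.read_csv(io.StringIO(data))
print(plain["created_at"].dtype)  # float64 -- nothing date-like survives

parsed = pd.read_csv(io.StringIO(data), parse_dates=["created_at"])
print(parsed["created_at"].dtype)  # datetime64[ns], even though every value is NaT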
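Regarding the last note: a rough sketch, assuming reflection via sqlalchemy, of how the declared column types are available from the metadata object regardless of what any particular chunk of rows contains (connection string and table name are placeholders):

import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@localhost/db")  # placeholder URL
metadata = sa.MetaData()

# Reflect the table; col.type is the declared database type, e.g. TIMESTAMP,
# even if the first N rows of that column are all NULL.
table = sa.Table("my_table", metadata, autoload_with=engine)
for col in table.columns:
    print(col.name, col.type)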
Expected behavior
The dataframe chunks should take the data type from the column definitions - at least if the values are all NULLs.
Steps to reproduce
Create a database with a table whose first rows contain only NULLs in a column with a non-string data type (e.g. TIMESTAMP). Then set the chunk size to be smaller than the number of leading NULLs.
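A hypothetical reproduction sketch, assuming the sql_database verified source with the pandas backend (connection string, table name, and exact parameters are placeholders, and the import path depends on how the source is scaffolded):

import dlt
from sql_database import sql_table  # verified-source scaffold; import path may differ

# "events.my_field" is a TIMESTAMP column whose leading rows are all NULL;
# a chunk size smaller than that run of NULLs should trigger the mismatch.
source = sql_table(
    credentials="hana://user:password@hana-host:39015",  # placeholder connection string
    table="events",
    backend="pandas",
    chunk_size=1000,
)

pipeline = dlt.pipeline(pipeline_name="repro", destination="postgres", dataset_name="repro")
pipeline.run(source)  # fails with the "Table schema does not match" ValueError shown above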
Operating system
macOS
Runtime environment
Airflow
Python version
3.11
dlt data source
sql_database
dlt destination
Postgres
Other deployment details
No response
Additional information
No response