Data type not detected if first batch of records is null
dlt version
0.5.2
Describe the problem
I have a SQL database in SAP HANA that I want to move to Postgres using dlt. However, I am running into an issue with a few of the tables that have a lot of NULLs. The columns are of type TIMESTAMP, BOOLEAN, or INT64, but there are so many NULLs that the first chunk of loaded SQL contains only NULLs.
As a result, the data type is incorrectly inferred as STRING, which then leads to an error when loading the Parquet files:
ValueError: Table schema does not match schema used to create file:
table:
objectid: double
...
my_field: double vs.
file:
objectid: double
...
my_field: string
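To illustrate the inference problem outside of dlt, here is a minimal sketch with plain pandas and pyarrow (column name and values are made up): an all-NULL first chunk is typed differently than a later chunk that contains real values, so the two chunks cannot share a Parquet schema.

import pandas as pd
import pyarrow as pa

# First chunk: only NULLs -> pandas falls back to "object", and pyarrow
# cannot recover the real column type from it.
chunk1 = pd.DataFrame({"my_field": [None, None]})
# Later chunk: actual values -> a different, now conflicting type.
chunk2 = pd.DataFrame({"my_field": [1.5, 2.5]})

print(pa.Table.from_pandas(chunk1).schema)  # my_field: null
print(pa.Table.from_pandas(chunk2).schema)  # my_field: double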
Some notes
- I am using the pandas backend of the sql_database source.
- For date-specific columns, I have been able to work around the issue by adding the following to TableLoader._load_rows (a short parse_dates illustration follows this list):
# Collect the columns whose dlt data type is date/time-like so pandas parses them
# explicitly instead of inferring a type from (possibly all-NULL) chunk values.
date_cols = [
    col for col in self.columns
    if self.columns[col].get("data_type") in ("timestamp", "date", "time", "timestamptz")
]
df = _wrap_result(
    partition,
    columns,
    parse_dates=date_cols or None,
    **{"dtype_backend": "pyarrow", **backend_kwargs},
)
- Note that when we generate the column definition for dlt, the sqlalchemy metadata object actually gives the correct data type. Is there a reason we wouldn't just use that?
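To illustrate the parse_dates workaround above, here is a small sketch using a CSV stand-in for a SQL result set (the column names are hypothetical): passing parse_dates forces a datetime dtype even when every value in the chunk is NULL, instead of letting pandas fall back to a guessed type.

import io
import pandas as pd

data = "id,created_at\n1,\n2,\n"  # created_at is NULL in every row of this chunk

plain = pd.read_csv(io.StringIO(data))
print(plain["created_at"].dtype)  # float64 -- nothing date-like survives

parsed = pd.read_csv(io.StringIO(data), parse_dates=["created_at"])
print(parsed["created_at"].dtype)  # datetime64[ns], even though every value is NaT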
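Regarding the last note: a rough sketch, assuming reflection via sqlalchemy, of how the declared column types are available from the metadata object regardless of what any particular chunk of rows contains (connection string and table name are placeholders):

import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@localhost/db")  # placeholder URL
metadata = sa.MetaData()

# Reflect the table; col.type is the declared database type, e.g. TIMESTAMP,
# even if the first N rows of that column are all NULL.
table = sa.Table("my_table", metadata, autoload_with=engine)
for col in table.columns:
    print(col.name, col.type)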
Expected behavior
The dataframe chunks should take the data type from the column definitions - at least if the values are all NULLs.
Steps to reproduce
Create a database with a table whose first rows contain only NULLs in a column with a non-string data type (e.g. TIMESTAMP). Then set the chunk size to be smaller than the number of leading NULLs.
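A hypothetical reproduction sketch, assuming the sql_database verified source with the pandas backend (connection string, table name, and exact parameters are placeholders, and the import path depends on how the source is scaffolded):

import dlt
from sql_database import sql_table  # verified-source scaffold; import path may differ

# "events.my_field" is a TIMESTAMP column whose leading rows are all NULL;
# a chunk size smaller than that run of NULLs should trigger the mismatch.
source = sql_table(
    credentials="hana://user:password@hana-host:39015",  # placeholder connection string
    table="events",
    backend="pandas",
    chunk_size=1000,
)

pipeline = dlt.pipeline(pipeline_name="repro", destination="postgres", dataset_name="repro")
pipeline.run(source)  # fails with the "Table schema does not match" ValueError shown above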
Operating system
macOS
Runtime environment
Airflow
Python version
3.11
dlt data source
sql_database
dlt destination
Postgres
Other deployment details
No response
Additional information
No response