duckdb: sqlmesh changes case of "columns" argument of read_json (and friends) resulting in null values with mixed-case keys
When rendering column names from the read_json and friends in duckdb, case is not maintained. This result is a table with the correct structure, but with NULL values for all fields. If I manually use the rendered SQL but fix the column names to the correct case, I get a complete table.
I recognize this is probably wonky duckdb behavior. If there is a workaround for sqlmesh, though, to make this work, I'd much appreciate it.
select
*
FROM
read_ndjson(
'/tmp/omicidx/nih_reporter/nih_reporter_publications*.jsonl.gz',auto_detect=true,columns = {
AFFILIATION: 'TEXT',
AUTHOR_LIST: 'TEXT',
COUNTRY: 'TEXT',
ISSN: 'TEXT',
JOURNAL_ISSUE: 'TEXT',
JOURNAL_TITLE: 'TEXT',
JOURNAL_TITLE_ABBR: 'TEXT',
JOURNAL_VOLUME: 'TEXT',
LANG: 'TEXT',
PAGE_NUMBER: 'TEXT',
PMC_ID: 'TEXT',
PMID: 'BIGINT',
PUB_DATE: 'TEXT',
PUB_TITLE: 'TEXT',
PUB_YEAR: 'TEXT'
}
);
renders as:
SELECT
*
FROM READ_NDJSON(
'/tmp/omicidx/nih_reporter/nih_reporter_publications*.jsonl.gz',
"auto_detect" = TRUE,
"columns" = {'affiliation': 'TEXT', 'author_list': 'TEXT', 'country': 'TEXT', 'issn':
'TEXT', 'journal_issue': 'TEXT', 'journal_title': 'TEXT', 'journal_title_abbr': 'TEXT',
'journal_volume': 'TEXT', 'lang': 'TEXT', 'page_number': 'TEXT', 'pmc_id': 'TEXT', 'pmid':
'BIGINT', 'pub_date': 'TEXT', 'pub_title': 'TEXT', 'pub_year': 'TEXT'}
) AS "_q_0"
If you need a sample of the file, let me know. I can pass along a few lines for testing.
DuckDB treats unquoted identifiers as lowercase and SQLMesh makes this explicit at the normalization step.
When specifying the columns, does it let you quote the column names? eg:
select * from read_ndjson(
'/path/to/*.jsonl.gz', ..., columns = {
"AFFILIATION": 'TEXT',
"AUTHOR_LIST": 'TEXT',
...etc
}
)
If you can get a quoted identifier, SQLMesh will preserve the case through the normalization step