sqlmesh icon indicating copy to clipboard operation
sqlmesh copied to clipboard

duckdb: sqlmesh changes case of "columns" argument of read_json (and friends) resulting in null values with mixed-case keys

Open seandavi opened this issue 6 months ago • 1 comments

When rendering column names from the read_json and friends in duckdb, case is not maintained. This result is a table with the correct structure, but with NULL values for all fields. If I manually use the rendered SQL but fix the column names to the correct case, I get a complete table.

I recognize this is probably wonky duckdb behavior. If there is a workaround for sqlmesh, though, to make this work, I'd much appreciate it.

select
    *
FROM
    read_ndjson(
        '/tmp/omicidx/nih_reporter/nih_reporter_publications*.jsonl.gz',auto_detect=true,columns = {
            AFFILIATION: 'TEXT',
            AUTHOR_LIST: 'TEXT',
            COUNTRY: 'TEXT',
            ISSN: 'TEXT',
            JOURNAL_ISSUE: 'TEXT',
            JOURNAL_TITLE: 'TEXT',
            JOURNAL_TITLE_ABBR: 'TEXT',
            JOURNAL_VOLUME: 'TEXT',
            LANG: 'TEXT',
            PAGE_NUMBER: 'TEXT',
            PMC_ID: 'TEXT',
            PMID: 'BIGINT',
            PUB_DATE: 'TEXT',
            PUB_TITLE: 'TEXT',
            PUB_YEAR: 'TEXT'
        }
    );

renders as:

SELECT                                                                                        
  *                                                                                           
FROM READ_NDJSON(                                                                             
  '/tmp/omicidx/nih_reporter/nih_reporter_publications*.jsonl.gz',                            
  "auto_detect" = TRUE,                                                                       
  "columns" = {'affiliation': 'TEXT', 'author_list': 'TEXT', 'country': 'TEXT', 'issn':       
'TEXT', 'journal_issue': 'TEXT', 'journal_title': 'TEXT', 'journal_title_abbr': 'TEXT',       
'journal_volume': 'TEXT', 'lang': 'TEXT', 'page_number': 'TEXT', 'pmc_id': 'TEXT', 'pmid':    
'BIGINT', 'pub_date': 'TEXT', 'pub_title': 'TEXT', 'pub_year': 'TEXT'}                        
) AS "_q_0"

If you need a sample of the file, let me know. I can pass along a few lines for testing.

seandavi avatar Aug 15 '25 14:08 seandavi

DuckDB treats unquoted identifiers as lowercase and SQLMesh makes this explicit at the normalization step.

When specifying the columns, does it let you quote the column names? eg:

select * from read_ndjson(
        '/path/to/*.jsonl.gz', ..., columns = {
            "AFFILIATION": 'TEXT',
            "AUTHOR_LIST": 'TEXT',
          ...etc
       }
)

If you can get a quoted identifier, SQLMesh will preserve the case through the normalization step

erindru avatar Aug 17 '25 22:08 erindru