databend icon indicating copy to clipboard operation
databend copied to clipboard

bug: Ambiguous column error when loading different case columns via COPY INTO

Open rad-pat opened this issue 1 year ago โ€ข 1 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Version

v1.2.635-nightly

What's Wrong?

When trying to load data via COPY INTO from a parquet file with case sensitive columns, an ambiguous column error is reported. Looks to be from https://github.com/datafuselabs/databend/blob/2c41faf63b5136ab22a2ab3ac972c9752dc4a7bf/src/query/sql/src/planner/binder/bind_context.rs#L301

The parquet file contains columns such as c1 and C1 which exist within the target table.

How to Reproduce?

Sorry not easy for me to write a script to replicate, but easy to describe

  1. Create parquet file with columns that are case sensitive e.g. c1 & C1
  2. Create a table to receive these columns create table t_sensitive ("c1" int, "C1" int);
  3. Stage the parquet file
  4. Load parquet file via COPY INTO
COPY INTO t_sensitive
FROM 'gcs://bucket/folder/'
CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>')
FILE_FORMAT = (TYPE = PARQUET)
FORCE = TRUE;

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

rad-pat avatar Sep 19 '24 10:09 rad-pat

The ambiguous column error you're encountering when using the COPY INTO command with case-sensitive columns (c1 and C1) is likely due to the logic at line 301 in bind_context.rs. This logic checks if there are multiple tables in the metadata that are not sources of an index, and if so, it raises a SemanticError indicating that the table of the internal column is ambiguous:

if tables.len() > 1 {
    return Err(ErrorCode::SemanticError(format!(
        "The table of the internal column `{}` is ambiguous",
        column_binding.internal_column.column_name()
    )));
}

This error can occur if the column names are not being correctly distinguished based on case sensitivity during the binding process [1].

To address this issue, you may need to ensure that the normalize_identifier function correctly handles case sensitivity for both quoted and unquoted identifiers. The normalize_identifier function is used to normalize identifiers based on the case sensitivity settings defined in the NameResolutionContext:

pub fn normalize_identifier(ident: &Identifier, context: &NameResolutionContext) -> Identifier {
    if (ident.is_quoted() && context.quoted_ident_case_sensitive)
        || (!ident.is_quoted() && context.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier::from_name(ident.span, ident.name.to_lowercase())
    }
}

You should verify that the NameResolutionContext is configured correctly for your use case, ensuring that quoted identifiers are treated as case-sensitive and unquoted identifiers are handled according to your requirements [2].

If the issue persists, you might need to review the logic in the COPY INTO command's processing of Parquet files to ensure that it correctly handles case-sensitive column names. The ParquetCopySource struct and its methods (event, process, and async_process) are responsible for processing Parquet files during the COPY INTO operation [3][4][5].

By ensuring that the case sensitivity settings and the identifier normalization logic are correctly implemented, you should be able to resolve the ambiguous column error.

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 19 '24 10:09 dosubot[bot]

@youngsofun might be able to assist with this.

BohuTANG avatar Nov 15 '24 09:11 BohuTANG

Any ideas on this @youngsofun ? Is there any setting I could try to adjust?

rad-pat avatar Nov 20 '24 11:11 rad-pat

when copy from parquet fields are matched by name. currently๏ผŒ when reading parquet file, the fields of file schema is modified that all field names are turned to lowercase.

the Available solution is to use transform

  create stage stage1  url='gcs://bucket/folder/'  CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>');
  COPY INTO t_sensitive from (select * FROM @stage1) force=true;

youngsofun avatar Nov 21 '24 03:11 youngsofun

and I am working on this https://github.com/databendlabs/databend/issues/16897 for better support of this issue

youngsofun avatar Nov 21 '24 03:11 youngsofun