bug: Ambiguous column error when loading different case columns via COPY INTO
Search before asking
- [X] I had searched in the issues and found no similar issues.
Version
v1.2.635-nightly
What's Wrong?
When trying to load data via COPY INTO from a parquet file with case-sensitive columns, an ambiguous column error is reported. It looks to come from https://github.com/datafuselabs/databend/blob/2c41faf63b5136ab22a2ab3ac972c9752dc4a7bf/src/query/sql/src/planner/binder/bind_context.rs#L301
The parquet file contains columns such as `c1` and `C1`, both of which exist in the target table.
How to Reproduce?
Sorry, it's not easy for me to write a script to replicate this, but it's easy to describe:
- Create a parquet file with columns whose names differ only in case, e.g. `c1` and `C1` (a sketch for generating such a file follows the SQL below)
- Create a table to receive these columns: `create table t_sensitive ("c1" int, "C1" int);`
- Stage the parquet file
- Load the parquet file via COPY INTO:

```sql
COPY INTO t_sensitive
FROM 'gcs://bucket/folder/'
CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>')
FILE_FORMAT = (TYPE = PARQUET)
FORCE = TRUE;
```
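For a self-contained repro of the file itself, here is a minimal sketch using the Rust `arrow` and `parquet` crates; the crate choice, the sample values, and the output file name are assumptions for illustration, not part of the original report:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Two columns whose names differ only in case, as in the report.
    let schema = Arc::new(Schema::new(vec![
        Field::new("c1", DataType::Int32, false),
        Field::new("C1", DataType::Int32, false),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(Int32Array::from(vec![1, 2, 3])),
        Arc::new(Int32Array::from(vec![10, 20, 30])),
    ];
    let batch = RecordBatch::try_new(schema.clone(), columns)?;

    // Write a single row group to a local file, then stage it as usual.
    let file = File::create("case_sensitive.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```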
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
The ambiguous column error you're encountering when using the COPY INTO command with case-sensitive columns (`c1` and `C1`) is likely due to the logic at line 301 in `bind_context.rs`. This logic checks if there are multiple tables in the metadata that are not sources of an index, and if so, it raises a SemanticError indicating that the table of the internal column is ambiguous:
```rust
if tables.len() > 1 {
    return Err(ErrorCode::SemanticError(format!(
        "The table of the internal column `{}` is ambiguous",
        column_binding.internal_column.column_name()
    )));
}
```
This error can occur if the column names are not being correctly distinguished based on case sensitivity during the binding process [1].
To address this issue, you may need to ensure that the `normalize_identifier` function correctly handles case sensitivity for both quoted and unquoted identifiers. The `normalize_identifier` function is used to normalize identifiers based on the case sensitivity settings defined in the `NameResolutionContext`:
```rust
pub fn normalize_identifier(ident: &Identifier, context: &NameResolutionContext) -> Identifier {
    if (ident.is_quoted() && context.quoted_ident_case_sensitive)
        || (!ident.is_quoted() && context.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier::from_name(ident.span, ident.name.to_lowercase())
    }
}
```
You should verify that the `NameResolutionContext` is configured correctly for your use case, ensuring that quoted identifiers are treated as case-sensitive and unquoted identifiers are handled according to your requirements [2].
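To make the interaction of the two flags concrete, here is a standalone mock of the same normalization rule; the `Identifier` and `NameResolutionContext` structs below are simplified stand-ins for illustration, not the real databend types:

```rust
// Simplified stand-ins for databend's Identifier / NameResolutionContext.
#[derive(Debug, Clone)]
struct Identifier {
    name: String,
    quoted: bool,
}

struct NameResolutionContext {
    unquoted_ident_case_sensitive: bool,
    quoted_ident_case_sensitive: bool,
}

fn normalize_identifier(ident: &Identifier, ctx: &NameResolutionContext) -> Identifier {
    // Same rule as the snippet above: keep the name as written only when the
    // matching case-sensitivity flag is set; otherwise fold to lowercase.
    if (ident.quoted && ctx.quoted_ident_case_sensitive)
        || (!ident.quoted && ctx.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier {
            name: ident.name.to_lowercase(),
            quoted: ident.quoted,
        }
    }
}

fn main() {
    // A typical context: quoted identifiers keep case, unquoted ones are folded.
    let ctx = NameResolutionContext {
        unquoted_ident_case_sensitive: false,
        quoted_ident_case_sensitive: true,
    };
    let quoted = Identifier { name: "C1".into(), quoted: true };
    let unquoted = Identifier { name: "C1".into(), quoted: false };
    // "C1" survives only when quoted; unquoted C1 folds to "c1" and can no
    // longer be told apart from the other column.
    println!("{}", normalize_identifier(&quoted, &ctx).name);   // C1
    println!("{}", normalize_identifier(&unquoted, &ctx).name); // c1
}
```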
If the issue persists, you might need to review the logic in the COPY INTO command's processing of Parquet files to ensure that it correctly handles case-sensitive column names. The `ParquetCopySource` struct and its methods (`event`, `process`, and `async_process`) are responsible for processing Parquet files during the COPY INTO operation [3][4][5].
By ensuring that the case sensitivity settings and the identifier normalization logic are correctly implemented, you should be able to resolve the ambiguous column error.
@youngsofun might be able to assist with this.
Any ideas on this, @youngsofun? Is there any setting I could try to adjust?
When copying from parquet, fields are matched by name. Currently, when reading a parquet file, the file schema is modified so that all field names are turned to lowercase.
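In other words, after the lowercase fold the two file columns map to the same name. A tiny sketch of the effect (illustrative only, not the actual databend code):

```rust
use std::collections::HashMap;

fn main() {
    // The parquet file's field names, as in the report.
    let file_fields = ["c1", "C1"];
    let mut by_name: HashMap<String, &str> = HashMap::new();
    for f in file_fields {
        // Folding both names to lowercase maps them to the same key;
        // this collision is what surfaces as the ambiguity error.
        if let Some(prev) = by_name.insert(f.to_lowercase(), f) {
            eprintln!("`{f}` collides with `{prev}` after lowercasing");
        }
    }
}
```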
The available workaround is to use a transform:

```sql
CREATE STAGE stage1 URL = 'gcs://bucket/folder/' CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>');

COPY INTO t_sensitive FROM (SELECT * FROM @stage1) FORCE = TRUE;
```
I am also working on https://github.com/databendlabs/databend/issues/16897 for better support of this case.