bug: `extractNormalizedInsertQueryAndColumns` does not remove backticks inside a dot-delimited column name
Observed
extractNormalizedInsertQueryAndColumns cleans up the leading and trailing backticks in columns such as "events.attributes", but leaves the internal backticks (as in `events`.`attributes`) untouched. The resulting column-name mismatch leads to a "block cannot be sorted" error.
This affects Nested type columns used via go-gorm.
Expected behaviour
Ideally, extractNormalizedInsertQueryAndColumns should handle this case and normalize `events`.`attributes` to events.attributes.
Working `extractNormalizedInsertQueryAndColumns`
func extractNormalizedInsertQueryAndColumns(query string) (normalizedQuery string, tableName string, columns []string, err error) {
    query = truncateFormat.ReplaceAllString(query, "")
    query = truncateValues.ReplaceAllString(query, "")

    matches := normalizeInsertQueryMatch.FindStringSubmatch(query)
    if len(matches) == 0 {
        err = errors.Errorf("invalid INSERT query: %s", query)
        return
    }

    normalizedQuery = fmt.Sprintf("%s FORMAT Native", matches[1])
    tableName = strings.TrimSpace(matches[2])

    columns = make([]string, 0)
    matches = extractInsertColumnsMatch.FindStringSubmatch(matches[1])
    if len(matches) == 2 {
        columns = strings.Split(matches[1], ",")
        for i := range columns {
            // refers to https://clickhouse.com/docs/en/sql-reference/syntax#identifiers
            // we can use identifiers with double quotes or backticks, for example: "id", `id`, but not both, like `"id"`.
            col := strings.TrimSpace(columns[i])      // trim leading/trailing whitespace
            col = strings.Trim(col, "\"")             // trim leading/trailing double quotes
            col = strings.Trim(col, "`")              // trim leading/trailing backticks
            // columns like events.attributes arrive quoted per segment, i.e. `events`.`attributes`
            col = strings.ReplaceAll(col, "`.`", ".") // replace the internal backtick-quoted dot with a plain dot
            columns[i] = col
        }
    }

    return
}
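For illustration, a minimal standalone sketch of just the per-column cleanup on a dotted, backtick-quoted identifier (`normalizeColumn` is a hypothetical helper that mirrors the trimming and replacement above; it is not part of the driver and skips the regex pipeline):

package main

import (
    "fmt"
    "strings"
)

// normalizeColumn mirrors the per-column cleanup from the function above:
// trim whitespace and outer quotes/backticks, then collapse `.` quoting.
func normalizeColumn(raw string) string {
    col := strings.TrimSpace(raw)
    col = strings.Trim(col, "\"")
    col = strings.Trim(col, "`")
    return strings.ReplaceAll(col, "`.`", ".")
}

func main() {
    // go-gorm quotes each identifier segment, producing `events`.`attributes`
    fmt.Println(normalizeColumn(" `events`.`attributes` ")) // events.attributes
    fmt.Println(normalizeColumn("`id`"))                    // id
}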
`show create traces`
SHOW CREATE TABLE ctl_api.otel_traces
Query id: 69b8f8cf-71c2-4967-8cba-d80945a4be78
┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │ CREATE TABLE ctl_api.otel_traces
(
`id` String,
`created_by_id` String,
`created_at` DateTime64(3),
`updated_at` DateTime64(3),
`deleted_at` UInt64,
`runner_id` String,
`runner_job_id` String,
`runner_job_execution_id` String,
`timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
`timestamp_date` Date DEFAULT toDate(timestamp),
`timestamp_time` DateTime DEFAULT toDateTime(timestamp),
`resource_attributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`resource_schema_url` LowCardinality(String) CODEC(ZSTD(1)),
`scope_name` String CODEC(ZSTD(1)),
`scope_version` LowCardinality(String) CODEC(ZSTD(1)),
`scope_attributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`scope_dropped_attr_count` Int64,
`scope_schema_url` LowCardinality(String) CODEC(ZSTD(1)),
`trace_id` String CODEC(ZSTD(1)),
`span_id` String CODEC(ZSTD(1)),
`parent_span_id` String CODEC(ZSTD(1)),
`trace_state` String CODEC(ZSTD(1)),
`span_name` LowCardinality(String) CODEC(ZSTD(1)),
`span_kind` LowCardinality(String) CODEC(ZSTD(1)),
`service_name` LowCardinality(String) CODEC(ZSTD(1)),
`span_attributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`duration` Int64 CODEC(ZSTD(1)),
`status_code` LowCardinality(String) CODEC(ZSTD(1)),
`status_message` String CODEC(ZSTD(1)),
`events.timestamp` Array(DateTime64(9)),
`events.name` Array(LowCardinality(String)),
`events.attributes` Array(Map(LowCardinality(String), String)),
`links.trace_id` Array(String),
`links.span_id` Array(String),
`links.span_state` Array(String),
`links.attributes` Array(Map(LowCardinality(String), String)),
INDEX idx_trace_id trace_id TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_span_attr_key mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
INDEX idx_span_attr_value mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
INDEX idx_res_attr_key mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
INDEX idx_res_attr_value mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
INDEX idx_scope_attr_key mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
INDEX idx_scope_attr_value mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (service_name, span_name, toUnixTimestamp(timestamp), trace_id)
TTL toDateTime(timestamp) + toIntervalDay(180)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.004 sec.
Code example
package code

// given an auto-migrated struct like this
type Trace struct {
    Events []OtelTraceEvent `gorm:"type:Nested(timestamp DateTime64(9), name LowCardinality(String), attributes Map(LowCardinality(String), String));"` // order matters, don't touch nested def
    Links  []OtelTraceLink  `gorm:"type:Nested(trace_id String, span_id String, span_state String, attributes Map(LowCardinality(String), String));"`   // order matters, don't touch nested def
}

// and a non-migrated struct like this
type TraceIngestion struct {
    EventsTimestamp  []time.Time         `gorm:"type:DateTime64(9);column:events.timestamp"`
    EventsName       []string            `gorm:"type:LowCardinality(String);column:events.name"`
    EventsAttributes []map[string]string `gorm:"type:Map(LowCardinality(String), String);column:events.attributes"`
    LinksTraceID     []string            `gorm:"type:LowCardinality(String);column:links.trace_id"`
    LinksSpanID      []string            `gorm:"type:LowCardinality(String);column:links.span_id"`
    LinksState       []string            `gorm:"type:LowCardinality(String);column:links.span_state"`
    LinksAttributes  []map[string]string `gorm:"type:Map(LowCardinality(String), String);column:links.attributes"`
}
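The error below is raised when writing through the ingestion struct. A minimal, hypothetical sketch of that write path (the DSN, helper name, and error handling are illustrative, not taken from the project):

package code

import (
    "gorm.io/driver/clickhouse"
    "gorm.io/gorm"
)

// insertTraceIngestion shows how a row would be written via the non-migrated
// struct; gorm then emits an INSERT whose column list quotes each identifier
// segment separately, e.g. `events`.`timestamp` (see the INSERT SQL below).
func insertTraceIngestion(ti *TraceIngestion) error {
    db, err := gorm.Open(clickhouse.Open("clickhouse://127.0.0.1:9000/ctl_api"), &gorm.Config{})
    if err != nil {
        return err
    }
    return db.Table("otel_traces").Create(ti).Error
}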
Error log
block cannot be sorted - missing columns in requested order: [events.timestamp events.name events.attributes]"
Details
Environment
- [x] clickhouse-go version: 2.28.2
- [x] Interface: go-gorm/clickhouse
- [x] Go version: 1.22.0
- [x] Operating system: OS X
- [x] ClickHouse version: 24.9.1.628
- [x] Is it a ClickHouse Cloud? No
- [x] ClickHouse Server non-default settings, if any: -
- [x] CREATE TABLE statements for tables involved:
- [ ] Sample data for all these tables, use clickhouse-obfuscator if necessary
@fidiego thanks for reporting this. Could you (if possible) post an INSERT statement generated by go-gorm?
sure thing @jkaflik.
INSERT SQL
INSERT INTO
`otel_traces` (
`id`,
`created_by_id`,
`created_at`,
`updated_at`,
`deleted_at`,
`runner_id`,
`runner_job_id`,
`runner_job_execution_id`,
`timestamp`,
`timestamp_date`,
`timestamp_time`,
`resource_attributes`,
`resource_schema_url`,
`scope_name`,
`scope_version`,
`scope_attributes`,
`scope_dropped_attr_count`,
`scope_schema_url`,
`trace_id`,
`span_id`,
`parent_span_id`,
`trace_state`,
`span_name`,
`span_kind`,
`service_name`,
`span_attributes`,
`duration`,
`status_code`,
`status_message`,
`events`.`timestamp`,
`events`.`name`,
`events`.`attributes`
)
VALUES
(
'',
'',
'2024-09-09 18:49:12.765',
'2024-09-09 18:49:12.765',
0,
'runccccccccccccccccccccccc',
'',
'',
'2018-12-13 08:51:00',
'0000-00-00 00:00:00',
'0000-00-00 00:00:00',
{ 'job.id' :'jobidentfier',
'service.name': 'runner-runxxxxxxxxxxxxxxxxxxxxxxxxxxx' },
'',
'my.library',
'1.0.0',
{'my.scope.attribute':'some scope attribute'},
0,
'',
'5bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
'eee19b7ec3c1b174',
'eee19b7ec3c1b173',
'',
'I''m a server span',
'Server',
'runner-runxxxxxxxxxxxxxxxxxxxxxxxxxxx',
{'my.span.attr':'some value'},
1,
'Unset',
'',
(
'2018-12-13 14:51:01',
'2020-02-11 20:26:13',
'2020-02-11 20:26:13'
),
('ebento1', 'event-with-attr', 'event'),
(
[],
{ 'span-event-attr' :'span-event-attr-val' } ',
[]
)
)
hi @jkaflik, following up on this. if it is useful, i can set up a reproduction.
@fidiego sorry for the delayed response. Yes, please.
@jkaflik no worries.
I've set up a PoC here. https://github.com/fidiego/ch-go-poc
In my actual code I define a struct for the migration with `gorm:"type:Nested(...)"` (like exemplars in the PoC) and a second struct for reading with explicit columns (like events.* in the PoC), because it's simpler than writing scanner/valuer code. Reads work great, but creating/migrating with the dot-delimited columns raises an error.
The dot-delimited fields work as event.column but not as events.column.
hi @jkaflik, i hope all is well. i wanted to ask if there was anything you needed from my end or if there were any updates on this issue.
@fidiego, thanks for submitting a reproduction DDL. I will take a look this week. Apologies, my focus shifted to another project.
hi @jkaflik and @SpencerTorres, i just wanted to circle back on this. there is no urgency on my end but i am curious as to whether or not this is something that can be addressed in clickhouse-go.
Hey @fidiego!
I wonder if this can be fixed on the go-gorm side? Based on the SHOW CREATE TABLE syntax, the preferred syntax is
`events.timestamp`
Is there any way to make dialect-specific changes for this one node in the AST?
I've seen this issue before with this function where it is unable to determine the columns within the INSERT. Another solution we can consider is adding a context key where you can specifically provide the column names you want to insert. This way you're not blocked by what the regex parses. I understand this may not be what you want if you're using an ORM though.
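For illustration, a rough sketch of what such a context key could look like (purely hypothetical: `WithInsertColumns` and the lookup below are not part of clickhouse-go; the names and plumbing are invented for the example):

package clickhouse

import "context"

// insertColumnsKey is a hypothetical context key; clickhouse-go does not
// currently expose anything like this.
type insertColumnsKey struct{}

// WithInsertColumns would let a caller provide the INSERT column names
// explicitly instead of relying on the regex-based extraction.
func WithInsertColumns(ctx context.Context, columns ...string) context.Context {
    return context.WithValue(ctx, insertColumnsKey{}, columns)
}

// insertColumnsFromContext is what extractNormalizedInsertQueryAndColumns
// (or its caller) could consult before falling back to the regex.
func insertColumnsFromContext(ctx context.Context) ([]string, bool) {
    cols, ok := ctx.Value(insertColumnsKey{}).([]string)
    return cols, ok
}

A caller would then do something like ctx := WithInsertColumns(ctx, "events.timestamp", "events.name", "events.attributes") before preparing the batch.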
If it doesn't break backwards compatibility we could also just merge your proposed change. What are your thoughts?
> I've seen this issue before
Here's the related issue I was thinking of https://github.com/ClickHouse/clickhouse-go/issues/1485#issuecomment-2632413186 (see Further Actions section)
Further examples of potential fix in https://github.com/ClickHouse/clickhouse-go/issues/1485#issuecomment-2632549572
@SpencerTorres I'm not sure if it sounds good to you, but we could introduce clickhouse-sql-parser to parse the query instead of using a regex. I would be happy to submit a PR if that works for you.
@git-hulk I'm concerned about the memory usage and latency this would add just to extract the column names. I'm also cautious about adding new dependencies.
I think we have some other fixes in mind that I would like to try first before going that route. Thanks for the suggestion though, I might re-use that module for something else
> @git-hulk I'm concerned about the memory usage and latency this would add just to extract the column names. I'm also cautious about adding new dependencies.
> I think we have some other fixes in mind that I would like to try first before going that route. Thanks for the suggestion though, I might re-use that module for something else
@SpencerTorres Thanks for your reply.
I was facing a problem with quoted column names containing a space followed by an opening parenthesis, which was preventing me from doing an ETL job.
A sample query that could not be correctly parsed via the current regex implementation: INSERT INTO db.table (`ITEM`, `QTY (MT)`)
I have written a simple parser here.
The readme shows an example that identifies columns correctly.
The tests also have many examples.
It is 20% faster than the current regex-based solution. Let me know if you want me to create a PR which adds it to the clickhouse-go codebase.
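For illustration only, a minimal, hypothetical sketch of a quote-aware column splitter that tolerates identifiers such as `QTY (MT)` (this is not the linked parser, just a sketch of the idea):

package main

import (
    "fmt"
    "strings"
)

// splitColumns splits an INSERT column list on commas while respecting
// backtick- and double-quote-quoted identifiers; spaces, commas, and
// parentheses inside a quoted identifier are kept verbatim.
func splitColumns(list string) []string {
    var cols []string
    var cur strings.Builder
    var quote rune // 0 when outside a quoted identifier
    for _, r := range list {
        switch {
        case quote == 0 && (r == '`' || r == '"'):
            quote = r
            cur.WriteRune(r)
        case quote != 0 && r == quote:
            quote = 0
            cur.WriteRune(r)
        case quote == 0 && r == ',':
            cols = append(cols, strings.TrimSpace(cur.String()))
            cur.Reset()
        default:
            cur.WriteRune(r)
        }
    }
    if s := strings.TrimSpace(cur.String()); s != "" {
        cols = append(cols, s)
    }
    return cols
}

func main() {
    fmt.Println(splitColumns("`ITEM`, `QTY (MT)`")) // [`ITEM` `QTY (MT)`]
}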