clickhouse-go icon indicating copy to clipboard operation
clickhouse-go copied to clipboard

bug: `extractNormalizedInsertQueryAndColumns` does not remove backticks inside a dot-delimited column name

Open fidiego opened this issue 1 year ago • 14 comments

Observed

extractNormalizedInsertQueryAndColumns cleans up leading and trailing backticks in columns such as "events.attributes" but leaves the internal backticks untouched leading to a block cannot be sorted error due to the mismatch.

This affects Nested type columns used via go-gorm.

Expected behaviour

Ideally, extractNormalizedInsertQueryAndColumns should handle this case.

Working `extractNormalizedInsertQueryColumns`

func extractNormalizedInsertQueryAndColumns(query string) (normalizedQuery string, tableName string, columns []string, err error) {
	query = truncateFormat.ReplaceAllString(query, "")
	query = truncateValues.ReplaceAllString(query, "")

	matches := normalizeInsertQueryMatch.FindStringSubmatch(query)
	if len(matches) == 0 {
		err = errors.Errorf("invalid INSERT query: %s", query)
		return
	}

	normalizedQuery = fmt.Sprintf("%s FORMAT Native", matches[1])
	tableName = strings.TrimSpace(matches[2])

	columns = make([]string, 0)
	matches = extractInsertColumnsMatch.FindStringSubmatch(matches[1])
	if len(matches) == 2 {
		columns = strings.Split(matches[1], ",")
		for i := range columns {
			// refers to https://clickhouse.com/docs/en/sql-reference/syntax#identifiers
			// we can use identifiers with double quotes or backticks, for example: "id", `id`, but not both, like `"id"`.
			col := columns[i]
			col = strings.TrimSpace(columns[i]) // trim lead/trail -in whitespace
			col = strings.Trim(col, "\"")       // trim lead/trail -ing double quote
			col = strings.Trim(col, "`")        // trim lead/trial -ing tilde
			// for columns like events.attributes that are otherwise represented like this: events`.`attributes
			col = strings.ReplaceAll(col, "`.`", ".") // replace internal tilde delimited periods w/ plain period
			columns[i] = col
		}
	}

	return
}

`show create traces`

SHOW CREATE TABLE ctl_api.otel_traces

Query id: 69b8f8cf-71c2-4967-8cba-d80945a4be78

   ┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │ CREATE TABLE ctl_api.otel_traces
(
    `id` String,
    `created_by_id` String,
    `created_at` DateTime64(3),
    `updated_at` DateTime64(3),
    `deleted_at` UInt64,
    `runner_id` String,
    `runner_job_id` String,
    `runner_job_execution_id` String,
    `timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `timestamp_date` Date DEFAULT toDate(timestamp),
    `timestamp_time` DateTime DEFAULT toDateTime(timestamp),
    `resource_attributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `resource_schema_url` LowCardinality(String) CODEC(ZSTD(1)),
    `scope_name` String CODEC(ZSTD(1)),
    `scope_version` LowCardinality(String) CODEC(ZSTD(1)),
    `scope_attributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `scope_dropped_attr_count` Int64,
    `scope_schema_url` LowCardinality(String) CODEC(ZSTD(1)),
    `trace_id` String CODEC(ZSTD(1)),
    `span_id` String CODEC(ZSTD(1)),
    `parent_span_id` String CODEC(ZSTD(1)),
    `trace_state` String CODEC(ZSTD(1)),
    `span_name` LowCardinality(String) CODEC(ZSTD(1)),
    `span_kind` LowCardinality(String) CODEC(ZSTD(1)),
    `service_name` LowCardinality(String) CODEC(ZSTD(1)),
    `span_attributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    `duration` Int64 CODEC(ZSTD(1)),
    `status_code` LowCardinality(String) CODEC(ZSTD(1)),
    `status_message` String CODEC(ZSTD(1)),
    `events.timestamp` Array(DateTime64(9)),
    `events.name` Array(LowCardinality(String)),
    `events.attributes` Array(Map(LowCardinality(String), String)),
    `links.trace_id` Array(String),
    `links.span_id` Array(String),
    `links.span_state` Array(String),
    `links.attributes` Array(Map(LowCardinality(String), String)),
    INDEX idx_trace_id trace_id TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_span_attr_key mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
    INDEX idx_span_attr_value mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
    INDEX idx_res_attr_key mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
    INDEX idx_res_attr_value mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
    INDEX idx_scope_attr_key mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1,
    INDEX idx_scope_attr_value mapKeys(resource_attributes) TYPE bloom_filter(0.1) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (service_name, span_name, toUnixTimestamp(timestamp), trace_id)
TTL toDateTime(timestamp) + toIntervalDay(180)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1 │
   └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.004 sec.

Code example

package code

// given an auto-migrated struct like this

type Trace struct {
	Events         []OtelTraceEvent  `gorm:"type:Nested(timestamp DateTime64(9), name LowCardinality(String), attributes Map(LowCardinality(String), String));"` // order matters, dont't touch nested def
	Links          []OtelTraceLink   `gorm:"type:Nested(trace_id String, span_id String, span_state String, attributes Map(LowCardinality(String), String));"`   // order matters, dont't touch nested def
}

// and a non-migrated struct like this
type TraceIngestion struct {
	EventsTimestamp  []time.Time         `gorm:"type:DateTime64(9);column:events.timestamp"`
	EventsName       []string            `gorm:"type:LowCardinality(String);column:events.name"`
	EventsAttributes []map[string]string `gorm:"type:Map(LowCardinality(String), String);column:events.attributes"`
	LinksTraceID     []string            `gorm:"type:LowCardinality(String);column:links.trace_id"`
	LinksSpanID      []string            `gorm:"type:LowCardinality(String);column:links.span_id"`
	LinksState       []string            `gorm:"type:LowCardinality(String);column:links.span_state"`
	LinksAttributes  []map[string]string `gorm:"type:Map(LowCardinality(String), String);column:links.attributes"`
}

Error log

block cannot be sorted - missing columns in requested order: [events.timestamp events.name events.attributes]"

Details

Environment

  • [x] clickhouse-go version: 2.28.2
  • [x] Interface: go-gorm/clickhouse
  • [x] Go version: 1.22.0
  • [x] Operating system: OS X
  • [x] ClickHouse version: 24.9.1.628
  • [x] Is it a ClickHouse Cloud? No
  • [x] ClickHouse Server non-default settings, if any: -
  • [x] CREATE TABLE statements for tables involved:
  • [ ] Sample data for all these tables, use clickhouse-obfuscator if necessary

fidiego avatar Sep 10 '24 19:09 fidiego

@fidiego thanks for reporting this. Could you (if possible) post an INSERT statement generated by go-gorm?

jkaflik avatar Sep 12 '24 06:09 jkaflik

sure thing @jkaflik.

INSERT SQL

INSERT INTO
    `otel_traces` (
        `id`,
        `created_by_id`,
        `created_at`,
        `updated_at`,
        `deleted_at`,
        `runner_id`,
        `runner_job_id`,
        `runner_job_execution_id`,
        `timestamp`,
        `timestamp_date`,
        `timestamp_time`,
        `resource_attributes`,
        `resource_schema_url`,
        `scope_name`,
        `scope_version`,
        `scope_attributes`,
        `scope_dropped_attr_count`,
        `scope_schema_url`,
        `trace_id`,
        `span_id`,
        `parent_span_id`,
        `trace_state`,
        `span_name`,
        `span_kind`,
        `service_name`,
        `span_attributes`,
        `duration`,
        `status_code`,
        `status_message`,
        `events`.`timestamp`,
        `events`.`name`,
        `events`.`attributes`
    )
VALUES
    (
        '',
        '',
        '2024-09-09 18:49:12.765',
        '2024-09-09 18:49:12.765',
        0,
        'runccccccccccccccccccccccc',
        '',
        '',
        '2018-12-13 08:51:00',
        '0000-00-00 00:00:00',
        '0000-00-00 00:00:00',
        { 'job.id' :'jobidentfier',
        'service.name': 'runner-runxxxxxxxxxxxxxxxxxxxxxxxxxxx' },
        '',
        'my.library',
        '1.0.0',
        {'my.scope.attribute':'some scope attribute'},
        0,
        '',
        '5bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
        'eee19b7ec3c1b174',
        'eee19b7ec3c1b173',
        '',
        'I''m a server span',
        'Server',
        'runner-runxxxxxxxxxxxxxxxxxxxxxxxxxxx',
        {'my.span.attr':'some value'},
        1,
        'Unset',
        '',
        (
            '2018-12-13 14:51:01',
            '2020-02-11 20:26:13',
            '2020-02-11 20:26:13'
        ),
        ('ebento1', 'event-with-attr', 'event'),
        (
            [],
            { 'span-event-attr' :'span-event-attr-val' } ',
            []
        )
    )

fidiego avatar Sep 12 '24 15:09 fidiego

hi @jkaflik, following up on this. if it is useful, i can set up a reproduction.

fidiego avatar Sep 20 '24 14:09 fidiego

@fidiego sorry for delayed response. Yes, please.

jkaflik avatar Sep 27 '24 07:09 jkaflik

@jkaflik no worries.

I've set up a PoC here. https://github.com/fidiego/ch-go-poc

In my actual code I define a struct for the migration w/ gorm:"type:Nested(...) (like exemplars in the poc) and a second struct for reading with explicit columns (like events.* in the poc). Because it's simpler than writing scanner/value code. Read works great, but creating/migrating w/ the dot-delimited columns raises an error.

The dot-delimited fields work as event.column but not as events.column.

fidiego avatar Oct 01 '24 03:10 fidiego

hi @jkaflik, i hope all is well. i wanted to ask if there was anything you needed from my end or if there were any updates on this issue.

fidiego avatar Oct 09 '24 16:10 fidiego

@fidiego, thanks for submitting a reproduction DDL. I will take a look this week. Apologies, my focus was shift on another project.

jkaflik avatar Oct 15 '24 08:10 jkaflik

hi @jkaflik and @SpencerTorres, i just wanted to circle back on this. there is no urgency on my end but i am curious as to whether or not this is something that can be addressed in clickhouse-go.

fidiego avatar Dec 11 '24 15:12 fidiego

Hey @fidiego!

I wonder if this can be fixed on the go-gorm side? Based on the SHOW CREATE TABLE syntax, the preferred snytax is

`events.timestamp`

Is there any way to make dialect-specific changes for this one node in the AST?

I've seen this issue before with this function where it is unable to determine the columns within the INSERT. Another solution we can consider is adding a context key where you can specifically provide the column names you want to insert. This way you're not blocked by what the regex parses. I understand this may not be what you want if you're using an ORM though.

If it doesn't break backwards compatibility we could also just merge your proposed change. What are your thoughts?

SpencerTorres avatar Mar 04 '25 16:03 SpencerTorres

I've seen this issue before

Here's the related issue I was thinking of https://github.com/ClickHouse/clickhouse-go/issues/1485#issuecomment-2632413186 (see Further Actions section)

Further examples of potential fix in https://github.com/ClickHouse/clickhouse-go/issues/1485#issuecomment-2632549572

SpencerTorres avatar Mar 08 '25 01:03 SpencerTorres

@SpencerTorres I'm not sure if it sounds good to introduce clickhouse-sql-parser to parse the query instead of using regex. I would be happy to submit a PR if this sounds good to you.

git-hulk avatar Apr 23 '25 11:04 git-hulk

@git-hulk I'm concerned about the memory usage and latency this would add just to extract the column names. I'm also cautious about adding new dependencies.

I think we have some other fixes in mind that I would like to try first before going that route. Thanks for the suggestion though, I might re-use that module for something else

SpencerTorres avatar May 05 '25 15:05 SpencerTorres

@git-hulk I'm concerned about the memory usage and latency this would add just to extract the column names. I'm also cautious about adding new dependencies.

I think we have some other fixes in mind that I would like to try first before going that route. Thanks for the suggestion though, I might re-use that module for something else

@SpencerTorres Thanks for your reply.

git-hulk avatar May 06 '25 01:05 git-hulk

I was facing a problem with quoted column names containing a space followed by an opening parenthesis, which was preventing me from doing an ETL job. A sample query that could not be correctly parsed via the current regex implementation: INSERT INTO db.table (`ITEM`, `QTY (MT)`) I have written a simple parser here .
The readme shows an example that identifies columns correctly. The tests also have many examples. It is 20% faster than the current regex based solution. Let me know if you want me to create a PR which adds it to the clickhouse-go codebase.

moong-dal avatar Jun 03 '25 06:06 moong-dal