vector
vector copied to clipboard
Default Vector timestamp output format is incompatible with default Clickhouse settings
Vector Version
vector 0.11.1 (v0.11.1 x86_64-unknown-linux-musl 2020-12-17)
Vector Configuration File
[sinks.haproxy_ch]
type = "clickhouse"
inputs = ["sortout.haproxy_access"]
endpoint = "http://127.0.0.1:8123"
database = "logs"
table = "haproxy"
encoding.only_fields = [ "timestamp", "..."]
Field timestamp previously coerced to Vector timestamp
[transforms.json_coercer]
type = "coercer"
inputs = ["json_parser"]
types."time" = "timestamp|%e/%B/%Y:%T %z"
Column timestamp in Clickhouse is DateTime
CREATE TABLE logs.haproxy
(
`timestamp` DateTime,
...
Expected Behavior
Default Vector output timestamp format is expected to be compatible with default Clickhouse timestamp input format.
Actual Behavior
It is not. Clickhouse' DateTime parsing is determined by date_time_input_format setting, which allows two options:
basic— as documentation says, "ClickHouse can parse only the basic YYYY-MM-DD HH:MM:SS or YYYY-MM-DD format". This is the default behavior.best_effort— can also parse ISO 8601 formats.
Vector' Clickhouse sink have option encoding.timestamp_format, which also may accept two options: rfc3339 (the default) or unix.
When using default Vector Clickhouse sink with default encoding.timestamp_format together with Clickhouse with default date_time_input_format, Clickhouse rejects data with error
Code: 27, e.displayText() = DB::Exception: Cannot parse input: expected '\"' before: 'Z\"}\\n': (while read the value of key timestamp): (at row 1)\n`
Proposed solutions
Setting Vector to use Unix timestamps in Clickhouse sink solves problem for me:
encoding.timestamp_format = "unix"
However, I think this must be at least noted in the documentation.
Also, maybe it is worth to consider changing default value encoding.timestamp_format to unix, to get the correct behavior "out of the box".
Adding new timestamp_format option basic may also be an option, however, this cannot be the optimal default value IMHO, because timezone interpretation by Clickhouse may depend on column setting or, if none set, from system timezone, which is may be useful only when selected explicitly, i.e. user knows what she/he is doing.
References
https://github.com/timberio/vector/pull/1634
I'm using the following workaround in VRL to format my timestamp in the default way ClickHouse expects:
.ts = format_timestamp(.ts, "%F %T%.9f") ?? .ts
I read in a ClickHouse issue that the basic date format is faster than the best_effort setting. I can't find the issue anymore, but the best effort path tries parsing a bunch of ways so it make sense.