clickhouse-go icon indicating copy to clipboard operation
clickhouse-go copied to clipboard

Establish batch semantics

Open gingerwizard opened this issue 2 years ago • 4 comments

Some challenges with batches:

  1. For columnar insertion we provide Append methods on columns e.g. https://github.com/ClickHouse/clickhouse-go/blob/v2/examples/native/write-columnar/main.go.

    For some types, Append can fail e.g. when parsing Strings as IP addresses. Currently, in the event of failure we have inconsistent semantics. Consider:

    • ips - either all values are inserted or nothing. i.e. https://github.com/ClickHouse/clickhouse-go/blob/v2/lib/column/ipv4.go#L100-L110
    • bigints (int128,int256) - the batch can be partially modified in the event of failure i.e. https://github.com/ClickHouse/clickhouse-go/blob/v2/lib/column/bigint.go#L79-L94

    The former is faster - it saves iterations, at the cost of trickier error recovery for the user.

  2. Append to a batch can fail e.g. the enum doesn't map https://github.com/ClickHouse/clickhouse-go/issues/703. This causes a release of the connection and means subsequent use of the batch fails with a panic.

We should decide on the semantics, document them and be consistent.

@ernado @genzgd

gingerwizard avatar Jul 08 '22 15:07 gingerwizard

Do I understand correctly the problem might happen when transforming data for the insert action?

IMO we should align with the ClickHouse default logic for such cases. ClickHouse implements fail-fast pattern, even though it's configurable. From input_format_allow_errors_num docs:

Sets the maximum number of acceptable errors when reading from text formats (CSV, TSV, etc.). The default value is 0.

I'd suggest avoiding the complexity of making this configurable (at least for now) and throwing an exception if a value cannot be parsed. In this case, a user can be informed about the problem and fix the data source.

mshustov avatar Jul 18 '22 16:07 mshustov

fail-fast invariably will slow down inserts - the client will need to parse the data twice. We could make this a switch to disable?

gingerwizard avatar Jul 18 '22 17:07 gingerwizard

IMO just failing whole batch is ok

ernado avatar Jul 19 '22 14:07 ernado

@ernado @mshustov note i've added issue (2).Shoud failure to append to a batch invalidate the batch?

gingerwizard avatar Jul 29 '22 12:07 gingerwizard

This occurs outside of columnar api appends and standard appends. Currently if we make an invalid append to a batch it means it can't be sent - even if previous rows were successful. We had a bug re this - see https://github.com/ClickHouse/clickhouse-go/issues/798

Noting we considered:

One proposal is that the Append of invalid data should fail but the batch should remain valid i.e. subsequent Send() should succeed. This is trickier as columns are added one by one. If the error occurs due to a column other than the first, we have to undo the previous changes. I suspect this isn't trivial as it will mean reversing the changes to the buffer. We could check to see all values are valid beforehand - this will incur an overhead though.

gingerwizard avatar Nov 23 '22 11:11 gingerwizard

This needs to be resolved and made consistent @mshustov Let discuss

gingerwizard avatar Nov 23 '22 11:11 gingerwizard

Giving my thoughts also on (2) (appending to batch might fail);

My preference would be to just invalidate the entire batch and trying to Send() will return an error. Reason is that, appending data and then having it fail, could basically only happen when types are incorrect, right? Which makes it a 'programmer' error. And if it fails for one record, then very likely for all records. So sending an empty batch is then useless anyhow. As developer I'd much rather see it fail completely and fix it, than some partial behaviour. Especially if adding some records would succeed for whatever reason. Then the result of what is stored in ClickHouse is going to be highly nondeterministic.

mdonkers avatar Nov 23 '22 11:11 mdonkers

Agree with @mshustov:

IMO we should align with the ClickHouse default logic for such cases. ClickHouse implements fail-fast pattern, even though it's configurable. From input_format_allow_errors_num docs:

fail-fast is the right approach here, but I don't want to go way too far and introduce a configurable threshold. Will submit a PR with a test that validates against behavior.

jkaflik avatar Dec 16 '22 21:12 jkaflik