clickhouse-go
Allow reusing a batch if append failed
Describe the bug
Steps to reproduce
- Prepare valid and bad data in any order
- Prepare batch
- Call AppendStruct in loop
- See logs
Expected behaviour
Currently it's not possible to tell which row is corrupted; even if only 1 row out of 10000 has an invalid value, it affects all subsequent data in the current batch.
There are 2 possible ways to give some flexibility (see the sketch after this list):
- when AppendStruct detects any problem with the current struct's data, it returns an error and doesn't append the corrupted values to the batch; the developer must decide what to do with that error on their own, but it should be possible to continue and skip those rows
- when AppendStruct detects any problem with the current struct's data, it returns an error and appends the row as it does now, but any subsequent calls of AppendStruct with valid data will succeed
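For illustration, here is a sketch of how option 1 could look from the caller's side, assuming a hypothetical future AppendStruct that rejects a bad row without invalidating the batch (the function name appendSkippingBad is mine; driver.Batch is the existing interface):

import (
	"log"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

type row struct {
	Col1 string
	Col2 time.Time
}

// appendSkippingBad assumes AppendStruct rejects a bad row but leaves the
// batch usable: the caller logs the error, skips the row, and keeps appending.
func appendSkippingBad(batch driver.Batch, data []row) error {
	for i := range data {
		if err := batch.AppendStruct(&data[i]); err != nil {
			log.Printf("skipping row %d: %v", i, err) // row was not appended
			continue
		}
	}
	return batch.Send() // sends only the rows that appended cleanly
}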
Code example
func AppendStructWithBadData() error {
	conn, err := GetNativeConnection(nil, nil, nil)
	if err != nil {
		return err
	}
	ctx := context.Background()
	defer func() {
		_ = conn.Exec(ctx, "DROP TABLE example")
	}()
	if err := conn.Exec(ctx, `DROP TABLE IF EXISTS example`); err != nil {
		return err
	}
	if err := conn.Exec(ctx, `
		CREATE TABLE example (
			Col1 String,
			Col2 DateTime
		) Engine = Memory
	`); err != nil {
		return err
	}
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO example")
	if err != nil {
		return err
	}
	data := []struct {
		Col1 string
		Col2 time.Time
	}{
		{
			Col1: "valid data", // no error
			Col2: time.Now(),
		},
		{
			Col1: "bad data", // error=clickhouse: dateTime overflow. Col2 must be between 1970-01-01 00:00:00 and 2105-12-31 23:59:59
			Col2: time.Time{},
		},
		{
			Col1: "valid data", // error=clickhouse: dateTime overflow. Col2 must be between 1970-01-01 00:00:00 and 2105-12-31 23:59:59: clickhouse: batch is invalid. check appended data is correct
			Col2: time.Now(),
		},
	}
	for i, r := range data {
		err := batch.AppendStruct(&r)
		if err != nil {
			fmt.Printf("AppendStruct failed: index=%d, error=%+v\n", i, err.Error())
		} else {
			fmt.Printf("AppendStruct succeeded: index=%d\n", i)
		}
	}
	fmt.Printf("send batch: rows=%d\n", batch.Rows())
	return batch.Send()
}
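Until something like that is supported, the only safe approach seems to be validating values client-side before appending, so the batch never gets poisoned. A minimal sketch extending the example above, with the DateTime bounds taken from the driver's error message (the helper name validDateTime is mine):

// DateTime bounds copied from the overflow error message above.
var (
	dateTimeMin = time.Date(1970, 1, 1, 0, 0, 0, 0, time.UTC)
	dateTimeMax = time.Date(2105, 12, 31, 23, 59, 59, 0, time.UTC)
)

func validDateTime(t time.Time) bool {
	return !t.Before(dateTimeMin) && !t.After(dateTimeMax)
}

With that, the loop can skip a row before it ever reaches the batch:

for i, r := range data {
	if !validDateTime(r.Col2) {
		fmt.Printf("skipping row %d: Col2 out of DateTime range\n", i)
		continue
	}
	if err := batch.AppendStruct(&r); err != nil {
		return err
	}
}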
Error log
AppendStruct succeeded: index=0
AppendStruct failed: index=1, error=clickhouse: dateTime overflow. Col2 must be between 1970-01-01 00:00:00 and 2105-12-31 23:59:59
AppendStruct failed: index=2, error=clickhouse: dateTime overflow. Col2 must be between 1970-01-01 00:00:00 and 2105-12-31 23:59:59: clickhouse: batch is invalid. check appended data is correct
send batch: rows=2
Error: Received unexpected error:
clickhouse: batch is invalid. check appended data is correct
clickhouse: dateTime overflow. Col2 must be between 1970-01-01 00:00:00 and 2105-12-31 23:59:59
Configuration
Environment
- Client version:
- Language version:
- OS:
- Interface: ClickHouse API / database/sql compatible driver
ClickHouse server
- ClickHouse Server version:
- ClickHouse Server non-default settings, if any:
- CREATE TABLE statements for tables involved:
- Sample data for all these tables, use clickhouse-obfuscator if necessary
Hi @JILeXanDR
This is the current behavior of batch insertion. On the very first data append error, the connection is released, and all subsequent append calls will return the previous error.
This error is intentionally wrapped with https://github.com/ClickHouse/clickhouse-go/blob/51cea28b90940b3887266de20b28df6b0e4512ea/clickhouse.go#L46
I agree it might be counter-intuitive, but I also don't see a good solution here. There might be client-side data validation where we can recover (like in this case); however, we cannot recover from errors sent from ClickHouse.
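If the wrap keeps the error chain intact, callers can at least tell the original bad-row error apart from the follow-up errors. A sketch, assuming the sentinel defined at the linked line is the exported clickhouse.ErrBatchInvalid:

import (
	"errors"

	"github.com/ClickHouse/clickhouse-go/v2"
)

// isPoisonedBatch reports whether err is the follow-up "batch is invalid"
// error rather than the row that actually failed validation. Assumes the
// wrapping preserves the chain for errors.Is.
func isPoisonedBatch(err error) bool {
	return errors.Is(err, clickhouse.ErrBatchInvalid)
}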
> however, we cannot recover from errors sent from ClickHouse.
But any error from Append is local rather than from ClickHouse, right?
@yujiarista @JILeXanDR at the end of the day, I agree this should not break the batch. This requires an enhancement.
@yujiarista @JILeXanDR I had a look at this today and found a discussion (I forgot about it 🤦) we already had in the past on this: https://github.com/ClickHouse/clickhouse-go/issues/655
tl;dr: given the columnar append, it's not trivial to guarantee a reusable batch without data corruption. I still agree it needs to be enhanced, but not sooner than v3.
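To illustrate the problem (a toy model, not the library's internals): with columnar append, each column buffer grows independently, so a row that fails partway through leaves the earlier columns one value longer than the rest, and every earlier column would have to be rolled back to keep the batch usable.

package main

import (
	"fmt"
	"time"
)

// toyBatch mimics a two-column columnar buffer.
type toyBatch struct {
	col1 []string
	col2 []time.Time
}

func (b *toyBatch) appendRow(s string, t time.Time) error {
	b.col1 = append(b.col1, s) // col1 grows first
	if t.Before(time.Unix(0, 0)) {
		// col2 rejects the value, but col1 already holds this row's string;
		// the columns are now different lengths and the batch is corrupt
		// unless col1 is rolled back.
		return fmt.Errorf("dateTime overflow")
	}
	b.col2 = append(b.col2, t)
	return nil
}

func main() {
	b := &toyBatch{}
	_ = b.appendRow("bad row", time.Time{}) // zero time is before the epoch
	fmt.Println(len(b.col1), len(b.col2))   // prints "1 0": columns out of sync
}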