clickhouse-kafka-connect

fix 'overlapping' logic

Open ne1r0n opened this issue 1 year ago • 9 comments

Summary

  1. Excluded excess data processing when state = `BEFORE_PROCESSING`
  2. Fixed the order of chunks in overlapping range cases
  3. Prevented insertion of the 1st chunk in the overlapping range case when state = `BEFORE_PROCESSING` and the 1st chunk's boundaries are not the same as the state's
  4. Fixed and extended different range overlapping cases, taking into account that records are trimmed before processing, for example:
State                 Actual                 Current                 After fix
[2, 10]               [0, 11]                ZERO                    OVER_LAPPING
[2, 10]               [0, 10]                ZERO                    SAME
[2, 10]               [1, 11]                ERROR                   OVER_LAPPING
[2, 10]               [1, 10]                ERROR                   SAME
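
For illustration, here is a minimal sketch of the classification described in the table above, assuming records below the state's minOffset are trimmed before the comparison; the class, enum, and method names are illustrative only and are not the connector's actual code:

```java
// Hypothetical sketch of the range classification from the table above.
final class RangeOverlapSketch {
    enum Overlap { ZERO, SAME, OVER_LAPPING, ERROR }

    static Overlap classify(long stateMin, long stateMax, long actualMin, long actualMax) {
        // Records below the state's minOffset are dropped before processing,
        // so an actual range starting earlier is effectively clamped to stateMin.
        long effectiveMin = Math.max(actualMin, stateMin);
        if (effectiveMin == stateMin && actualMax == stateMax) {
            return Overlap.SAME;         // e.g. state [2, 10], actual [0, 10] or [1, 10]
        }
        if (effectiveMin == stateMin && actualMax > stateMax) {
            return Overlap.OVER_LAPPING; // e.g. state [2, 10], actual [0, 11] or [1, 11]
        }
        return Overlap.ERROR;            // remaining cases are omitted from this sketch
    }

    public static void main(String[] args) {
        System.out.println(classify(2, 10, 0, 11)); // OVER_LAPPING (was ZERO before the fix)
        System.out.println(classify(2, 10, 1, 10)); // SAME (was ERROR before the fix)
    }
}
```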

Checklist

Delete items not relevant to your PR:

  • [x] Unit and integration tests covering the common scenarios were added
  • [x] A human-readable description of the changes was provided to include in CHANGELOG
  • [ ] For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

ne1r0n avatar Dec 28 '23 08:12 ne1r0n

@ne1r0n can you provide a test with content that covers this?

mzitnik avatar Jan 03 '24 18:01 mzitnik

@ne1r0n can you provide a test with content that covers this?

@mzitnik it would require a lot of effort from my side, as I'm not a professional Java programmer.

All I am trying to fix are the causes of missed messages when using the connector in a production-like environment.

For example, in rare cases there could be issues with committing offsets, something like: org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.

After that failure the consumer could read more messages from Kafka than before, so the batch of data would contain a part that was already inserted into ClickHouse and a new part that wasn't. Currently the connector will skip the whole batch. According to the design document, this is either the overlapping case or a combination of the contains and new cases, where the new part should be inserted.

It makes sense to implement an integration test that covers the above-mentioned scenario.

Another test could cover splitRecordsByOffset, checking the order of the parts it returns.
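
As a rough illustration of the property such a test could assert, here is a minimal sketch; the real splitRecordsByOffset in the connector may have a different signature, so the helper below is only a hypothetical stand-in working on bare offsets:

```java
import java.util.List;
import java.util.stream.Collectors;

final class SplitOrderSketch {
    // Hypothetical stand-in for splitRecordsByOffset: split offsets at 'boundary' (inclusive),
    // returning the part at or below the boundary first and the part above it second.
    static List<List<Long>> splitByOffset(List<Long> offsets, long boundary) {
        List<Long> atOrBelow = offsets.stream().filter(o -> o <= boundary).collect(Collectors.toList());
        List<Long> above = offsets.stream().filter(o -> o > boundary).collect(Collectors.toList());
        return List.of(atOrBelow, above);
    }

    public static void main(String[] args) {
        List<List<Long>> parts = splitByOffset(List.of(350L, 599L, 600L, 999L), 599L);
        // The ordering such a test would assert: the overlapping part first, the new part second.
        System.out.println(parts.get(0)); // [350, 599]
        System.out.println(parts.get(1)); // [600, 999]
    }
}
```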

ne1r0n avatar Jan 04 '24 08:01 ne1r0n

@ne1r0n Thanks for your contribution. We will discuss it internally since I want a test that covers it.

mzitnik avatar Jan 04 '24 08:01 mzitnik

I've added 2 tests for the cases this PR is trying to fix; please have a look.

ne1r0n avatar Feb 11 '24 18:02 ne1r0n

@mzitnik, @Paultagoras I will be grateful for any updates or comments

ne1r0n avatar Feb 15 '24 16:02 ne1r0n

@mzitnik, @Paultagoras I will be grateful for any updates or comments

Hi @ne1r0n! So I was looking at this again and re-reading your earlier comment:

For example, in rare cases there could be issues with committing offsets, something like: org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.

After that failure the consumer could read more messages from Kafka than before, so the batch of data would contain a part that was already inserted into ClickHouse and a new part that wasn't. Currently the connector will skip the whole batch. According to the design document, this is either the overlapping case or a combination of the contains and new cases, where the new part should be inserted.

So we actually solved a deduplication bug around this (or a very similar) issue. Imagine the following:

  1. Connector tries to insert batch with minOffset=123 and maxOffset=456
  2. Some issue happens (like offset failed to commit)
  3. Connector tries to insert batch with minOffset=123 but maxOffset=789 (so it overlaps with more data)

The old bug would drop the values after 456 (so 457-789). Was that the sort of issue you were seeing?
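
For illustration, a tiny worked sketch of that scenario, under the assumption that the retried batch should be split at the previously stored maxOffset so that only the tail (457 to 789) is treated as new; the names below are illustrative and not the connector's API:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

final class OverlapTailSketch {
    public static void main(String[] args) {
        long storedMaxOffset = 456L;  // state left by the first attempt [123, 456]
        List<Long> retriedBatch = LongStream.rangeClosed(123, 789).boxed().collect(Collectors.toList());

        // Only offsets above the stored maxOffset are genuinely new and must not be dropped.
        List<Long> newTail = retriedBatch.stream()
                .filter(o -> o > storedMaxOffset)
                .collect(Collectors.toList());

        System.out.println(newTail.get(0));                  // 457
        System.out.println(newTail.get(newTail.size() - 1)); // 789
    }
}
```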

Paultagoras avatar Feb 20 '24 16:02 Paultagoras

Hi @Paultagoras, currently the part shown in the attached image will be lost.

ne1r0n avatar Feb 20 '24 17:02 ne1r0n

@ne1r0n So I think I saw what you were mentioning (at least partly) - we just merged a fix for an overlap issue. Mind checking whether that solves the issue you were seeing?

Paultagoras avatar Mar 04 '24 01:03 Paultagoras

@Paultagoras yep, I saw it, thanks. It solved part of the problem.

Now let's imagine a situation like the one in the ProcessPartialOverlappingBeforeProcessingTest test. We have 2 batches with overlapping ranges and a BEFORE_PROCESSING state after the 1st insert: batch 1 covers offsets 0 to 599 and batch 2 covers offsets 350 to 999.

Connector will do the following:

  1. insert the 1st batch of 600 records
  2. split the 2nd batch into two parts: 350 to 599 and 600 to 999
  3. insert the 1st part (350 to 599) in the BEFORE_PROCESSING case
  4. insert the 2nd part (600 to 999) in the BEFORE_PROCESSING case
  5. insert the 2nd part (600 to 999) again in the AFTER_PROCESSING case

There are 2 issues here. Step 3 corresponds to the contains case from the design document and should be processed accordingly. Step 5 is an extra insert of the second part; deduplication_token will fix that, but I think there should be a `break` before `case AFTER_PROCESSING:` (see the sketch below).
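
A minimal, self-contained sketch of the fall-through concern, assuming a switch over the processing state; it is not the connector's actual code, and State, insert(), and the chunk names are illustrative:

```java
final class FallThroughSketch {
    enum State { BEFORE_PROCESSING, AFTER_PROCESSING }

    static void insert(String chunk) {
        System.out.println("inserting " + chunk);
    }

    static void process(State state, String newPart) {
        switch (state) {
            case BEFORE_PROCESSING:
                insert(newPart); // step 4: insert the new part (600 to 999)
                break;           // <- the suggested break; without it execution falls through
            case AFTER_PROCESSING:
                insert(newPart); // step 5: the extra insert that deduplication_token has to absorb
                break;
        }
    }

    public static void main(String[] args) {
        process(State.BEFORE_PROCESSING, "offsets 600..999"); // with the break: printed once
    }
}
```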

ne1r0n avatar Mar 04 '24 08:03 ne1r0n

Thanks for submitting this! After re-reviewing it, I think what you said makes sense - especially with the extra insert happening 😅

Paultagoras avatar May 13 '24 15:05 Paultagoras