
Deleting and recreating a topic for an index results in retries on a non-recoverable scenario

Open · kstaken opened this issue 2 years ago • 7 comments

Describe the bug
If you delete and recreate the topic backing a Kafka source after reading data through it, the server goes into a non-recoverable retry loop due to missing offsets.

2022-06-02T21:21:54.244Z ERROR {actor=quickwit_indexing::actors::indexing_service::IndexingService}:{msg_id=2}::{msg_id=148}: quickwit_indexing::actors::indexing_pipeline: Error while spawning indexing pipeline, retrying after some time. error=Failed to create source `quickwit-kafka-test` of type `kafka`. Cause: Last checkpointed offset `99999` is greater or equal to high watermark `0`.

Caused by:
    Last checkpointed offset `99999` is greater or equal to high watermark `0`. retry_count=5 retry_delay=64s

If you then keep using this topic and push more data through it, I would expect ingestion might start working again after reaching 100,000 records, but it would result in the loss of the first 100,000 records since offsets restart from 0.

Steps to reproduce (if applicable):

  1. Set up a Kafka source.
  2. Push some data through it so you have committed offsets.
  3. Delete the topic.
  4. Recreate the topic.
  5. Restart the server; the indexing pipeline fails with a continually retrying error.

Expected behavior
This isn't really a recoverable scenario without offset-reset logic like the Kafka high-level consumer uses. In this case the correct scenario is likely to detect the missing offset and reset to the low watermark for the topic.

You're also going to have a similar problem if a topic has retention configured and the data expires before Quickwit reads it. The low watermark on the topic in that case will be higher than the committed offset and the offset will need to be moved forward.
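For concreteness, here is a minimal sketch (illustrative only, not Quickwit's actual code) of the two situations above: a checkpointed offset at or past the high watermark, which triggers the retry loop, and a checkpointed offset below the low watermark, where retention has expired the data and the offset needs to move forward.

```rust
/// Illustrative only: classify a stored checkpoint offset against the topic
/// partition's current [low_watermark, high_watermark) bounds.
enum CheckpointStatus {
    /// The offset is within bounds: resume normally after it.
    Valid,
    /// The offset is at or past the high watermark (e.g. the topic was
    /// deleted and recreated): the error reported above.
    BeyondHighWatermark,
    /// The offset is below the low watermark (e.g. retention expired the
    /// data): resuming requires skipping forward to the low watermark.
    BelowLowWatermark,
}

fn classify_checkpoint(offset: i64, low_watermark: i64, high_watermark: i64) -> CheckpointStatus {
    if offset >= high_watermark {
        CheckpointStatus::BeyondHighWatermark
    } else if offset < low_watermark {
        CheckpointStatus::BelowLowWatermark
    } else {
        CheckpointStatus::Valid
    }
}
```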

kstaken · Jun 02 '22 21:06

> You're also going to have a similar problem if a topic has retention configured and the data expires before Quickwit reads it. The low watermark on the topic in that case will be higher than the committed offset and the offset will need to be moved forward.

I think the current spec takes this into account. In this situation, we accept the checkpoint delta and log a warning.

In other words, I think the current spec is:

pos=4 | delta=4..23  -> pos=23
pos=4 | delta=10..23 -> logs a warning + pos=23
pos=4 | delta=2..23  -> logs an error + refuses the delta
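A rough sketch of those three cases (illustrative only, not the actual quickwit_metastore checkpoint API):

```rust
/// Illustrative only, not the actual quickwit_metastore checkpoint API.
#[derive(Debug, PartialEq)]
enum DeltaOutcome {
    /// Contiguous delta: advance the position.
    Accepted,
    /// Delta starts after the current position (ingestion gap): log a
    /// warning, then advance the position.
    AcceptedWithGap,
    /// Delta starts before the current position: log an error and refuse.
    Rejected,
}

fn apply_delta(current_pos: &mut u64, delta_from: u64, delta_to: u64) -> DeltaOutcome {
    if delta_from < *current_pos {
        DeltaOutcome::Rejected
    } else if delta_from == *current_pos {
        *current_pos = delta_to;
        DeltaOutcome::Accepted
    } else {
        *current_pos = delta_to;
        DeltaOutcome::AcceptedWithGap
    }
}

// pos=4, delta=4..23  -> Accepted,        pos=23
// pos=4, delta=10..23 -> AcceptedWithGap, pos=23
// pos=4, delta=2..23  -> Rejected,        pos unchanged
```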

We do not have a special case for this error, so the indexing pipeline will restart and retry forever with exponential backoff (capped at 10 minutes), until a human fixes the situation.

fulmicoton · Jun 03 '22 00:06

One interesting thing to note is that the error is coming from the Kafka client. This is an extra clue that what is really happening is that, while the targeted topic has the same name, it is probably not the same topic (delete/recreate, or some weird "I imported my config from another cluster without caring about checkpoints" scenario).

People can work around this by removing and re-adding the source.

We should also add a reset-checkpoint operation a little bit later, and if possible give pointers in the error message (maybe add a URL to the docs?).

fulmicoton · Jun 03 '22 01:06

What do you think @kstaken ?

fulmicoton · Jun 03 '22 01:06

> In this case the correct scenario is likely to detect the missing offset and reset to the low watermark for the topic.

I'm wary of implicitly resuming from the low watermark. IMO, this behavior should be at least opt-in via a configuration parameter for the source. In production, as an operator, you probably want to take the time to understand why the committed offsets are no longer within the bounds of the topic and only then take the appropriate actions.
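Purely as a hypothetical sketch of such an opt-in parameter (nothing like this exists in Quickwit today):

```rust
// Hypothetical, does not exist in Quickwit: an opt-in source parameter
// deciding what to do when committed offsets fall outside the topic's bounds.
enum OnOutOfRangeOffsets {
    /// Default: fail loudly and wait for an operator to intervene.
    Fail,
    /// Opt-in: silently resume from the topic's low watermark.
    ResetToLowWatermark,
}
```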

Did you recreate the topic to "start fresh" as you were testing the Kafka source?

> You're also going to have a similar problem if a topic has retention configured and the data expires before Quickwit reads it. The low watermark on the topic in that case will be higher than the committed offset and the offset will need to be moved forward.

We have taken this corner case into account. The checkpoint API allows "ingestion gaps", and the Kafka source starts ingesting from the low watermark when that happens (source code).
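Roughly, the behavior looks like this (illustrative sketch, see the linked source code for the real thing):

```rust
// Illustrative sketch, not the linked Quickwit source code: pick the offset
// to resume reading a partition from.
fn resume_offset(checkpointed_offset: Option<i64>, low_watermark: i64) -> i64 {
    match checkpointed_offset {
        // Normal case: resume right after the last checkpointed record.
        Some(offset) if offset + 1 >= low_watermark => offset + 1,
        // Retention expired records past the checkpoint: skip forward; the
        // resulting ingestion gap is accepted and logged as a warning.
        Some(_) => low_watermark,
        // No checkpoint yet: start from the earliest available record.
        None => low_watermark,
    }
}
```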


Finally, regarding checkpoint management, I realize that, as of today, it is particularly painful to update a checkpoint manually; we should at least improve that.

guilload · Jun 03 '22 01:06

You beat me to it @fulmicoton :)

guilload · Jun 03 '22 01:06

In this instance I originally created the topic with a single partition and then recreated it with 10 partitions just to test behavior.

Clearly this wouldn't be a normal thing to have happen, although we certainly have done it. The main scenario where that would happen is when the data in the topic is junk for some reason and we want to quickly clear the entire topic. In that case, though, I would expect we would also delete any indexed data, so it should inherently clean itself up.

It was mostly the retries that were bothering me, along with the fact that data sent to the topic would be silently lost if you used it before seeing the error. But thinking about it more, the answer here is probably: just don't do this without also deleting the index.

I think this can probably just be closed in favor of a future feature to manually reset offsets.

kstaken · Jun 03 '22 08:06

Created #1631, closing this issue.

fulmicoton · Jun 10 '22 01:06