Retry for poster

Open smacker opened this issue 7 years ago • 12 comments

Most probably this should be an umbrella issue.

We need to decide whether we want retry logic specific to GitHub or a common retry mechanism in the server (though, for example, HTTP errors don't make much sense for the JSON poster).

We need to handle different errors differently:

  1. Temporary errors from the remote server (for example 500) - retry N times (or forever?) with an increasing delay
  2. Network errors - similar to point 1
  3. Permanent errors from the remote server (for example 400) - don't retry, but mark them somehow (for example, put them in a give-up queue) so we can retry them manually if the error was caused by our code
  4. Internal errors - not sure what to do here. Most probably similar to point 3.
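
A minimal sketch of this classification (function names, the backoff policy, and the give-up signalling are all illustrative assumptions, not existing lookout code):

```go
// Sketch only: classify the response, retry temporary/network errors with an
// increasing delay, and hand permanent errors to a give-up path.
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

var errGiveUp = errors.New("permanent error: send to give-up queue for manual retry")

func postWithRetry(url string, maxAttempts int) error {
	delay := time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := http.Post(url, "application/json", nil)
		switch {
		case err != nil:
			var netErr net.Error
			if !errors.As(err, &netErr) {
				return err // case 4: internal/non-network error, don't retry blindly
			}
			// case 2: network error, retry below
		case resp.StatusCode >= 500:
			resp.Body.Close()
			// case 1: temporary remote error, retry below
		case resp.StatusCode >= 400:
			resp.Body.Close()
			return errGiveUp // case 3: permanent remote error
		default:
			resp.Body.Close()
			return nil // success
		}
		time.Sleep(delay)
		delay *= 2 // increasing (exponential) delay between attempts
	}
	return fmt.Errorf("still failing after %d attempts", maxAttempts)
}

func main() {
	fmt.Println(postWithRetry("https://example.com/webhook", 5))
}
```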

smacker avatar Aug 17 '18 15:08 smacker

3 solutions were discussed for storing retry state:

  • in memory, global Map of URLs
  • RDBMS, add re-try count
  • external system: Queue or Redis

The first 2 solutions would be quite different from the 3rd, preferred one. So let's keep this open for now until we have a queue.
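
For illustration, the first option could look roughly like the sketch below (all names are hypothetical; the state is lost on restart and isn't shared between instances, which is part of why the queue-based option is preferred):

```go
// Sketch of the in-memory option only: a concurrency-safe map from URL to
// retry count. Illustrative, not what was chosen.
package poster

import "sync"

type retryState struct {
	mu     sync.Mutex
	counts map[string]int
}

func newRetryState() *retryState {
	return &retryState{counts: make(map[string]int)}
}

// bump records one more failed attempt for url and returns the new count.
func (r *retryState) bump(url string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[url]++
	return r.counts[url]
}
```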

bzz avatar Aug 21 '18 15:08 bzz

We already have a queue (RabbitMQ). Considering the different errors described by @smacker, what would be the retry policy?

dpordomingo avatar Jan 11 '19 16:01 dpordomingo

Just out of curiosity, is there a link to an issue/PR/minutes/design doc with the rationale behind choosing RabbitMQ as the queue instead of, for example, Kafka or something else?

se7entyse7en avatar Apr 11 '19 18:04 se7entyse7en

BTW the easiest thing would be to just re-enqueue failed jobs indefinitely, or to add a retried field. Then we can improve this solution by having another queue for "dead" jobs (as @smacker said, a 400 could be interpreted as a permanent error). If we're going to do manual operations on these dead jobs, we could also consider using a DB as an alternative to another queue. I'd use a queue if we plan to have another consumer that reads from it and does something with the jobs.
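
A rough sketch of that re-enqueue idea (the `Queue` interface and the job shape are made up for illustration, not the lookout or src-d/go-queue API):

```go
// Package poster is illustrative only.
package poster

// Job carries a retried counter so we know when to stop re-enqueueing.
type Job struct {
	URL     string `json:"url"`
	Retried int    `json:"retried"`
}

// Queue is a minimal stand-in for whatever queue client is used.
type Queue interface {
	Publish(j Job) error
}

const maxRetries = 5

// Handle decides what to do with a job after processing returned err:
// recoverable failures go back to the work queue with an incremented counter,
// permanent failures (e.g. a 400) or exhausted retries go to the dead queue
// where they can be inspected and retried manually.
func Handle(j Job, err error, permanent bool, work, dead Queue) error {
	switch {
	case err == nil:
		return nil // processed successfully
	case permanent || j.Retried >= maxRetries:
		return dead.Publish(j)
	default:
		j.Retried++
		return work.Publish(j)
	}
}
```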

I still have to think about how we could handle the increasing delay. 🤔

se7entyse7en avatar Apr 11 '19 19:04 se7entyse7en

One thing we could do for jobs that failed with a recoverable error is to push them into a "recoverable" queue with a scheduled_for key and have consumers reading from it. If the read job has a scheduled_for datetime that has elapsed, it is retried; otherwise it's re-enqueued or the consumer waits. In the case of republishing to the queue the delay won't be exact, but I don't think that matters.
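
A sketch of how such a consumer might look (the queue interface, field names, and the fixed re-delay are assumptions for illustration):

```go
// Sketch of the scheduled_for idea, not an existing lookout API.
package poster

import "time"

// RetryJob is a failed job waiting in the "recoverable" queue.
type RetryJob struct {
	URL          string    `json:"url"`
	ScheduledFor time.Time `json:"scheduled_for"`
}

// RecoverableQueue is a minimal stand-in for the delayed-retry queue.
type RecoverableQueue interface {
	Next() (RetryJob, error)
	Publish(RetryJob) error
}

// Consume reads jobs and retries the ones whose scheduled_for time has
// elapsed; the rest are re-published, so the delay is approximate, not exact.
func Consume(q RecoverableQueue, retry func(RetryJob) error) error {
	for {
		job, err := q.Next()
		if err != nil {
			return err
		}
		if time.Now().Before(job.ScheduledFor) {
			// Not due yet: put it back (alternatively the consumer could wait).
			if err := q.Publish(job); err != nil {
				return err
			}
			continue
		}
		if err := retry(job); err != nil {
			// Failed again: schedule the next attempt further in the future.
			job.ScheduledFor = time.Now().Add(time.Minute)
			if err := q.Publish(job); err != nil {
				return err
			}
		}
	}
}
```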

se7entyse7en avatar Apr 11 '19 20:04 se7entyse7en

Kafka isn't a queue.

As for other alternatives: we don't have any specific requirements for a queue, so more or less any queue would work for us. But RabbitMQ is probably the most popular and flexible solution, plus it is supported by src-d/go-queue.

smacker avatar Apr 11 '19 20:04 smacker

It's more or less clear how to deal with the errors mentioned in the description. A more interesting question is how far we want to go on the simplicity-robustness spectrum.

smacker avatar Apr 11 '19 20:04 smacker

Kafka isn't a queue.

You're right in the sense that it's not strictly FIFO (I don't know whether you mean something else); on the other hand, I would define it as a multi-partitioned queue.

plus it is supported by src-d/go-queue

So RabbitMQ was already supported by src-d/go-queue when it was chosen for lookout, and not the other way round?

A more interesting question is how far do we want to go on a range of simplicity-robustness.

Agree 👍

se7entyse7en avatar Apr 12 '19 09:04 se7entyse7en

You're right in the sense that it's not strictly FIFO (I don't know whether you mean something else); on the other hand, I would define it as a multi-partitioned queue.

Kafka is a distributed stream which can be replayed. That sometimes overlaps with what people expect from a queue. But it doesn't have an acknowledgement mechanism.

So RabbitMQ was already supported by src-d/go-queue when it was chosen for lookout, and not the other way round?

Yes. source{d} uses RabbitMQ for other projects too (at least borges), which were created long before lookout.

Agree

Then the current plan would be:

  1. use queue for posting
  2. implement simple retry for Temporary/Network errors using that queue
  3. revisit this issue again

smacker avatar Apr 12 '19 11:04 smacker

But it doesn't have an acknowledgement mechanism.

Do you mean on the producer or the consumer side? On the producer side there are actually 2 levels of ack, while on the consumer side you can commit the offset. (BTW, I've never used RabbitMQ in depth.)

Then the current plan would be:

  1. use queue for posting
  2. implement simple retry for Temporary/Network errors using that queue
  3. revisit this issue again

👍

se7entyse7en avatar Apr 12 '19 11:04 se7entyse7en

On the producer side there are actually 2 levels of ack, while on the consumer side you can commit the offset.

On the consumer side. The offset isn't an ack in the sense of a classic queue. The offset just confirms that messages up to this point were read by a consumer, but it doesn't confirm that the messages were processed.

In a classic queue you have an ack per message, and reading a message isn't the same as acknowledging it. Simple example:

  1. The consumer reads 5 messages
  2. At this point, other consumers can't read them anymore, but the queue still keeps track of them
  3. The consumer starts working on each of them in parallel
  4. Messages 1, 3, 5 are processed correctly and acked (they get removed from the queue)
  5. While processing the 2nd message, an error happens and crashes the consumer.
  6. The queue doesn't receive an ack for messages 2 and 4 within a configured time and makes them available to other consumers again, without any extra logic on the consumer side.
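
For reference, per-message acknowledgement with RabbitMQ might look roughly like this, using the github.com/streadway/amqp client (queue name and the processing function are placeholders):

```go
// Minimal sketch of per-message acknowledgement with RabbitMQ: messages that
// are never acked (e.g. the consumer crashes mid-processing) are redelivered
// by the broker to other consumers without extra logic on the consumer side.
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func process(body []byte) error { return nil } // placeholder for real work

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// autoAck=false: the broker keeps each message until we explicitly Ack it.
	deliveries, err := ch.Consume("jobs", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range deliveries {
		if err := process(d.Body); err != nil {
			d.Nack(false, true) // failed: requeue for another consumer
			continue
		}
		d.Ack(false) // processed: remove from the queue
	}
}
```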

Kafka doesn't provide such a feature out of the box afaik.

Though this conversation is completely off-topic :)

smacker avatar Apr 12 '19 11:04 smacker

Thanks for the explanation!

Yup, Kafka is a bit different, but a partition can be consumed by at most one consumer (of the same group), so usually, to be sure that a message has been processed, the offset is committed after processing rather than after reading. But it doesn't provide a per-message ack as you explained, so the consumer would have to keep track of messages that were consumed and already processed in order to skip them.

Gonna read more about RabbitMQ 👍.

Though this conversation is completely off-topic :)

Yup, sorry 😅.

se7entyse7en avatar Apr 12 '19 11:04 se7entyse7en