Retry for poster

Open smacker opened this issue 7 years ago • 12 comments

Most probably this should be an umbrella issue.

We need to decide whether we want retry logic specific to GitHub or a common retry mechanism in the server (though, for example, HTTP errors don't make much sense for the JSON poster).

We need to handle different errors differently:

  1. Temporary errors from the remote server (for example 500) - retry N times (or forever?) with an increasing delay
  2. Network errors - similar to point 1
  3. Permanent errors from the remote server (for example 400) - don't retry, but mark them somehow (for example, put them in a give-up queue) so we can retry them manually if the error was caused by our code
  4. Internal errors - not sure what to do here. Most probably similar to point 3.
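
A minimal sketch of this classification (function names, the backoff policy, and the give-up signalling are all illustrative assumptions, not existing lookout code):

```go
// Sketch only: classify the response, retry temporary/network errors with an
// increasing delay, and hand permanent errors to a give-up path.
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

var errGiveUp = errors.New("permanent error: send to give-up queue for manual retry")

func postWithRetry(url string, maxAttempts int) error {
	delay := time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := http.Post(url, "application/json", nil)
		switch {
		case err != nil:
			var netErr net.Error
			if !errors.As(err, &netErr) {
				return err // case 4: internal/non-network error, don't retry blindly
			}
			// case 2: network error, retry below
		case resp.StatusCode >= 500:
			resp.Body.Close()
			// case 1: temporary remote error, retry below
		case resp.StatusCode >= 400:
			resp.Body.Close()
			return errGiveUp // case 3: permanent remote error
		default:
			resp.Body.Close()
			return nil // success
		}
		time.Sleep(delay)
		delay *= 2 // increasing (exponential) delay between attempts
	}
	return fmt.Errorf("still failing after %d attempts", maxAttempts)
}

func main() {
	fmt.Println(postWithRetry("https://example.com/webhook", 5))
}
```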

smacker avatar Aug 17 '18 15:08 smacker

3 solutions were discussed for storing retry state:

  • in memory, global Map of URLs
  • RDBMS, add re-try count
  • external system: Queue or Redis

The first 2 solutions would be quite different from the 3rd, preferred one. So let's keep this open for now until we have a queue.
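
For illustration, the first option could look roughly like the sketch below (all names are hypothetical; the state is lost on restart and isn't shared between instances, which is part of why the queue-based option is preferred):

```go
// Sketch of the in-memory option only: a concurrency-safe map from URL to
// retry count. Illustrative, not what was chosen.
package poster

import "sync"

type retryState struct {
	mu     sync.Mutex
	counts map[string]int
}

func newRetryState() *retryState {
	return &retryState{counts: make(map[string]int)}
}

// bump records one more failed attempt for url and returns the new count.
func (r *retryState) bump(url string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[url]++
	return r.counts[url]
}
```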

bzz avatar Aug 21 '18 15:08 bzz

We already have a queue (RabbitMQ). Considering the different errors described by @smacker, what would be the retry policy?

dpordomingo avatar Jan 11 '19 16:01 dpordomingo

Just out of curiosity, is there a link to an issue/PR/minutes/design doc with the rationale behind choosing RabbitMQ as the queue instead of, for example, Kafka or something else?

se7entyse7en avatar Apr 11 '19 18:04 se7entyse7en

BTW the easiest thing would be to just re-enqueue failed jobs indefinitely, or to add a retried field. Then we can improve this solution by having another queue for "dead" jobs (as @smacker said, a 400 could be interpreted as a permanent error). If we're going to do manual operations on these dead jobs, we could also consider using a DB as an alternative to another queue. I'd use a queue if we plan to have another consumer that reads from it and does something with the jobs.
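
A rough sketch of that re-enqueue idea (the `Queue` interface and the job shape are made up for illustration, not the lookout or src-d/go-queue API):

```go
// Package poster is illustrative only.
package poster

// Job carries a retried counter so we know when to stop re-enqueueing.
type Job struct {
	URL     string `json:"url"`
	Retried int    `json:"retried"`
}

// Queue is a minimal stand-in for whatever queue client is used.
type Queue interface {
	Publish(j Job) error
}

const maxRetries = 5

// Handle decides what to do with a job after processing returned err:
// recoverable failures go back to the work queue with an incremented counter,
// permanent failures (e.g. a 400) or exhausted retries go to the dead queue
// where they can be inspected and retried manually.
func Handle(j Job, err error, permanent bool, work, dead Queue) error {
	switch {
	case err == nil:
		return nil // processed successfully
	case permanent || j.Retried >= maxRetries:
		return dead.Publish(j)
	default:
		j.Retried++
		return work.Publish(j)
	}
}
```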

I still have to think about how we could handle the increasing delay. 🤔

se7entyse7en avatar Apr 11 '19 19:04 se7entyse7en

One thing we could do for jobs that failed with a recoverable error is to push them into a "recoverable" queue with a scheduled_for key and have consumers reading from it. If the read job has a scheduled_for datetime that has elapsed, it is retried; otherwise it's re-enqueued or the consumer waits. In the case of republishing to the queue the delay won't be exact, but I don't think that matters.
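
A sketch of how such a consumer might look (the queue interface, field names, and the fixed re-delay are assumptions for illustration):

```go
// Sketch of the scheduled_for idea, not an existing lookout API.
package poster

import "time"

// RetryJob is a failed job waiting in the "recoverable" queue.
type RetryJob struct {
	URL          string    `json:"url"`
	ScheduledFor time.Time `json:"scheduled_for"`
}

// RecoverableQueue is a minimal stand-in for the delayed-retry queue.
type RecoverableQueue interface {
	Next() (RetryJob, error)
	Publish(RetryJob) error
}

// Consume reads jobs and retries the ones whose scheduled_for time has
// elapsed; the rest are re-published, so the delay is approximate, not exact.
func Consume(q RecoverableQueue, retry func(RetryJob) error) error {
	for {
		job, err := q.Next()
		if err != nil {
			return err
		}
		if time.Now().Before(job.ScheduledFor) {
			// Not due yet: put it back (alternatively the consumer could wait).
			if err := q.Publish(job); err != nil {
				return err
			}
			continue
		}
		if err := retry(job); err != nil {
			// Failed again: schedule the next attempt further in the future.
			job.ScheduledFor = time.Now().Add(time.Minute)
			if err := q.Publish(job); err != nil {
				return err
			}
		}
	}
}
```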

se7entyse7en avatar Apr 11 '19 20:04 se7entyse7en

Kafka isn't a queue.

As for other alternatives: we don't have any specific requirements for a queue, so more or less any queue would work for us. But RabbitMQ is probably the most popular and flexible solution, plus it is supported by src-d/go-queue.

smacker avatar Apr 11 '19 20:04 smacker

It's more or less clear how to deal with the errors mentioned in the description. A more interesting question is how far we want to go on the simplicity-robustness spectrum.

smacker avatar Apr 11 '19 20:04 smacker

Kafka isn't a queue.

You're right in the sense that it's not strictly FIFO (I don't know whether you mean something else); on the other hand, I would define it as a multi-partitioned queue.

plus it is supported by src-d/go-queue

So RabbitMQ was already supported by src-d/go-queue when it was chosen for lookout, and not the other way round?

A more interesting question is how far do we want to go on a range of simplicity-robustness.

Agree 👍

se7entyse7en avatar Apr 12 '19 09:04 se7entyse7en

You're right in the sense that it's not strictly FIFO (I don't know whether you mean something else); on the other hand, I would define it as a multi-partitioned queue.

Kafka is a distributed stream which can be replayed. That sometimes overlaps with what people expect from a queue. But it doesn't have an acknowledgement mechanism.

So RabbitMQ was already supported by src-d/go-queue when it was chosen for lookout, and not the other way round?

Yes. source{d} uses RabbitMQ for other projects too (at least borges), which were created long before lookout.

Agree

Then the current plan would be:

  1. use queue for posting
  2. implement simple retry for Temporary/Network errors using that queue
  3. revisit this issue again

smacker avatar Apr 12 '19 11:04 smacker

But it doesn't have an acknowledgement mechanism.

Do you mean on the producer or the consumer side? On the producer side there are actually 2 levels of ack, while on the consumer side you can commit the offset. (BTW, I've never used RabbitMQ in depth.)

Then the current plan would be:

  1. use queue for posting
  2. implement simple retry for Temporary/Network errors using that queue
  3. revisit this issue again

👍

se7entyse7en avatar Apr 12 '19 11:04 se7entyse7en

On the producer side there are actually 2 levels of ack, while on the consumer side you can commit the offset.

On the consumer side. The offset isn't an ack in the sense of a classic queue. The offset just confirms that messages up to this point were read by a consumer, but it doesn't confirm that the messages were processed.

In a classic queue you have an ack per message, and reading a message isn't the same as acknowledging it. Simple example:

  1. The consumer reads 5 messages
  2. At this point, other consumers can't read them anymore, but the queue still keeps track of them
  3. The consumer starts working on each of them in parallel
  4. Messages 1, 3, 5 are processed correctly and acked (they get removed from the queue)
  5. While processing the 2nd message, an error happens and crashes the consumer.
  6. The queue doesn't receive an ack for messages 2 and 4 within a configured time and makes them available to other consumers again, without any extra logic on the consumer side.
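
For reference, per-message acknowledgement with RabbitMQ might look roughly like this, using the github.com/streadway/amqp client (queue name and the processing function are placeholders):

```go
// Minimal sketch of per-message acknowledgement with RabbitMQ: messages that
// are never acked (e.g. the consumer crashes mid-processing) are redelivered
// by the broker to other consumers without extra logic on the consumer side.
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func process(body []byte) error { return nil } // placeholder for real work

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// autoAck=false: the broker keeps each message until we explicitly Ack it.
	deliveries, err := ch.Consume("jobs", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range deliveries {
		if err := process(d.Body); err != nil {
			d.Nack(false, true) // failed: requeue for another consumer
			continue
		}
		d.Ack(false) // processed: remove from the queue
	}
}
```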

Kafka doesn't provide such a feature out of the box afaik.

Though this conversation is completely off-topic :)

smacker avatar Apr 12 '19 11:04 smacker

Thanks for the explanation!

Yup, Kafka is a bit different, but a partition can be consumed by at most one consumer (of the same group), so usually, to be sure that a message has been processed, the offset is committed after processing rather than after reading. But it doesn't provide a per-message ack as you explained, so the consumer would have to keep track of messages that were consumed and already processed in order to skip them.

Gonna read more about RabbitMQ 👍.

Though this conversation is completely off-topic :)

Yup, sorry 😅.

se7entyse7en avatar Apr 12 '19 11:04 se7entyse7en