
Add retries

Open steve-chavez opened this issue 1 year ago • 6 comments

Hey there! I'd like a feature that doesn't seem to be present in Supabase regarding webhooks. Sometimes my API gets backed up or is down for a few seconds, and webhooks may fail. Some functions rely on webhooks to work and to provide accurate data to customers, so, like Stripe, I would like an automated retry system.

If a webhook fails, it would automatically retry after some interval, say 30 minutes. If that fails, it retries after an hour, up to perhaps 5 retries total. That way, no webhook is missed.
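The schedule described above could be sketched like this (a hypothetical helper, not anything pg_net ships):

```python
from datetime import timedelta
from typing import Optional

# Hypothetical helper (not part of pg_net): the schedule proposed above --
# first retry after 30 minutes, then hourly, for at most 5 retries total.
def next_retry_delay(attempt: int) -> Optional[timedelta]:
    """Delay before retry number `attempt` (1-based), or None
    once the 5-retry budget is exhausted."""
    if attempt < 1 or attempt > 5:
        return None
    return timedelta(minutes=30) if attempt == 1 else timedelta(hours=1)
```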

Originally posted by @Wamy-Dev in https://github.com/supabase/supabase/discussions/17664

steve-chavez avatar Sep 25 '23 17:09 steve-chavez

This is already possible with some work, as demonstrated in https://github.com/supabase/pg_net#retrying-failed-requests, but it would be better to have retries built in.

The curl CLI supports retries (https://everything.curl.dev/usingcurl/downloads/retry), but libcurl doesn't have this functionality, so we need to implement our own.

steve-chavez avatar Sep 25 '23 18:09 steve-chavez

In Issue 62, you mentioned the following:

An alternative [to error recovery] could be logging the failed http requests. Then the user would have to search the pg logs.

Originally, the plan was to provide users with the option to log error messages to a designated logger for handling:

... We [can] optionally log something based on the status code when the response is handled?

We should have a config to turn this on/off, maybe named like pg_net.log_http_errors (off by default). - Issue 49

However, this feature has not been implemented. The config variable for enabling logging is absent in the worker.c codebase. Nonetheless, there may be potential benefits in expanding upon this concept.

Rather than recording request failure messages in a log file, the config variable could instruct the extension to automatically reintegrate all dropped requests back into the "net.http_request_queue" table. This approach, unlike the existing solution, would require adding just a single config variable and replacing one of the error logs with the C equivalent of an INSERT statement.

It is important to emphasize that this suggestion would exclusively monitor requests that failed due to an extension-related error. Requests resulting in 400 or 500 HTTP status errors are already logged in net._http_response and should be managed by users. Such failures may stem from human errors in request endpoint formatting, potentially leading to an endless loop of rejected retry attempts. For users wishing to oversee these specific failed requests, the existing README suggestion provides a safer mechanism, albeit with greater overhead.
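That distinction could be sketched as follows (illustrative Python, not pg_net internals; the function name is hypothetical):

```python
# Illustrative sketch, not pg_net internals: decide whether a failed
# request is safe to re-enqueue. A request that got any HTTP response
# (even a 400 or 500) is already recorded in net._http_response and is
# left to the user; only requests with no response at all -- an
# extension/transport failure such as a timeout or DNS error -- are
# candidates for automatic re-insertion into net.http_request_queue.
def should_reenqueue(status_code, error_msg):
    if status_code is not None:
        # The server answered; blindly retrying a 4xx caused by a
        # malformed endpoint would loop on the same rejection forever.
        return False
    return error_msg is not None  # e.g. "Timeout was reached"
```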

TheOtherBrian1 avatar Sep 25 '23 23:09 TheOtherBrian1

@TheOtherBrian1 Great points.

Requests resulting in 400 or 500 HTTP status errors are already logged in net._http_response and should be managed by users. Such failures may stem from human errors in request endpoint formatting, potentially leading to an endless loop of rejected retry attempts

True. I think the only safe way to auto-retry would be by following the Retry-After header; however, I don't think all services implement it.
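For reference, honoring Retry-After means handling both of its forms, delta-seconds and an HTTP-date. A minimal Python sketch (illustrative, not pg_net code):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, now=None):
    """Seconds to wait per a Retry-After header (delta-seconds or
    HTTP-date), or None if the header is absent or unparseable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():           # e.g. "Retry-After: 120"
        return int(value)
    try:                          # e.g. "Retry-After: Wed, 27 Sep 2023 ..."
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:       # treat a naive date as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))
```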

We should have a config to turn this on/off, maybe named like pg_net.log_http_errors (off by default). - https://github.com/supabase/pg_net/issues/49#issuecomment-948844080

However, this feature has not been implemented.

Agree, that would be higher priority than this idea.

Rather than enabling the recording of request failure messages in a log file, the config variable could instruct the extension to automatically reintegrate all dropped requests back into the "net.http_request_queue" table.

It is important to emphasize that this suggestion would exclusively monitor requests that failed due to an extension-related error.

Hm, do you have examples of extension-related errors? I don't recall one. IIUC, it's an internal error rather than, say, a 400 Bad Request.

steve-chavez avatar Sep 27 '23 19:09 steve-chavez

Hm, do you have examples of extension-related errors?

I haven't experienced any yet, but I've thought about the potential for them to happen.

TheOtherBrian1 avatar Sep 28 '23 06:09 TheOtherBrian1

Such failures may stem from human errors in request endpoint formatting, potentially leading to an endless loop of rejected retry attempts.

We run the retries on an exponential scale with increasing periods, which ensures resilience for up to 24 hours in case of connectivity downtime due to updates, issues, etc.

^ From https://github.com/orgs/supabase/discussions/17664#discussioncomment-7208243

We could cap the retrying as suggested above; that way we avoid endless loops.
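A capped exponential schedule like the one quoted above might look like this (the base delay, factor, and 24-hour budget are illustrative assumptions, not anything pg_net implements):

```python
def backoff_schedule(base=30.0, factor=2.0, max_total=24 * 3600):
    """Yield exponentially growing delays (in seconds) until the
    cumulative wait would exceed `max_total` -- a cap that avoids
    the endless-retry loop discussed above."""
    delay, total = base, 0.0
    while total + delay <= max_total:
        yield delay
        total += delay
        delay *= factor

# 30, 60, 120, ... doubling until roughly 24h of cumulative waiting
delays = list(backoff_schedule())
```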

steve-chavez avatar Oct 06 '23 15:10 steve-chavez

Currently using Supabase to fire off webhooks to vercel endpoints.

Vercel has been running into issues with Undici when there's CPU contention, which makes retries with exponential backoff almost mandatory for event-driven applications.

Wasn't quite sure how to implement the official retry solution: https://github.com/supabase/pg_net#retrying-failed-requests

Instead I used QStash; so far it's working like a charm.

lookevink avatar Aug 10 '24 18:08 lookevink

@lookevink is QStash still the way to go for an event driven system or have you learnt anything new so far?

I need to get db changes on a background worker on trigger.dev. I was hoping to send straight, or at least through my app on vercel and then to trigger.dev.

Going to QStash for durability seems like adding more complexity.

Do you mind explaining the Vercel Undici CPU issue? I thought each request was its own serverless call?

shawnmclean avatar Oct 06 '24 20:10 shawnmclean

I'm sticking with QStash for now because I'm hosting on Vercel, which has been running into CPU contention on some serverless functions.

Trigger.dev seems like a great way to handle this, but we have bigger fish to fry since the current setup works.


lookevink avatar Oct 06 '24 22:10 lookevink

@shawnmclean, I've been thinking about rewriting the pg_net retry section. I have an early draft. I'm fairly confident in stages 1-6 and 8. It can be used as a basis for any retry system.

I'm personally not a huge fan of using PL/pgSQL for retries, even though that's what I'm currently working on. It defeats the purpose of using unlogged tables for performance gains, a foundational advantage of pg_net. However, this approach does make the solution configurable for Postgres developers and gives them the most control.

TheOtherBrian1 avatar Oct 08 '24 01:10 TheOtherBrian1