pg_net

Webhooks fail when multiple inserts are done

Open devdras opened this issue 2 years ago • 46 comments

Bug report

Describe the bug

I have a webhook configured to watch a table for inserts. On an insert, a POST request should be fired to an API hosted on Vercel. This works when a single insert operation is carried out. If multiple inserts are carried out in one call, e.g. .insert([{message: 1}, {message: 2}, {message: 3}]), then the webhooks aren't fired at all. If we instead carry out multiple inserts by calling the Supabase API multiple times, the webhook is fired for only some of the inserts.

To Reproduce

Steps to reproduce the behavior, please provide code snippets or a repository:

  1. Configure a webhook on a table to watch for insert operations and call an external API

  2. Use the Supabase JS library to perform multiple inserts, e.g.: let { data, error } = await supabase.from("example_table").insert([{message: 1}, {message: 2}, {message: 3}, {message: 4}, {message: 5}, {message: 6}, etc.]).select("message");

  3. Check logs for external API to confirm it didn't receive any requests

  4. See error

Note: I was inserting 32 rows at once.
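
For completeness, the same behaviour can also be reproduced straight from SQL with a single multi-row insert, as later comments in this thread do. A minimal sketch with a placeholder table and column (assumed to be text here):

INSERT INTO public.example_table (message)
SELECT g::text FROM generate_series(1, 32) AS g;  -- 32 rows, matching the batch size above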

Expected behavior

For each inserted row, I would expect the external API to be called. It would also help if there were a log in Supabase showing what happens when the webhook runs.
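
For reference, pg_net does record the outcome of every request it sends in the net._http_response table, which is the closest thing to such a log today. A minimal query to inspect recent webhook attempts (column names assume the standard pg_net schema):

SELECT id, status_code, timed_out, error_msg, created
FROM net._http_response
ORDER BY id DESC
LIMIT 20;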


System information

  • OS: N/A
  • Browser (if applies): N/A
  • Version of supabase-js: 2.0.0-rc.8
  • Version of Node.js: v16.17.0 on desktop, 18.x on Vercel (per their docs)


devdras avatar Jan 10 '23 13:01 devdras

Thank you for posting this. I'm running into a similar error.

Surprisingly, when I create 10 records at the same time it fires the function for each of the 10 records...

INSERT INTO public.profiles (id, first_name, last_name, stripe_customer_id, email)
SELECT u.id, NULL, NULL, NULL, u.email
FROM auth.users u
LEFT JOIN public.profiles p ON u.id = p.id
WHERE p.id IS NULL
LIMIT 10;

However, if I fire the function for 100 records...

INSERT INTO public.profiles (id, first_name, last_name, stripe_customer_id, email)
SELECT u.id, NULL, NULL, NULL, u.email
FROM auth.users u
LEFT JOIN public.profiles p ON u.id = p.id
WHERE p.id IS NULL
LIMIT 100;

It fails

danrasmuson avatar Feb 09 '23 22:02 danrasmuson

Yup, basically the webhook behaviour is flaky. I haven't tried it recently. I wanted to use this to build an event-driven architecture, but can't without it being reliable. I ended up using QStash as a replacement for the webhook. Would love for Supabase to have some kind of full-featured event queue like SQS/EventBridge/Service Bus.

devdras avatar Feb 09 '23 22:02 devdras

We've tested pg_net sending 1K inserts per second, though that was against a local nginx server.

@devdras @danrasmuson Which webhook service are you using? Likely it's rate limiting the requests.

steve-chavez avatar Feb 10 '23 23:02 steve-chavez

We've tested pg_net sending 1K inserts per second, though that was against a local nginx server.

@devdras @danrasmuson Which webhook service are you using? Likely it's rate limiting the requests.

I'm using Vercel with serverless functions.

danrasmuson avatar Feb 11 '23 01:02 danrasmuson

We're experiencing the same problem. When inserting records in sequence, all the webhooks are received. But if we insert 30 rows in parallel, none of the webhooks are sent and the net._http_response table is full of "Timeout was reached" errors.

matt-aitken avatar Oct 09 '23 14:10 matt-aitken

@matt-aitken Are you using vercel serverless functions too?

steve-chavez avatar Oct 09 '23 16:10 steve-chavez

@steve-chavez We're able to reliably reproduce it based on the number of new entries that get bulk-added to the table. We've been in touch with your support team about this as well, but have been unable to resolve it.

  • The webhook trigger is calling an AWS Lambda function
  • The Lambda function does not record any invocations if the number of records inserted is greater than 150
    • This is an all-or-nothing behaviour: if we insert fewer records, all invocations happen; if we insert more, none of the invocations happen
    • It's not the case that only some of the invocations don't come through
    • It's not the case that the invocation reaches the Lambda function and the webhook times out before the function responds
  • In Supabase, the _http_response table only logs "Timeout was reached", nothing else
  • We have tested this with different webhooks, including webhook.site. No invocations of the webhook are received

Happy to prepare a screen-recording or do a walk through if helpful.

lohrmann avatar Oct 13 '23 00:10 lohrmann

The Lambda function does not record any invocations if the number of records inserted is greater than 150

@lohrmann That might be related to lambdas rate limiting https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html#api-requests

We're able to reliably reproduce it based on the number of new entries that get bulk added to the table

By default there are 200 requests per iteration https://github.com/supabase/pg_net/blob/master/src/worker.c#L479-L483

I'll make this parameter user configurable.
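
For anyone reading later, a rough sketch of what tuning that could look like once it is configurable, assuming a pg_net.batch_size GUC and the net.worker_restart() helper (both should be verified against the pg_net version and docs you are on; the exact ALTER statement may also differ on hosted Supabase):

-- Sketch only: shrink the number of requests the background worker sends per iteration.
-- pg_net.batch_size and net.worker_restart() are assumptions based on later pg_net releases.
ALTER SYSTEM SET pg_net.batch_size TO 100;
SELECT net.worker_restart();  -- restart the worker so it picks up the new setting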

steve-chavez avatar Oct 18 '23 16:10 steve-chavez

Curl has:

curl --rate 2/s -O https://example.com/[1-100].jpg

https://everything.curl.dev/usingcurl/transfers/request-rate#examples

I think that should be the final solution here.


Note: --rate is a feature of the curl command-line tool and is not available in libcurl.

steve-chavez avatar Oct 19 '23 02:10 steve-chavez

Sorry for the slow response on this. No, we're receiving webhooks to AWS EC2 (a long-running server).

There's nothing in our server logs showing requests received from Supabase (when inserting many rows), and the net._http_response table is full of "Timeout was reached" errors.

We (Trigger.dev) make it easy for our customers to subscribe to these Supabase database webhooks and quite a few people are hitting this problem now.

matt-aitken avatar Nov 03 '23 11:11 matt-aitken

We're experiencing the same problem. When inserting records in sequence, all the webhooks are received. But if we insert 30 rows in parallel, none of the webhooks are sent and the net._http_response table is full of "Timeout was reached" errors.

I have the same issue. I'm using Google Cloud Functions v2 (Cloud Run). The table is full of "Timeout was reached" errors, but Google Cloud Monitoring doesn't show any logs.

ahanusek avatar Dec 31 '23 09:12 ahanusek

I'm running into this issue as well. I spun up a clean environment to reproduce, and did some testing. Supabase project qutskxfebtlgwyzsonvd. The setup is:

  • A users table
  • A queue table
  • A cron job that inserts one item into the queue for each user
  • A database webhook that sends the queue insert to an HTTPS endpoint, an AWS API Gateway
  • AWS API Gateway configured for asynchronous invocation, which always returns a 200 response (typically in <80ms)
  • AWS Lambda function that is just sleep(5)

I ran a handful of tests (spreadsheet here, screenshot below). There appears to be a roughly linear relationship between the number of requests (in my case, the number of users for which the cron job inserts items into the queue) and the webhook timeout required. By my math, the webhook needs about 33.33 ms per item to process. If only 20 ms per item is available within the configured webhook timeout, most or all of the events fail to be sent to the API Gateway.
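
As a rough worked example of that rate (illustrative arithmetic only; the 2 s and 4-5 s figures are the default and increased timeouts discussed elsewhere in this thread):

100 items x ~33.33 ms/item ≈ 3,333 ms needed
2,000 ms (default timeout)  -> most or all requests time out
4,000-5,000 ms timeout      -> enough headroom for the batch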

Unlike the note from @lohrmann above, I did see some partial successes.

[Screenshot: table of test results, 2024-01-06]

reedshea avatar Jan 06 '24 23:01 reedshea

@reedshea Thanks for the detailed information.

@TheOtherBrian1 Shared a reproducible example: https://gist.github.com/TheOtherBrian1/e479426f0070d5f0879483cf59f85dfb

The setup is basically:

CREATE TABLE webhook_test (
    id SERIAL,
    sample text
);

CREATE OR REPLACE FUNCTION debugging_example()
RETURNS TRIGGER
AS $$
DECLARE
    -- Store payload data
    row_data_var JSONB;
    row_id INT;
BEGIN
    -- Retrieve row data and convert to JSON
    SELECT TO_JSONB(NEW.*) INTO row_data_var;
    -- Initiate HTTP POST request to URL
    SELECT
        net.http_post(
            url := 'https://postman-echo.com/post',
            -- Use row data as payload
            body := row_data_var
        ) INTO row_id;

    RAISE LOG
        'LOGGING WEBHOOK |
        The Webhook was called at % |
        ID = % |
        Logging an entire row as JSON (%)', (SELECT NOW()), row_id, row_data_var;
    -- Return the row data to inspect
    RETURN NEW;
-- Handle exceptions here if needed
EXCEPTION
    WHEN others THEN
        RAISE EXCEPTION 'LOGGING WEBHOOK FAILED: An error occurred: %', SQLERRM;
END;
$$ LANGUAGE plpgsql;


CREATE TRIGGER webhook
    AFTER INSERT ON webhook_test
    FOR EACH ROW
    EXECUTE FUNCTION debugging_example();

CREATE OR REPLACE FUNCTION insert_test_data(num_rows INT)
RETURNS VOID
AS $$
DECLARE
    i INT;
BEGIN
    FOR i IN 1..num_rows LOOP
        INSERT INTO webhook_test(sample) VALUES (i || 'Call');
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Then execute a request:

INSERT INTO webhook_test (sample)
VALUES ('test value');

Then a batch of requests:

SELECT insert_test_data(100);

Locally, all requests succeed fast:

select count(*) from net._http_response where error_msg is null;
-[ RECORD 1 ]
count | 101

select count(*) from net._http_response where error_msg is not null;
-[ RECORD 1 ]
count | 0

But, for some reason, on Supabase cloud only the single request succeeds and the batched ones fail:

select count(*) from net._http_response where error_msg is null;
count
1
select count(*) from net._http_response where error_msg is not null;
count
100

Will debug this further.


I should add that Supabase is on pg_net 0.7.1, while the latest version is 0.7.3. I've tried 0.7.1 locally and the result is the same (all requests succeed).

steve-chavez avatar Jan 11 '24 01:01 steve-chavez

By increasing the default timeout (2 seconds) to 4 seconds, all the above requests succeed on Supabase cloud:

        net.http_post(
            url := 'https://postman-echo.com/post',
            -- Use row data as payload
            body := row_data_var,
            timeout_milliseconds := 4000
        ) INTO row_id;
SELECT insert_test_data(100);

select count(*) from net._http_response where error_msg is null;
count
100

Will increase the default to 5 seconds.

steve-chavez avatar Jan 18 '24 12:01 steve-chavez

Will increase the default to 5 seconds.

Is that setting the timeout for the whole batch? Based on @reedshea's graph, it seems the whole batch needs enough time per call. Or is he setting some other timeout?

vickkhera avatar Jan 18 '24 15:01 vickkhera

I just ran into this in my staging environment. I updated a table, which caused 121 events to be published, all in one transaction (a single UPDATE statement). All 121 http_post requests failed with a timeout error after ~2 seconds. I bumped timeout_milliseconds to 4000, but that wasn't sufficient; I had to bump it to 6000 for all of them to succeed. I'm submitting task requests to Inngest, which responds really fast.

I think the right thing is to multiply the timeout by the number of requests per batch if it is not possible to make it a true per-request timeout. Passing the timeout on a per-request basis gives the developer the expectation that it applies to that one request, not to the entire batch that the request happens to be included in by the pg_net batching algorithm.
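
As a minimal sketch of that idea (illustrative only: the endpoint and payload are placeholders, and the ~50 ms per-request figure is just an estimate derived from the numbers above, not a pg_net guarantee):

SELECT net.http_post(
    url := 'https://example.com/task',        -- placeholder endpoint
    body := jsonb_build_object('id', 42),     -- placeholder payload
    timeout_milliseconds := 50 * 121          -- ~50 ms per request x 121 rows ≈ 6,050 ms
);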

What happens if different requests get passed different timeouts? Which one is used for that batch?

vickkhera avatar Feb 05 '24 19:02 vickkhera

@vickkhera I wasn't actually aware of the cause of the bug - I only knew how to reproduce it - but you seem to have noticed that the timeout occurs when a batch (default 500) does not complete within the set timeout period. A later batch would actually succeed if its queue were smaller or the timeout were higher.

I assumed that the timeout setting applied to individual cURL requests and not to the entire batch. @steve-chavez, is this the intended behavior?

TheOtherBrian1 avatar Feb 06 '24 05:02 TheOtherBrian1

The timeouts are per-request. Will add a test to confirm soon.

steve-chavez avatar Feb 09 '24 16:02 steve-chavez

The timeouts are per-request. Will add a test to confirm soon.

There's no way my requests were taking 2s each. The whole batch timed out in ~2-3 seconds.

vickkhera avatar Feb 09 '24 18:02 vickkhera

We're running into this same issue. The net._http_response table shows the "Timeout was reached" error, yet the timed_out column remains null. This is happening with batches of 8 or fewer, and it seems that after the initial failure, subsequent inserts into the table with the trigger fail even with a batch of 1.
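
For anyone checking the same thing, the pattern shows up clearly when grouping the response table by error (a sketch, assuming the standard net._http_response columns):

SELECT error_msg, timed_out, COUNT(*)
FROM net._http_response
GROUP BY 1, 2
ORDER BY 3 DESC;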

andrueandersoncs avatar Feb 16 '24 20:02 andrueandersoncs

I'm having this problem as well. I can reproduce it based on the number of new entries that get bulk-added to the table. Around 10 work; more than that gets a significant % of errors.

  • The webhook trigger is calling an AWS Lambda function (via API Gateway).
  • The Lambda function does not record any invocations when too many records are inserted
  • It's not the case that only some of the invocations don't come through
  • It's not the case that the invocation reaches the Lambda function and the webhook times out before the function responds
  • Other times it doesn't time out and just returns a 500 with "Internal Server Error"
  • In Supabase, the _http_response table logs everything

All using pg_net and Supabase webhooks. Would love to know if there's a solution. I'm using the Lambda to then load SQS; thinking of skipping this step now...

ham-evans avatar Mar 21 '24 23:03 ham-evans

I'm having this problem as well. I can reproduce it based on the number of new entries that get bulk-added to the table. Around 10 work; more than that gets a significant % of errors.

  • The webhook trigger is calling an AWS Lambda function (via API Gateway).
  • The Lambda function does not record any invocations when too many records are inserted
  • It's not the case that only some of the invocations don't come through
  • It's not the case that the invocation reaches the Lambda function and the webhook times out before the function responds
  • Other times it doesn't time out and just returns a 500 with "Internal Server Error"
  • In Supabase, the _http_response table logs everything

All using pg_net and Supabase webhooks. Would love to know if there's a solution. I'm using the Lambda to then load SQS; thinking of skipping this step now...

I am doing something similar, except API Gateway populates SQS, which then calls a Lambda function. It worked fine on small data sets, but when I had to make over 100 calls I saw behavior similar to what you (and others) describe. I set the timeout to 5 seconds and it is working for me now, although that feels very fragile; who knows if it will work when I have a big burst of calls to make. The UI says 5000 ms is the max timeout, but I believe you can set it higher than that if you create the function via the SQL editor, though I'm not sure whether that has any side effects.

waylon999 avatar Mar 21 '24 23:03 waylon999

I'm having this problem as well. I can reproduce it based on the number of new entries that get bulk-added to the table. Around 10 work; more than that gets a significant % of errors.

  • The webhook trigger is calling an AWS Lambda function (via API Gateway).
  • The Lambda function does not record any invocations when too many records are inserted
  • It's not the case that only some of the invocations don't come through
  • It's not the case that the invocation reaches the Lambda function and the webhook times out before the function responds
  • Other times it doesn't time out and just returns a 500 with "Internal Server Error"
  • In Supabase, the _http_response table logs everything

All using pg_net and Supabase webhooks. Would love to know if there's a solution. I'm using the Lambda to then load SQS; thinking of skipping this step now...

I am doing something similar, except API Gateway populates SQS, which then calls a Lambda function. It worked fine on small data sets, but when I had to make over 100 calls I saw behavior similar to what you (and others) describe. I set the timeout to 5 seconds and it is working for me now, although that feels very fragile; who knows if it will work when I have a big burst of calls to make. The UI says 5000 ms is the max timeout, but I believe you can set it higher than that if you create the function via the SQL editor, though I'm not sure whether that has any side effects.

Yeah, I have it set to 5000 ms and I'm still having the problem. I agree it seems fragile regardless. Would love to know whether doing it via the SQL editor solves it.

Otherwise I'm going to skip the webhook altogether, as I can't risk it in production.

ham-evans avatar Mar 21 '24 23:03 ham-evans

@waylon999 You can set the timeout higher than 5000 ms; that limit is only imposed by the UI. You can write the webhooks directly with pg_net or use the supabase_functions.http_request helper function.
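
A minimal sketch of what that can look like, modeled on the triggers the dashboard generates (the table name, trigger name, and URL are placeholders, and the argument order of supabase_functions.http_request should be double-checked against a webhook your own project generated):

CREATE TRIGGER my_table_webhook
    AFTER INSERT ON public.my_table
    FOR EACH ROW
    EXECUTE FUNCTION supabase_functions.http_request(
        'https://example.com/webhook',              -- placeholder endpoint
        'POST',                                     -- HTTP method
        '{"Content-Type":"application/json"}',      -- headers
        '{}',                                       -- params
        '10000'                                     -- timeout in ms, above the UI's 5000 ms cap
    );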

TheOtherBrian1 avatar Mar 22 '24 03:03 TheOtherBrian1

This bug has been open for a year, and it's really frustrating that issues like this are not properly addressed by the Supabase team. They are adding other advanced features like branching, but the core remains unstable, and I don't see a significant effort to fix that. I migrated a relatively small project from Firestore/Firebase to Supabase because I believed in their marketing, and now I really regret this decision. Currently, I'm paying more for significantly worse quality:

  • Webhooks are unstable. In my previous architecture, I heavily relied on Firestore triggers to perform some operations in the background. Now, they can randomly fail without any option to retry, which is a HUGE problem when something needs to happen in the background of our app.

  • Realtime functionality is useless; it only works reliably when you have an active tab in the browser. Otherwise you need to figure out the whole mechanism for refetching data and reconnecting after the tab comes back from inactive. I don't even want to mention how painful this is on mobile (React Native). With Firebase, I didn't even have to think about it; it just worked out of the box.

  • Supavisor / PgBouncer - Random failures when connecting to the DB from cloud functions / AWS Lambda ("Connection terminated unexpectedly").

  • Upgrading your DB? Okay, our app will be unavailable for minutes or hours, and then you need to figure out why something doesn't work.

  • Storage - You can upload from React Native, but first you need to convert the entire file to base64 and then use an ArrayBuffer.

But yes, Supabase is open-sourced. I guess I'll tell my clients - our app doesn't work properly, but at least you can read the source code.

ahanusek avatar Mar 22 '24 06:03 ahanusek

@waylon999 You can set the timeout to greater than 5000ms. That's a limitation imposed by the UI. You can write the webhooks directly in pg_net or use the supabase_functions.http_request helper function.

Does this solve the problem, though? It seems fragile regardless of the timeout (especially since it's unclear whether the timeout needs to be 5 seconds or 30).

ham-evans avatar Mar 28 '24 21:03 ham-evans

Any new updates on this?

joaopegoraro avatar May 02 '24 19:05 joaopegoraro

Any new updates on this?

robertbodley avatar May 10 '24 20:05 robertbodley

pg_net has been super inconsistent; it also tends to hang sporadically. I tried to build an event-driven app but decided to give up and rely on a cron job instead.

kevinlooESO avatar Jun 10 '24 03:06 kevinlooESO

Is there any ETA on this? It's blocking me and I'll need to find a workaround. Thanks!

matiastucci avatar Jun 17 '24 10:06 matiastucci