pg_graphql icon indicating copy to clipboard operation
pg_graphql copied to clipboard

Result::unwrap() on an Err value: PoisonError { .. }

Open lopezjurip opened this issue 1 year ago • 5 comments

Describe the bug GraphQL queries fails in a non-deterministic way when using this extension in advanced ways. Eg: computed fields, computed relationships, exposing views, calling functions, etc.

To Reproduce It has been really hard to reproduce. It just happens when adding more views and functions to the "public" schema. Maybe it's crashing on high-resource consumption queries?

Steps to reproduce the behavior:

  1. Add resolvers, views and have a lot of tables and extra schemas.
  2. ????
  3. Error will appear (sometimes)
  4. Remove random stuff until you don't see it crashing too often.

Last time i got this error was on this custom handler to query GraphQL with security definer:

create or replace function public.graphql_admin(
  "operationName" text DEFAULT NULL::text,
  query text DEFAULT NULL::text,
  variables jsonb DEFAULT NULL::jsonb,
  extensions jsonb DEFAULT NULL::jsonb
)
returns jsonb
language plpgsql
security definer -- skip RLS
stable
as $$
  declare
    _viewer public.profiles;
  begin
    _viewer := public.viewer_profile_admin(strict := true); -- raise an error if not an admin
    return graphql.resolve(
      query := query,
      variables := coalesce(variables, '{}'),
      "operationName" := "operationName",
      extensions := extensions
    );
  end;
$$;

Expected behavior Do not crash (or at least describe why it's crashing)

Screenshots Screenshot 2024-02-07 at 23 25 59

Versions:

  • PostgreSQL: PostgreSQL 15.1 (Ubuntu 15.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit
  • pg_graphql commit ref: ea2ab60 (version 1.4.2)

Additional context Add any other context about the problem here.

lopezjurip avatar Feb 08 '24 02:02 lopezjurip

PoisonError indicates a panic while a lock is held. See https://doc.rust-lang.org/std/sync/struct.PoisonError.html. pg_graphql doesn't itself use locks but one of its dependencies might. Is it on hosted platform? If yes, could you please share the project ref so that I can view Postgres logs. If not, can you share Postgres logs from your environment.

imor avatar Feb 08 '24 07:02 imor

@imor yes my project is hosted in Supabase. How can I share privately my project ref? I could open a ticket in support service.

lopezjurip avatar Feb 08 '24 12:02 lopezjurip

Yes, opening a ticket would work. Mention this GH issue in it.

imor avatar Feb 08 '24 13:02 imor

Was this issue ever resolved? I, too, am seeing PoisonErrors in my project and am having a very hard time nailing down exactly what is going wrong.

dns-developers avatar Apr 24 '24 21:04 dns-developers

@dns-developers sorry you're running into this

please open an issue with https://supabase.com/dashboard/support/new and include:

  • a link to this thread
  • the impacted project reference
  • be sure to tick the "allow support access" option so we can look at your instance
  • a GraphQL query we can use to reproduce the problem
  • and something along the lines of "I approve an upgrade to the latest version of pg_graphql" so we can upgrade the extension to get the latest debug info if needed (requires no downtime)

olirice avatar Apr 24 '24 21:04 olirice

This is what causes this error:

  1. A graphql query panics somewhere in the code called from functions which are annotated with the cached attribute.
  2. Since the annotated function was called with the lock held, the panic causes the mutex to be poisoned.
  3. The panic is caught by the catch_unwind machinery in pgrx and an error is returned for this graphql query.
  4. Any subsequent queries, when they try to lock the poisoned mutex, fail with the called Result::unwrap() on an Err value: PoisonError { .. } error.

I was able to reproduce this by hardcoding a panic! call inside one of the functions annotated with the cached attribute.

A workaround for this issue is to figure out the query that caused that first error and fix that. It is very likely that that first error is itself a separate bug, so feel free to open an issue for it. Of course, once somebody misses that first error it can be hard to track due to this bug's opaque error message.

To fix this bug's error I'm going to remove sync_writes = true from the cached attributes. This will result in calling the annotated function after the mutex is unlocked. This means any panics will not poison the mutex and subsequent queries will not fail with this error. This should be safe since each query runs in its own backend process which runs in a single thread. @olirice I'm not 100% sure about the multithreaded nature of Postgres. Is my assumption about disabling sync_writes true?

imor avatar Apr 29 '24 08:04 imor