
Possible memory leak using experimental feature Execution Cancellation

Open gunters63 opened this issue 3 months ago • 26 comments

Issue workflow progress

Progress of the issue based on the Contributor Workflow

  • [ ] 1. The issue provides a reproduction available on Github, Stackblitz or CodeSandbox

    Make sure to fork this template and run npm run generate in the terminal.

    Please make sure the GraphQL Tools package versions under package.json match yours.

  • [ ] 2. A failing test has been provided
  • [ ] 3. A local solution has been provided
  • [ ] 4. A pull request is pending review

Describe the bug

While analyzing a possible memory leak in a graphql-yoga server with Node.js heap profiling and Chrome DevTools, I saw some suspicious results that could indicate a memory leak in @graphql-tools/executor.

The first screenshot shows a comparison of two heap snapshots of the service; the latter one (Snapshot 4) was taken after about 10,000 more mutation requests than Snapshot 3. Memory is very stable except for some new allocations (see the red frame). The number of new allocations is a multiple (x2, x4 or x6) of the number of received requests.

Image

If I drill down into the source of the allocations, it points to this place (@graphql-tools/executor/cjs/execution/execute.js):

Image

To Reproduce

Steps to reproduce the behavior:

If these really are leaking event listeners, every graphql-yoga server using @graphql-tools/executor should leak a little memory over time. (A minimal illustration of the suspected pattern follows below.)
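
A minimal, hypothetical sketch of that pattern (not the library code): listeners added to one long-lived AbortSignal on every execution and never removed will accumulate:

const controller = new AbortController();

function executeOnce(signal: AbortSignal) {
  // registered on every execution...
  signal.addEventListener('abort', () => {
    /* cancel in-flight work */
  });
  // ...but without a matching removeEventListener, a shared,
  // never-aborted signal retains every one of these closures
}

for (let i = 0; i < 10_000; i++) {
  executeOnce(controller.signal); // 10,000 listeners now retained
}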

Expected behavior

No leaking listeners

Environment:

  • OS: Arch Linux, Kernel 6.17
  • @graphql-tools/...:
  • "@graphql-tools/executor-http": "3.0.4",
  • "@graphql-tools/merge": "9.1.1",
  • "@graphql-tools/schema": "10.0.25",
  • "@graphql-tools/stitch": "10.0.2",
  • "@graphql-tools/utils": "10.9.1",
  • "@graphql-tools/wrap": "11.0.1",
  • "graphql": "16.11.0",
  • "graphql-fields": "2.0.3",
  • "graphql-scalars": "1.24.2",
  • "graphql-tag": "2.12.6",
  • "graphql-yoga": "5.16.0",
  • Node.js: 24.9.0

Additional context

Sorry that I didn't provide tests or a reproduction repo (yet). Tests may be very difficult to write, and if there really is a leaking-listener problem, one should be able to reproduce the issue (with the Node.js heap profiler and a Chrome DevTools heap snapshot comparison) with every program using @graphql-tools/executor.

Maybe I am totally wrong here and the behavior I see in the snapshot comparison is normal.

gunters63 avatar Oct 08 '25 15:10 gunters63

First of all, thank you for creating the issue! It would be great if we could get some bare-minimum setup that reproduces the issue, because we have different test suites to check for potential memory leaks, both in this repo and in other projects that use Yoga:

  • Tools -> https://github.com/ardatan/graphql-tools/blob/master/.github/workflows/tests.yml#L115
  • Yoga -> https://github.com/graphql-hive/graphql-yoga/blob/main/.github/workflows/ci.yml#L128
  • Hive Gateway -> https://github.com/graphql-hive/gateway/blob/main/.github/workflows/memtest.yml

About that event listener: as you can see in the code, the listener is registered when signal is set in the execution parameters, here -> https://github.com/ardatan/graphql-tools/blob/master/packages/executor/src/execution/execute.ts#L499. And GraphQL Yoga sets it only when this plugin is used (https://the-guild.dev/graphql/yoga-server/docs/features/execution-cancellation), here -> https://github.com/graphql-hive/graphql-yoga/blob/main/packages/graphql-yoga/src/plugins/use-execution-cancellation.ts#L10. So it is disabled by default. That's why it'd be better if we learned more about your setup. In the meantime, you can disable this plugin (if you use it) until we figure the issue out.
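
For reference, the plugin is opt-in. A minimal sketch of enabling it, following the Yoga docs linked above (omitting the plugin from plugins disables execution cancellation):

import { createSchema, createYoga, useExecutionCancellation } from 'graphql-yoga';

const schema = createSchema({
  typeDefs: /* GraphQL */ `
    type Query {
      hello: String
    }
  `,
  resolvers: { Query: { hello: () => 'world' } },
});

// Opt-in: remove useExecutionCancellation() from plugins to disable the feature.
const yoga = createYoga({ schema, plugins: [useExecutionCancellation()] });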

I also didn't see the version of @graphql-tools/executor you have in your codebase, so maybe you can check whether it is the latest by using a command like yarn why @graphql-tools/executor (or whatever variant your package manager provides).

cc @enisdenjo

ardatan avatar Oct 08 '25 15:10 ardatan

Thanks for the quick reply. That is a lot of useful information. Maybe I can now create a simple example server that reproduces the error.

Yes, I use useExecutionCancellation(); I will retry the leak test tomorrow without it. It is experimental anyway, and I don't really remember why I activated it in the first place.

Here is the why output:

❯ pnpm why @graphql-tools/executor
Legend: production dependency, optional only, dev only

dependencies:
@graphql-tools/stitch 10.0.2
├─┬ @graphql-tools/batch-delegate 10.0.1
│ └─┬ @graphql-tools/delegate 11.0.1
│   └── @graphql-tools/executor 1.4.9
├─┬ @graphql-tools/delegate 11.0.1
│ └── @graphql-tools/executor 1.4.9
├── @graphql-tools/executor 1.4.9
└─┬ @graphql-tools/wrap 11.0.1
  └─┬ @graphql-tools/delegate 11.0.1
    └── @graphql-tools/executor 1.4.9
@graphql-tools/wrap 11.0.1
└─┬ @graphql-tools/delegate 11.0.1
  └── @graphql-tools/executor 1.4.9
@graphql-yoga/render-graphiql 5.16.0
└─┬ graphql-yoga 5.16.0 peer
  └── @graphql-tools/executor 1.4.9
graphql-yoga 5.16.0
└── @graphql-tools/executor 1.4.9

gunters63 avatar Oct 08 '25 18:10 gunters63

Ok, I did two more tests now:

First I stressed it with bombardier using a simple query like:

bombardier -c 3 -d 60s -l -k -H 'content-type: application/json' -m POST -b '{"query":"{findManyLogin {Id} }"}' https://127.0.0.1:4400/graphql/drill.0

This also reproduced the leak with a simple query (I used a custom script with urql and mutations before).

Then I commented out useExecutionCancellation() in my code and re-ran the test. Bombardier made about 40,000 requests in one minute.

This was the result:

Image

Absolutely no positive deltas anymore; the leak was gone.

gunters63 avatar Oct 09 '25 08:10 gunters63

Thank you for digging into this, we very much appreciate it! I will try to reproduce this leak on our side and will come back to you with my findings.

enisdenjo avatar Oct 09 '25 09:10 enisdenjo

Ok, I tried a lot of things, but I could not get it to leak. 🤔

I made a PR adding a memtest with execution cancellation for both Hive Gateway (powered by Yoga) and Yoga itself. No leaks reported. The memtests work the same way you tested: they do a few runs of load testing, GC in between runs, take heap snapshots, and then compare the deltas of each snapshot.

I did my local testing matching your Node version: 24.9.0.

You can run the memtests locally yourself (make sure you have k6 installed) with:

MEMTEST=clean,keepheapsnaps,short yarn test:mem run execution-cancellation -t yoga

If it passes, you can double-check by loading the e2e/execution-cancellation/*.heapsnapshot files, in order, into your DevTools and comparing them as you normally would.

Maybe the Yoga setup for the test is too simple, or maybe there is something else happening.

If the tests pass for you locally too, we would kindly ask you to give more insights or create a reproduction so that we can test exactly that scenario. Thanks for helping us debug this!

enisdenjo avatar Oct 09 '25 18:10 enisdenjo

Ok, I got this to run on my dev machine.

I will gradually expand yoga-server.ts with my setup now until I (hopefully) can reproduce the problem.

gunters63 avatar Oct 10 '25 08:10 gunters63

The first dependency from my service I added was fastify. So I modified yoga-server.ts from the test (see the Fastify example):

import fastify, { FastifyReply, FastifyRequest } from 'fastify';
import {
  createSchema,
  createYoga,
  useExecutionCancellation,
  YogaInitialContext,
} from 'graphql-yoga';

export const schema = createSchema({
  typeDefs: /* GraphQL */ `
    type Query {
      hello: String
    }
  `,
  resolvers: {
    Query: {
      hello: () => 'world',
    },
  },
});

// This is the fastify instance you have created
const server = fastify({ logger: true });
// This will allow Fastify to forward multipart requests to GraphQL Yoga
server.addContentTypeParser('multipart/form-data', {}, (_req, _payload, done) => done(null))

// Create a Yoga instance with a GraphQL schema.
const yoga = createYoga<{
  req: FastifyRequest;
  reply: FastifyReply;
} | YogaInitialContext>({
  schema,
  plugins: [useExecutionCancellation()],
  logging: {
    debug: (...args) => args.forEach((arg) => server.log.debug(arg)),
    info: (...args) => args.forEach((arg) => server.log.info(arg)),
    warn: (...args) => args.forEach((arg) => server.log.warn(arg)),
    error: (...args) => args.forEach((arg) => server.log.error(arg)),
  },
});

server.route({
  // Bind to the Yoga's endpoint to avoid rendering on any path
  url: yoga.graphqlEndpoint,
  method: ['GET', 'POST', 'OPTIONS'],
  handler: (req, reply) =>
    yoga.handleNodeRequestAndResponse(req, reply, {
      req,
      reply,
    }),
});

// Start the server and you're done!
server.listen({ port: parseInt(process.env['PORT']!) }, (_error, serverUrl) => {
  console.info(`Server is running on ${serverUrl}${yoga.graphqlEndpoint}`);
});

Unfortunately, when I ran

MEMTEST=clean,keepheapsnaps,short yarn test:mem run execution-cancellation -t Yoga

I got an error from the test framework:

time="2025-10-10T11:24:16+02:00" level=error msg="test aborted: Status is not 200, got status 415 and body:\n{\"statusCode\":415,\"code\":\"FST_ERR_CTP_INVALID_MEDIA_TYPE\",\"error\":\"Unsupported Media Type\",\"message\":\"Unsupported Media Type: application/x-www-form-urlencoded\"} at default (file:///home/gunter/tmp/gateway/internal/perf/src/graphql-loadtest-script.ts:27:24(70))"

Not sure what causes this; if I start yoga-server.ts directly, it behaves as expected.

gunters63 avatar Oct 10 '25 09:10 gunters63

The next dependencies I'll try to add (if Fastify doesn't trigger the memory leak) are:

  • @graphql-tools/stitch
  • @graphql-tools/wrap
  • @graphql-tools/executor-http

gunters63 avatar Oct 10 '25 09:10 gunters63

You can modify internal/perf/src/graphql-loadtest-script.ts to match your server's requirements. It could be that http.post somehow sends the wrong header, assuming a form post instead of JSON. Are you running the latest k6?

enisdenjo avatar Oct 10 '25 09:10 enisdenjo

Did this to ensure content-type JSON; can you try with those changes?

enisdenjo avatar Oct 10 '25 09:10 enisdenjo

Just found this myself, yes running now :)

Results:

Image

gunters63 avatar Oct 10 '25 09:10 gunters63

It doesn't seem to be the same memory leak though (string, concatenated-string).

gunters63 avatar Oct 10 '25 09:10 gunters63

And this leak is still there when I remove the useExecutionCancellation plugin.

I continue adding more dependencies now.

gunters63 avatar Oct 10 '25 09:10 gunters63

(string) and (concatenated-string) are not really what we're looking for here. The former is just a standard string allocated in memory, and the latter is a special V8 representation used to speed up string concatenation. It's weird, but not really something to worry about; it should not have anything to do with the issue at hand.

Keep us updated!

enisdenjo avatar Oct 10 '25 10:10 enisdenjo

I didn't have much time, but today I made a little experiment. I started my server and made a few simple queries.

I put break points in the execute function here:

Image

During a query the code only passed through the line at (2); it never passed through (1). So, if the signal never receives the abort request and the signal is the same every time, doesn't that mean event listeners are leaking?
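
Based on those breakpoints, the pattern in question looks roughly like this (a paraphrased sketch, not the actual library source; the numbered markers are the two breakpoint positions):

if (signal) {
  const abortListener = () => {
    // (1) abort path: never reached in these experiments
  };
  // (2) always reached when a signal is provided
  signal.addEventListener('abort', abortListener);
  // if no matching removeEventListener runs for a long-lived signal,
  // each execution leaves one more listener behind
}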

I will continue my experiments when I have more time :)

gunters63 avatar Oct 14 '25 07:10 gunters63

A little more debugging; after 6 requests:

Image

Indeed, the same signal is used every time and the listeners are piling up.

gunters63 avatar Oct 14 '25 08:10 gunters63

Btw, the stack trace down to this place is veeery long, but I saw batching (which I activated) and some proxying stuff (I use stitching) on the way down. Maybe I'll add this to the leak test.

gunters63 avatar Oct 14 '25 08:10 gunters63

Hey @gunters63, just circling back! Any chance you got those into the test, or can you share the leaking setup? I'd like to handle this as soon as possible. Thanks!

enisdenjo avatar Oct 17 '25 09:10 enisdenjo

Yeah, hopefully I have some time later today

gunters63 avatar Oct 17 '25 12:10 gunters63

So I started yoga-server.ts in the execution-cancellation directory manually with node --import tsx yoga-server.ts and made some requests. I noticed that here, too, it never called signal.removeEventListener.

So, by my theory, it should also leak event listeners here, and I wondered why it doesn't.

Then I saw the following in the stack-trace:

Image

So the memory leak test doesn't seem to use a real AbortController/AbortSignal, but some kind of test helper. Maybe this test helper does not leak when event listeners are added but never removed?

I didn't find an easy way to run the memory leak test with a real AbortController/AbortSignal.
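
If that theory holds, a test helper along these lines (purely hypothetical, not the actual helper) would mask the leak, because nothing is retained when a listener is added:

// Hypothetical stand-in signal: addEventListener retains nothing,
// so piled-up listeners would never show in a heap snapshot.
const fakeSignal = {
  aborted: false,
  addEventListener: () => {},
  removeEventListener: () => {},
} as unknown as AbortSignal;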

gunters63 avatar Oct 19 '25 10:10 gunters63

I am getting close :)

I put the following debugging code directly into execute.js:

Image

and, of course, a let lastSignal = null at the beginning of the module.
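
Roughly, the injected debug code (reconstructed here from the description and the log output below, not the exact patch) looks like this:

// at module scope in execute.js
let lastSignal = null;

// inside the execute path, right where the listener is registered
if (signal) {
  console.log(lastSignal === signal ? 'same signal' : 'different signal');
  lastSignal = signal;
  console.log('adding listener');
  // ...the existing signal.addEventListener('abort', ...) call follows here
}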

When I made requests to the yoga-server.ts in the execution-cancellation directory, I got this output:

❯ PORT=4000 node --inspect --import tsx yoga-server.ts
Debugger listening on ws://127.0.0.1:9229/23e8c889-a6a4-40aa-8b65-a618e99a4cb5
For help, see: https://nodejs.org/en/docs/inspector
Server is running on http://localhost:4000/graphql
different signal
adding listener
different signal
adding listener
different signal
adding listener
different signal
adding listener

When I made some requests on my yoga server, I got this:

different signal
adding listener
same signal
adding listener
same signal
adding listener
same signal
adding listener
same signal
adding listener

Somehow, in my server the same signal is used every time, and that leaks the event listeners.
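
One way to confirm the pile-up from a debugger session (assuming Node.js; getEventListeners works on EventTargets such as AbortSignal since Node 15):

import { getEventListeners } from 'node:events';

const controller = new AbortController();
controller.signal.addEventListener('abort', () => {});
controller.signal.addEventListener('abort', () => {});

// prints 2; evaluated against the suspect signal at the breakpoint above,
// the count should grow with every request if listeners are leaking
console.log(getEventListeners(controller.signal, 'abort').length);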

gunters63 avatar Oct 20 '25 08:10 gunters63

I'll try to play around and add more stuff to the test yoga-server.ts; maybe I'll find the reason for the shared signal.

Btw, here is how I create the yoga server:

Image

gunters63 avatar Oct 20 '25 08:10 gunters63

I added some of those options to the test yoga-server in the execution-cancellation test directory, but it still doesn't re-use the signal. It also doesn't happen when I add fastify.

In my code I never use AbortSignal or AbortController, so I currently have no idea why in my server the same signal is used for every query.

gunters63 avatar Oct 20 '25 08:10 gunters63

It seems to have something to do with @graphql-tools/executor-http.

I have a 'global' route where I just serve a schema with Yoga. This route does use a different AbortController for every request:

different signal
adding listener
different signal
adding listener
different signal
adding listener

I have a few other routes where I stitch schemas for different instances of a Prisma server, using buildHTTPExecutor to forward the subschema requests.

If I use those routes:

adding listener
same signal
adding listener
same signal
adding listener
same signal

For this internal forwarding I use a line like this:

  const globalExec = buildHTTPExecutor({
    endpoint: urlPrefix + globalRouteOptions.url,
    // eslint-disable-next-line @typescript-eslint/unbound-method
    fetch: globalYoga.fetch, // use Yoga's fetch; this does not send an actual HTTP request!
  });

I also saw this in buildHTTPExecutor:

Image

Maybe this is the place where the single instance AbortController happens.
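
As a hypothetical illustration (not the actual buildHTTPExecutor source): an executor created once at startup that owned a single AbortController and handed its signal to every downstream execution would produce exactly the "same signal" log seen above:

const shared = new AbortController(); // created once with the executor

async function forward(execute: (signal: AbortSignal) => Promise<unknown>) {
  return execute(shared.signal); // every request receives the same signal
}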

I'll try to replicate this in the test.

gunters63 avatar Oct 22 '25 15:10 gunters63

I tried to reproduce the problem in a minimal server, so I modified yoga-server.ts like this:

import { buildHTTPExecutor } from '@graphql-tools/executor-http';
import { stitchSchemas } from '@graphql-tools/stitch';
import fastify, { FastifyReply, FastifyRequest } from 'fastify';
import {
  createSchema,
  createYoga,
  useExecutionCancellation,
  YogaInitialContext,
} from 'graphql-yoga';

export const schema = createSchema({
  typeDefs: /* GraphQL */ `
    type Query {
      hello: String
    }
  `,
  resolvers: {
    Query: {
      hello: () => 'world',
    },
  },
});

// This is the fastify instance you have created
const server = fastify({ logger: true });
// This will allow Fastify to forward multipart requests to GraphQL Yoga
server.addContentTypeParser('multipart/form-data', {}, (_req, _payload, done) =>
  done(null),
);

// Create a Yoga instance with a GraphQL schema.
const subYoga = createYoga<
  | {
      req: FastifyRequest;
      reply: FastifyReply;
    }
  | YogaInitialContext
>({
  schema,
  parserAndValidationCache: false,
  graphqlEndpoint: '/stitched'
});

server.route({
  // Bind to the Yoga's endpoint to avoid rendering on any path
  url: subYoga.graphqlEndpoint,
  method: ['GET', 'POST', 'OPTIONS'],
  handler: (req, reply) =>
    subYoga.handleNodeRequestAndResponse(req, reply, {
      req,
      reply,
    }),
});

const executor = buildHTTPExecutor({
  endpoint: `https://127.0.0.1:${process.env['PORT']}/stitched`,
  fetch: subYoga.fetch,
});

const stitchedSchema = stitchSchemas({
  subschemas: [
    {
      schema,
      executor,
    },
  ],
});

const mainYoga = createYoga<
  | {
      req: FastifyRequest;
      reply: FastifyReply;
    }
  | YogaInitialContext
>({
  schema: stitchedSchema,
  parserAndValidationCache: false,
  plugins: [useExecutionCancellation()],
  logging: {
    debug: (...args) => args.forEach((arg) => server.log.debug(arg)),
    info: (...args) => args.forEach((arg) => server.log.info(arg)),
    warn: (...args) => args.forEach((arg) => server.log.warn(arg)),
    error: (...args) => args.forEach((arg) => server.log.error(arg)),
  },
});

server.route({
  // Bind to the Yoga's endpoint to avoid rendering on any path
  url: mainYoga.graphqlEndpoint,
  method: ['GET', 'POST', 'OPTIONS'],
  handler: (req, reply) =>
    mainYoga.handleNodeRequestAndResponse(req, reply, {
      req,
      reply,
    }),
});

// Start the server and you're done!
server.listen({ port: parseInt(process.env['PORT']!) }, (_error, serverUrl) => {
  console.info(`Stitched server is running on ${serverUrl}${mainYoga.graphqlEndpoint}`);
});

This minimal server now has almost all the parts my real server has.

I still could not reproduce the problem, and the logging statements I added in @graphql-tools/executor-http told me that the AbortController was not shared:

different signal
adding listener
{"level":30,"time":1761308226175,"pid":159743,"hostname":"taubenus","reqId":"req-6","res":{"statusCode":200},"responseTime":1.806793000549078,"msg":"request completed"}
{"level":30,"time":1761308226348,"pid":159743,"hostname":"taubenus","reqId":"req-7","req":{"method":"POST","url":"/graphql","host":"localhost:4000","remoteAddress":"127.0.0.1","remotePort":35726},"msg":"incoming request"}
different signal
adding listener
{"level":30,"time":1761308226350,"pid":159743,"hostname":"taubenus","reqId":"req-7","res":{"statusCode":200},"responseTime":1.799499001353979,"msg":"request completed"}
{"level":30,"time":1761308226504,"pid":159743,"hostname":"taubenus","reqId":"req-8","req":{"method":"POST","url":"/graphql","host":"localhost:4000","remoteAddress":"127.0.0.1","remotePort":35726},"msg":"incoming request"}
different signal
adding listener

I have no idea why the AbortController is not shared here, while in my real server (which triggers the memory leak) it is. I give up for now, but I still think there is a lingering problem here.

gunters63 avatar Oct 24 '25 12:10 gunters63

Oh boy, that one is weird! Thank you for working on this; I followed your progress, and we will pick this up as soon as we can. I'll try to implement what you shared and get a repro. Your insights are helpful.

enisdenjo avatar Oct 27 '25 13:10 enisdenjo