
[BUG] Memory leak in Rust Durable Object eviction

Open lukevalenta opened this issue 8 months ago • 2 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

What version of workers-rs are you using?

0.5.0

What version of wrangler are you using?

4.10.0

Describe the bug

In Rust Workers, memory allocated for a Durable Object is not freed upon eviction. I'm able to reproduce locally with miniflare, and seem to have hit this same issue in a production worker (https://github.com/cloudflare/azul/issues/11).

Steps To Reproduce

With the following files:

cat Cargo.toml
[package]
name = "memory-leak"
version = "0.1.0"
edition = "2021"

[package.metadata.release]
release = false

# https://github.com/rustwasm/wasm-pack/issues/1247
[package.metadata.wasm-pack.profile.release]
wasm-opt = false

[lib]
crate-type = ["cdylib"]

[dependencies]
worker = { version = "0.5.0" }

cat wrangler.jsonc
{
	"name": "memory-leak",
    "main": "build/worker/shim.mjs",
	"build": {
		"command": "cargo install -q worker-build && worker-build --release"
	},
	"compatibility_date": "2025-04-08",
	"migrations": [
		{
			"new_sqlite_classes": [
				"MyDurableObject"
			],
			"tag": "v1"
		}
	],
	"durable_objects": {
		"bindings": [
			{
				"class_name": "MyDurableObject",
				"name": "MY_DURABLE_OBJECT"
			}
		]
	},
	"observability": {
		"enabled": true
	}
}

cat src/lib.rs
use wasm_bindgen::prelude::*;
#[allow(clippy::wildcard_imports)]
use worker::*;

#[event(fetch, respond_with_errors)]
async fn main(_req: Request, env: Env, _ctx: Context) -> Result<Response> {
    let ns = env.durable_object("MY_DURABLE_OBJECT")?;
    let id = ns.id_from_name("foo")?;
    let stub = id.get_stub()?;
    stub.fetch_with_str("http://example.com").await
}

#[durable_object]
struct MyDurableObject {
    count: u64,
    _buffer: Vec<u8>,
}

#[durable_object]
impl DurableObject for MyDurableObject {
    fn new(_state: State, _env: Env) -> Self {
        Self {
            count: 0,
            _buffer: Vec::with_capacity(100_000_000),
        }
    }
    #[allow(clippy::unused_async)]
    async fn fetch(&mut self, _req: Request) -> Result<Response> {
        self.count += 1;
        Response::ok(format!("hello {}", self.count))
    }
}

Then, run npx wrangler dev and press the d key to open devtools. Take a heap memory snapshot, then run the following in another terminal window to repeatedly re-initialize the DO (it is evicted after 10s of inactivity):

while true; do curl http://localhost:8787; sleep 10; done

Taking a memory snapshot after each request, we can see heap usage continuously increasing by 100MB: Image

Now, with a JavaScript worker, we can see that there is no leak:

cat package.json
{
	"name": "memory-leak-js",
	"version": "0.0.0",
	"private": true,
	"scripts": {
		"deploy": "wrangler deploy",
		"dev": "wrangler dev",
		"start": "wrangler dev"
	},
	"devDependencies": {
		"wrangler": "^4.9.1"
	}
}
cat wrangler.jsonc
{
	"$schema": "node_modules/wrangler/config-schema.json",
	"name": "memory-leak-js",
	"main": "src/index.js",
	"compatibility_date": "2025-04-08",
	"migrations": [
		{
			"new_sqlite_classes": [
				"MyDurableObject"
			],
			"tag": "v1"
		}
	],
	"durable_objects": {
		"bindings": [
			{
				"class_name": "MyDurableObject",
				"name": "MY_DURABLE_OBJECT"
			}
		]
	},
	"observability": {
		"enabled": true
	}
}
cat src/index.js
import { DurableObject } from "cloudflare:workers";

export class MyDurableObject extends DurableObject {
	constructor(ctx, env) {
		super(ctx, env);
		this.count = 0;
		this.buffer = new Uint8Array(100_000_000)
	}

	async fetch() {
		this.count += 1
		return new Response('hello ' + this.count);
	}
}

export default {
	async fetch(request, env, ctx) {
		console.log('fetch');
		const id = env.MY_DURABLE_OBJECT.idFromName("foo");
		const stub = env.MY_DURABLE_OBJECT.get(id);
		return await stub.fetch("http://example.com");
	},
};

In devtools, I see this memory usage. The final snapshot is after I allowed the DO to be fully evicted (killing the script from earlier):

Image

lukevalenta avatar Apr 14 '25 17:04 lukevalenta

Turns out this is a known bug, and should be solved soon when support for FinalizationRegistry is added.

See https://github.com/cloudflare/workers-rs/pull/653 for one possible workaround, although it didn't seem to help in the example above.

lukevalenta avatar Apr 14 '25 18:04 lukevalenta

Following up, it doesn't look like Finalization Registry support changed anything here. I'm still able to reproduce this with worker 0.6.5 and with compatibility date set to today: https://github.com/lukevalenta/workers-rs/tree/lvalenta/do-alarm-memory-leak/do-heap-memory-leak.

Image

lukevalenta avatar Sep 23 '25 17:09 lukevalenta

Thank you for the extensive writeup and details, @lukevalenta. Question: Have you confirmed this memory leak occurs when deployed to Cloudflare production, or only locally with miniflare? Reason: I've encountered several DO anomalies that only manifest locally but work fine in production, so I'm curious if this is another case of local/production divergence.

If you've verified this in production (e.g., with the cloudflare/azul#11 worker you mentioned), that's definitely critical to know. - Thanks in advance.

PeterMHammond avatar Sep 23 '25 17:09 PeterMHammond

Hi @PeterMHammond yes, I've confirmed that this happens in a deployed worker, although you need to extend the timeout a little longer than 10s to get the Durable Object to hibernate (15s seems to work). Repro code is at https://github.com/lukevalenta/workers-rs/tree/lvalenta/do-alarm-memory-leak/do-heap-memory-leak and the deployed worker at https://do-heap-memory-leak.luke-valenta.workers.dev/.

I've also found a slower memory leak that I can trigger by spinning an alarm in a 1s loop: https://github.com/lukevalenta/workers-rs/tree/lvalenta/do-alarm-memory-leak/do-alarm-memory-leak. I've been able to confirm both via internal metrics and with npx wrangler dev and the devtools memory profile (see heap usage increasing in image below):

Image

I'm following up with the team internally on both of these.

lukevalenta avatar Sep 23 '25 17:09 lukevalenta

Understood @lukevalenta - thank you very much for the confirmation. I use Durable Object alarms extensively, so this is a big concern: we're going live 10/25 with our new implementation, and if I understand this correctly, it could have a very bad impact on our billing?

PeterMHammond avatar Sep 23 '25 18:09 PeterMHammond

Hey @PeterMHammond. EM for Durable Objects here. From our pricing page, we state that:

Duration billing charges for the 128 MB of memory your Durable Object is allocated, regardless of actual usage. If your account creates many instances of a single Durable Object class, Durable Objects may run in the same isolate on the same physical machine and share the 128 MB of memory. These Durable Objects are still billed as if they are allocated a full 128 MB of memory.

The presence of a memory leak is something that we take seriously, but the impact on customers is that the memory available to an application will decrease over time, not that customers' bills will grow. Once the available 128 MB of memory is exhausted, the isolate will be condemned and your application will get a fresh isolate on a subsequent request.

joshthoward avatar Sep 23 '25 18:09 joshthoward

Once the available 128 MB of memory is exhausted, the isolate will be condemned and your application will get a fresh isolate on a subsequent request.

Always love when my isolate is condemned - what does that mean in practice?

Using a very bad example of a clock ticker - sending SSE updates to the client each second - I assume an active alarm that is updating my UX would just stop functioning when it gets "condemned", with the user having no way to know why or how to restore it. The end user would think my super awesome clock is frozen, and my billion dollar dream of selling my AI awesome clock is dead (had to put AI in it to sell the investors)?

PeterMHammond avatar Sep 23 '25 18:09 PeterMHammond

@PeterMHammond speaking from experience as a DO user and not authoritatively: when the isolate hits its memory limit, the Durable Object will be re-initialized (and new called again) within a very short period of time. Durable Objects get evicted for all sorts of reasons (e.g., the metal it's running on needs to be rebooted), and this is mostly transparent to users, but in-flight requests might be cancelled and alarms might be slightly delayed.

From the docs:

Alarms have guaranteed at-least-once execution and are retried automatically when the alarm() handler throws.

Retries are performed using exponential backoff starting at a 2 second delay from the first failure with up to 6 retries allowed.
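
For concreteness, a rough sketch of what that retry schedule looks like, assuming the delay doubles on each retry (an assumption on my part; the docs quoted above only state the 2-second starting delay and the 6-retry cap):

// Hypothetical sketch of the alarm retry schedule described above.
// Assumption: the delay doubles on each retry; the quoted docs only state a
// 2 s starting delay and a cap of 6 retries.
fn alarm_retry_delays_secs() -> Vec<u64> {
    (0..6).map(|attempt| 2u64 << attempt).collect() // [2, 4, 8, 16, 32, 64]
}

Under that assumption, a persistently failing alarm handler is retried over roughly the next two minutes before the retries are exhausted.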

lukevalenta avatar Sep 23 '25 18:09 lukevalenta

Alarms have guaranteed at-least-once execution and are retried automatically

@lukevalenta - I understand the retry behavior, but correct me if I'm wrong: if a DO is evicted during an alarm handler execution (before it can schedule the next alarm), that next alarm will never be set, so the DO won't be re-initialized. My understanding:

  • DOs can only have 1 alarm at a time
  • When an alarm fires, I must schedule the next one within the handler
  • The exponential backoff/guaranteed execution only applies if the next alarm has been properly set - not if the DO is evicted mid-execution

So if eviction happens between the alarm firing and my code scheduling the next alarm, the chain breaks. Is that correct?

PeterMHammond avatar Sep 23 '25 20:09 PeterMHammond

Hi @PeterMHammond - The word 'condemned' is one we use internally but really has no bearing on what you experience as a user. It's not as scary as it sounds! It really just means that we reset it. In the case of a 'normal' Worker (not a Durable Object) you generally won't even notice anything happening. We typically allow requests which are in flight to 'drain' and we spin up a new isolate in the background to receive new requests.

With Durable Objects it's slightly, but not much, more complicated because in order to guarantee uniqueness, we only allow a single instance of a given object to exist at once. This means at some point it will have to reset.

The Durable Objects programming model makes this very easy to deal with. Specifically, if your object resets for any reason (memory limit, machine failure, meteorite, etc.) midway through execution, anything mutated that has not yet been observed externally (for example through I/O) is rolled back and, in the case of an alarm handler, retried. Think of this just like a database transaction.

So in your alarm handler example, if the Durable Object resets before arming the alarm again, the alarm handler will just be invoked again.

https://blog.cloudflare.com/durable-objects-easy-fast-correct-choose-three/ https://developers.cloudflare.com/durable-objects/api/alarms/
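
To make the re-arming pattern concrete, here is a minimal workers-rs-style sketch (not code from this thread) of a self-perpetuating alarm, written in the same style as the repro above. The Ticker name, the "ticks" storage key, and the assumption that set_alarm accepts a Duration offset are illustrative, not confirmed API details:

use std::time::Duration;
use worker::*;

#[durable_object]
struct Ticker {
    state: State,
}

#[durable_object]
impl DurableObject for Ticker {
    fn new(state: State, _env: Env) -> Self {
        Self { state }
    }

    async fn fetch(&mut self, _req: Request) -> Result<Response> {
        // Arm the first alarm if one isn't already pending. (Assumption:
        // set_alarm accepts a Duration offset convertible to ScheduledTime.)
        if self.state.storage().get_alarm().await?.is_none() {
            self.state.storage().set_alarm(Duration::from_secs(1)).await?;
        }
        Response::ok("ticking")
    }

    async fn alarm(&mut self) -> Result<Response> {
        // Do the periodic work, then immediately re-arm. If the object resets
        // before this handler commits, the writes are rolled back and the
        // handler is invoked again, so the alarm chain resumes rather than
        // breaking.
        let ticks: u64 = self.state.storage().get("ticks").await.unwrap_or(0);
        self.state.storage().put("ticks", ticks + 1).await?;
        self.state.storage().set_alarm(Duration::from_secs(1)).await?;
        Response::ok(format!("tick {}", ticks + 1))
    }
}

Under the transactional model described above, the put and the set_alarm in the alarm handler are committed together or the handler is retried, so there is no window in which the chain silently stops.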

byule avatar Sep 23 '25 23:09 byule

So in your alarm handler example, if the Durable Object resets before arming the alarm again, the alarm handler will just be invoked again.

That makes perfect sense and is fantastic news, and a big relief! Sincere thanks to you, @guybedford and @lukevalenta, as I see there is already a fix going out - so this is all fantastic. We made a big bet on DOs as well as Vectorize, KV, D1, and more, so we are very excited that this won't jeopardize our V1 rollout on 10/25!

'condemned' - It's not as scary as it sounds!

Note to self: my 'condemned isolate' joke needs work 😅 But genuinely appreciate the clarification!

PeterMHammond avatar Sep 24 '25 00:09 PeterMHammond

I was able to verify and resolve a memory leak here in https://github.com/cloudflare/workers-rs/pull/822, and have released this in the latest 0.6.6 and worker-build 0.1.9.

Thank you to @lukevalenta for helping to track this down.

@PeterMHammond we're very much prioritising ongoing maintenance here, glad to help, and please do report any further issues.

guybedford avatar Sep 24 '25 00:09 guybedford

Just following up, I updated the example at https://github.com/lukevalenta/workers-rs/tree/f2899fe14a37cd6bce81edc6f08d4f1b7819fec1/do-heap-memory-leak with worker 0.6.6 and reduced the buffer size from 100MB to 10MB to give more granularity in the memory profile. Memory usage plateaus at about 70MB (maybe reaching some equilibrium with garbage collection, etc.).

Given that, it looks like this issue is resolved so I'll go ahead and close this out. Thanks @guybedford!

(The alarm loop memory leak still seems to be at large, and I'm currently writing up a new issue for that.)

lukevalenta avatar Sep 24 '25 19:09 lukevalenta

I think I closed this prematurely. @guybedford would you be able to re-open? (I don't have permissions. EDIT: nevermind, looks like I do have the right permissions)

I thought memory usage plateaued at 70MB, but after longer observation it keeps going up until the worker hits the memory limit. I'm seeing this in local dev too.

lukevalenta avatar Sep 24 '25 20:09 lukevalenta

@lukevalenta are you able to confirm the increments here match the heap allocation size so that it is this same leak?

guybedford avatar Sep 24 '25 21:09 guybedford

@lukevalenta are you able to confirm the increments here match the heap allocation size so that it is this same leak?

Yep, the increments match the heap allocation size -- 10MB in this case. There are also some smaller increments from https://github.com/cloudflare/workers-rs/issues/827 too.

From the internal dashboard for the worker I have deployed at https://do-heap-memory-leak.crypto-team.workers.dev/:

Image

lukevalenta avatar Sep 24 '25 21:09 lukevalenta

@lukevalenta do you have a local replication of this case the same as previously?

guybedford avatar Sep 24 '25 21:09 guybedford

@lukevalenta do you have a local replication of this case the same as previously?

Yep, I can replicate locally.

lukevalenta avatar Sep 24 '25 22:09 lukevalenta

After examining this more closely, it turns out the leak is resolved and there isn't a residual leak here. Instead what is happening is:

  1. Allocations increase the Wasm heap size by allocating extra pages
  2. GC isn't triggered eagerly, so allocations continue to add up
  3. When GC finally runs, Wasm memory is freed but pages are not deallocated (Wasm doesn't support memory.shrink)
  4. So the memory bound increases with GC "infrequency", but it is still an upper bound and not formally a leak

We could add eager freeing behaviour for this specific case of hibernation and reinitialization, but I'm also hesitant to add GC exceptions unless this case justifies it, since formally this is a natural part of the GC model.
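
As a rough back-of-the-envelope illustration of that model (the page arithmetic below is mine, not from the thread): Wasm linear memory is sized in 64 KiB pages, so the 100 MB buffer from the original repro alone accounts for about 1,526 pages, and those pages stay part of the instance's linear memory even after Rust drops the Vec.

// Back-of-the-envelope illustration of why the Wasm heap only grows.
// Wasm linear memory is sized in 64 KiB pages; memory.grow can add pages,
// but there is no memory.shrink, so pages obtained for a large allocation
// stay mapped even after the Rust value is dropped (the allocator reuses
// them internally rather than returning them to the runtime).
const WASM_PAGE_BYTES: usize = 64 * 1024;

fn pages_needed(alloc_bytes: usize) -> usize {
    alloc_bytes.div_ceil(WASM_PAGE_BYTES)
}

fn main() {
    // The repro's Vec::with_capacity(100_000_000) needs ~1526 pages (~95 MiB),
    // so only a few generations surviving until a lazy GC pass is enough to
    // push the instance toward the 128 MB isolate limit.
    println!("{} pages", pages_needed(100_000_000)); // 1526 pages
}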

guybedford avatar Sep 26 '25 00:09 guybedford

I've posted a memory improvement here (not formally a leak fix) that can reduce the memory footprint of Rust workers subject to hibernation: https://github.com/cloudflare/workers-rs/pull/832

guybedford avatar Sep 26 '25 01:09 guybedford

Closing as resolved, but will continue to investigate the possible memory optimization here in https://github.com/cloudflare/workers-rs/pull/832.

guybedford avatar Sep 26 '25 01:09 guybedford