sentry-javascript
Document how to write a standalone AWS Lambda tunnel
Problem Statement
Background
We are running Sentry in production and have been for a while now (~2 years or so). Over the course of the last year, we have moved our web apps to use AWS Lambda for the server (from AWS ECS + Fargate).
To avoid trouble with ad-blockers, we have configured tunnelling. Maybe you can see where this is going. 😄
The Problem
Tunnelling creates a route on the server, which in this case means that events are redirected to the server running on AWS Lambda. That's fine in principle–lambda invocations are very cheap! However, the problem is that there are a lot of events and they can be very bursty, which means that they can "clog up" all the warm lambdas quite easily.
Anyone who has tried deploying a NextJS server in lambda knows that the cold-start times are punishing. There are ways to mitigate this (that we employ) such as provisioned concurrency and orchestrating another fleet of lambdas to regularly ping the server lambda to keep N copies warm. That's all fine, but it's all for nothing when Sentry events hog all the warm lambdas anyway.
Solution Brainstorm
What I'd like to see is an option to configure Sentry to send events in batches, especially in combination with server-side tunnelling.
Batches could be defined in various ways but the two that seem most useful to me are:
- number of events, e.g. every 10 events
- a maximum duration, e.g. at least every 30s (regardless of how many there are)
Having these as either/or choices would be a significant gain that could save us a lot of cold starts, time and money every month. If we could use both together, e.g. send after 10 events or after 30 seconds, that would be even better.
The Sentry client would be responsible for holding the events in memory until it has dispatched them.
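The either/or batching proposal above could be sketched roughly like this (a hypothetical illustration, not part of the Sentry SDK; `EventBatcher` and its `send` callback are made-up names):

```javascript
// Hypothetical client-side batcher: flushes after maxEvents events OR after
// maxDelayMs since the first buffered event, whichever happens first.
class EventBatcher {
  constructor(send, { maxEvents = 10, maxDelayMs = 30000 } = {}) {
    this.send = send;           // callback that ships one batch, e.g. a fetch to the tunnel
    this.maxEvents = maxEvents;
    this.maxDelayMs = maxDelayMs;
    this.events = [];
    this.timer = null;
  }

  add(event) {
    this.events.push(event);
    if (this.events.length >= this.maxEvents) {
      this.flush();
    } else if (!this.timer) {
      // The first event in an empty buffer starts the clock
      this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
    }
  }

  flush() {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.events.length > 0) {
      this.send(this.events.splice(0));
    }
  }
}
```

Using both triggers together, as suggested above, just means the size trigger bounds request payloads during bursts while the timer bounds latency during quiet periods.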
Hi @jxdp thanks for writing in! Adding some kind of batch-processing to the SDK would require a lot of core functionality changes which I don't think we're going to add in the near to medium future. Besides, even if an SDK batched events instead of sending them out one by one, you'd still receive a lot of requests from the individual clients. So I'm not sure if this would solve the problem reliably.
~Instead, I'm wondering (definitely no serverless expert, so take it with a grain of salt) if setting up the tunnel as a serverless function is a good approach in general~ (Update: I guess it sort of makes sense but see my message below for clarifying questions)
Talked a bit about this with the team and I have a few additional questions:
- Did you define your tunnel as a server/API route within your NextJS project or is it its "own" lambda that's hosted separately from your NextJS app?
- Can you show us how your tunnel route forwards the request to Sentry?
Maybe the solution here is to provide an example for a generic (non-Next.js) Lambda function that acts as a tunnel in our tunnel examples repo.
One more question: Did you configure the tunnel just for the client or also for the server (i.e. the part that gets converted into lambdas)? It's only necessary on the client as server events shouldn't be ad-blocked. Meaning, your server setup shouldn't need a tunnel.
@Lms24 Thanks for your thoughts and comments!
Understood that it isn't on the current roadmap, but hey, at least now you know of at least 1 customer who'd appreciate the feature. 😄
It's a fair point that even if we batched, we'd still receive a lot of batches (one per client), but it would still offer a significant reduction during heavy load:
- batched: N clients * 1 request per batch of M events = N requests
- unbatched: N clients * 1 request per event * M events = N*M requests
For example, 100 clients each producing 20 events would mean 100 requests instead of 2,000.
Our config is as shown below. We aren't doing anything fancy or special–we are relying on the Sentry SDK to create the tunnel route via the next.config.js Sentry options (we don't specify anything extra in the client config and it "just works").
To package the server for deployment we use open-next. This takes the NextJS build output and wraps it in a lambda handler–it behaves exactly like any other NextJS server.
next.config.js
module.exports = withSentryConfig(
  {
    ...nextConfig,
    sentry: {
      autoInstrumentAppDirectory: false,
      autoInstrumentMiddleware: false,
      autoInstrumentServerFunctions: false,
      hideSourceMaps: true,
      widenClientFileUpload: true,
      tunnelRoute: process.env.SENTRY_TUNNEL,
    },
  },
  {
    authToken: process.env.SENTRY_AUTH_TOKEN,
    org: process.env.SENTRY_ORG,
    project: process.env.SENTRY_PROJECT,
    deploy: {env: process.env.NEXT_PUBLIC_TARGET_ENV},
    dryRun: process.env.NEXT_PUBLIC_TARGET_ENV !== "production" && process.env.NEXT_PUBLIC_TARGET_ENV !== "staging",
    // debug: process.env.NEXT_PUBLIC_TARGET_ENV === "development",
  },
);
sentry.client.config.js
init({
  dsn: DSN,
  enabled: ENABLED,
  environment: ENVIRONMENT,
  replaysSessionSampleRate: 0,
  replaysOnErrorSampleRate: 1,
  integrations: [
    new Replay({maskAllInputs: true}),
    new BrowserTracing({tracingOrigins: ["localhost", /cloudfront\.net/]}),
    new CaptureConsole({levels: ["error"]}),
    new ExtraErrorData({depth: 10}),
  ],
  tracesSampler: (samplingContext) => !!samplingContext.parentSampled || TRACES_SAMPLE_RATE,
});
instrumentOutgoingRequests();
I like your suggestion to use a separate lambda for the tunnelling endpoint. I would like to explore that possibility in a bit more detail. I think this would cure any UX issues arising from Sentry-event-related cold starts (a major pain).
What would this require? Off the top of my head, we'd need to deploy a separate lambda with the handling logic that the Sentry SDK currently sets up for us and make it accessible on the same domain, say tunnel.mycompany.com. Does the NodeJS Sentry SDK export any of this event-handling logic in a way that we could consume, or would we need to roll it by hand? The only gotcha I can think of is handling compressed (i.e. binary) events, such as from Replays.
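For reference, the core of such a standalone tunnel can fit in a short handler. The sketch below is a hypothetical illustration (the event shape assumes an API Gateway proxy integration or Lambda function URL, the allow-list value is a placeholder, and it assumes Node 18+ for the global `fetch`). Note that the binary-payload gotcha is manageable: the envelope header is always a plain JSON line, even when later envelope items (e.g. compressed Replay segments) are binary, so only the first line needs to be parsed:

```javascript
// Hypothetical standalone Lambda tunnel sketch — not official Sentry code.
const ALLOWED_PROJECT_IDS = new Set(["123456"]); // placeholder: your own project id(s)

// Extract the forwarding target from the envelope's first line (the header),
// which is plain JSON even when the rest of the body is binary.
function envelopeTarget(bodyBuffer) {
  const newline = bodyBuffer.indexOf(0x0a);
  const headerBytes = newline === -1 ? bodyBuffer : bodyBuffer.slice(0, newline);
  const header = JSON.parse(headerBytes.toString("utf8"));
  const dsn = new URL(header.dsn);                      // e.g. https://key@oXXX.ingest.sentry.io/123456
  const projectId = dsn.pathname.replace(/^\//, "");
  return { projectId, url: `https://${dsn.host}/api/${projectId}/envelope/` };
}

// In a real deployment this function would be exported as the Lambda handler.
async function handler(event) {
  if (!event.body) return { statusCode: 400, body: "" };
  // Binary (e.g. gzipped) bodies arrive base64-encoded from API Gateway / function URLs
  const body = Buffer.from(event.body, event.isBase64Encoded ? "base64" : "utf8");
  const { projectId, url } = envelopeTarget(body);
  if (!ALLOWED_PROJECT_IDS.has(projectId)) return { statusCode: 403, body: "" };
  await fetch(url, { method: "POST", body }); // forward the raw envelope untouched
  return { statusCode: 200, body: "{}" };
}
```

Forwarding the raw body untouched is what sidesteps the compression question: the tunnel never needs to decompress anything.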
We don't have a full lambda example right now, but it would be great to have one - PRs are welcome to our examples repo! We have an example for express.js and fastify, which can probably be adjusted to a lambda function relatively easily:
https://github.com/getsentry/examples/blob/master/tunneling/expressjs/index.js
I think by deploying this as a standalone function, removing the tunnelRoute config and just setting tunnel: '<lambda url>' in your sentry.client.config.js, this should work fine!
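In other words, the client-side change would be limited to the `tunnel` option (the URL below is a placeholder for wherever the Lambda is exposed):

```javascript
// sentry.client.config.js — point the browser SDK at the standalone tunnel
init({
  dsn: DSN,
  tunnel: "https://tunnel.mycompany.com/",
  // ...the rest of the existing options stay unchanged
});
```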
@mydea thanks for pointing me to that, it should be pretty straightforward to translate that into a lambda implementation. I will give it a try. It sounds very promising!
This issue has gone three weeks without activity. In another week, I will close it.
But! If you comment or otherwise update it, I will reset the clock, and if you remove the label Waiting for: Community, I will leave it alone ... forever!
"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
FYI, using a lambda does work, but it also introduces what I would call nonessential complexity. I still think it would be better for this to be solved on the Sentry side.
Whether it is batching or some other mechanism, there really isn't much of a reason for Sentry to continually make HTTP requests for every single event. I think a good comparison is logging agents, which typically buffer logs and flush them every so often instead of just sending everything as soon as it happens.
there really isn't much of a reason for Sentry to continually make HTTP requests for every single event.
Given our SDKs run directly in your app, we've mostly been sending requests per event, correct. Batching without potentially losing data is a tough problem to solve and in many environments probably impossible. Therefore, our model for now is to send things whenever they're ready. There are of course trade-offs, as with everything, but it's unlikely that this will change in the near future.
I think a good comparison is logging agents, which typically buffer logs and flush them every so often instead of just sending everything as soon as it happens.
The word "agent" has a couple of meanings, but one of Sentry's key differences to other observability providers is that you're not required to host an agent in your network. In the case of the tunnel, you do need a small proxy hosted somewhere, but really the idea is to just forward Sentry requests to work around ad blockers.
The closest thing to a hosted agent would be to self-host our ingestion service Relay but this usually is only recommended in high-throughput environments or when you're particularly concerned about PII/data leaving your network.
only recommended in high-throughput environments
@Lms24 I opened this issue with a specific focus on serverless environments. In such environments, it seems that the metric that matters is not "requests per second" (ie throughput), but "events per request".
For example, if loading a page triggers N sentry events and I need to fetch something from the server, then N+1 warm lambdas are needed to service the request without risking a cold start for the user. This leads to additional cold starts when traffic is spiking and there aren't many warm lambdas.
I'm not sure I understand why Sentry can't buffer events in the browser before sending them to the tunnel--is this not essentially what "idle transactions" do? For events that originate on the server this wouldn't make sense, but those could also just be batched in one request that is sent at the end.
I'm not sure I understand why Sentry can't buffer events in the browser
There are a couple of challenges with this. Assuming we actually buffer events in the SDK (i.e. errors, spans, and maybe replays, which we already buffer btw) and flush them in one go, we'd lose all buffered events if
- users suddenly close the browser
- the browser crashes
- we can't find a way to flush reliably before users reload the page
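To expand on that last bullet: the usual best-effort workaround is flushing on pagehide with navigator.sendBeacon, but even that offers no delivery guarantee, which is exactly the problem. A hypothetical sketch (not SDK code; all names are made up):

```javascript
// Best-effort flush sketch (browser-only, hypothetical — not part of the Sentry SDK).
const buffer = [];

function serializeBatch(events) {
  // One JSON document per line, so a receiver could still recover a prefix
  // if the payload is truncated in transit
  return events.map((e) => JSON.stringify(e)).join("\n");
}

function flushOnPageHide(endpoint) {
  addEventListener("pagehide", () => {
    if (buffer.length === 0) return;
    // sendBeacon queues the request even as the page is torn down,
    // but the browser may still drop it — hence "best effort" only
    navigator.sendBeacon(endpoint, serializeBatch(buffer.splice(0)));
  });
}
```

A hard browser crash bypasses pagehide entirely, so buffered-but-unsent events are simply gone in that case.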
Now, let's not forget that all of this batching logic would add bundle size, which our users are already calling us out on as it is today. Sure, we could make this tree-shakeable to a degree, but some bits and pieces would have to be added to the core of the SDK.
These are the client-facing problems, which are already hard and were very challenging when we started batching Replay segments. This doesn't even cover the backend changes necessary to accept requests with multiple envelopes.
So at this time, we're not going to look into this.
is this not essentially what "idle transactions" do?
An idle transaction (or span nowadays) only means that the SDK starts a span/transaction which will eventually finish itself if no child spans are added to it. We do have some of the same limitations there around browser crashes/closing, but other events (e.g. errors) are completely independent and always sent immediately.