
Slow start-up?

a-h opened this issue 1 year ago · 26 comments

I tried out migrating away from AWS's X-Ray SDK for Lambda, but the OpenTelemetry Lambda layer appears to add a significant amount to cold start time, which I didn't expect.

It was suggested I cross-post, since this is the repo that contains the layers.

https://github.com/aws-observability/aws-otel-lambda/issues/228#issuecomment-1193215390

Here's the data for reference:

[screenshot: cold start timing data, 2022-07-24]

I don't see any documentation on performance, comparison to X-Ray performance etc.

Is there a plan to reduce this? I didn't expect to have to add a Lambda layer to get OpenTelemetry working, I thought it would be included in the Lambda runtime as a first class thing, rather than being a sort-of add-on.

a-h avatar Jul 24 '22 00:07 a-h

Hi, ADOT PM here. Thanks a lot, we're already in the process of diving deep on this and I will report back once we have some shareable data (ETA: early August 2022).

mhausenblas avatar Jul 26 '22 16:07 mhausenblas

Is there an expected startup time for the ADOT collector and instrumentation? Running the latest Node.js layer (1.6.0:1), I am still witnessing 2+ second startups. Tested with both 1024MB and 1536MB memory.

Same test requests go from 3 seconds to 100ms after warming up: [screenshot]

Sample initialization: [screenshot]

adambartholomew avatar Oct 05 '22 17:10 adambartholomew

@adambartholomew we've identified the issue with cold starts and are considering ways to address it. Thanks for sharing your data points; we currently do not publish expected startup times.

mhausenblas avatar Oct 05 '22 18:10 mhausenblas

Any update on this? Is there an ETA for a fix? Can we expect a solution that is comparable to native EMF?

sam-goodwin avatar Dec 06 '22 05:12 sam-goodwin

Also eager to hear any updates. Are there any workarounds in the meantime?

adambiggs avatar Dec 21 '22 01:12 adambiggs

AWS SnapStart would seem to remove this as a problem. Benchmarking I've done shows cold starts are largely removed as a factor when using SnapStart.

Sutty100 avatar Dec 21 '22 11:12 Sutty100

@Sutty100 - SnapStart is Java only, and specifically only Java 11, so it doesn't solve anything for most people using Lambda.

a-h avatar Dec 21 '22 15:12 a-h

AWS SnapStart would seem to remove this as a problem. Benchmarking I've done shows cold starts are largely removed as a factor when using SnapStart.

I believe even if/when SnapStart is expanded to other runtimes, it doesn't currently support or address Lambda Extensions, so it wouldn't benefit this issue right now either.

RichiCoder1 avatar Dec 21 '22 21:12 RichiCoder1

Related issue in aws-otel-lambda repo: https://github.com/aws-observability/aws-otel-lambda/issues/228

bilalq avatar Jan 29 '23 06:01 bilalq

@mhausenblas not to be too noisy, but is there any update on this? Or a plan to provide an update? This makes using the OTel layer close to a no-go for a number of latency-sensitive cases.

RichiCoder1 avatar Feb 27 '23 18:02 RichiCoder1

@RichiCoder1 no problem at all, yes we're working on it and should be able to share details soon. Overall, our plan is to address the issues in Q1; what we need to verify is to what extent.

mhausenblas avatar Feb 27 '23 19:02 mhausenblas

To give some feedback: this is believed to be due to auto-instrumentation, so you may be able to improve your startup now by building your own, narrower layer.
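For example, a minimal sketch of that narrower setup, assuming you bundle @opentelemetry/sdk-node and hand-pick instrumentations (the AWS SDK one here is only an example of what a function might need):

```typescript
// Sketch only: register a hand-picked instrumentation set instead of the
// full auto-instrumentation bundle the layer ships with.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';
import { AwsInstrumentation } from '@opentelemetry/instrumentation-aws-sdk';

const sdk = new NodeSDK({
  // Defaults to the local endpoint, http://localhost:4318/v1/traces.
  traceExporter: new OTLPTraceExporter(),
  // Only the instrumentations this function actually needs.
  instrumentations: [new AwsInstrumentation()],
});

sdk.start();
```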

tsloughter avatar Mar 03 '23 17:03 tsloughter

@tsloughter - do you mean "the 200ms cold start time is caused by auto-instrumentation"?

I can't see how that could be the case. Since https://opentelemetry.io/docs/instrumentation/go/libraries/ says:

Go does not support truly automatic instrumentation like other languages today.

And the Lambda layer is written in Go.

a-h avatar Mar 03 '23 17:03 a-h

@a-h ah, I didn't see any mention of the language in use. You are right, in Go there is no auto-instrumentation.

tsloughter avatar Mar 03 '23 17:03 tsloughter

Hey @mhausenblas - any updates on the timeline for this by chance?

disfluxly avatar May 03 '23 16:05 disfluxly

I was trying to use ADOT with Lambda for Node.js + NestJS, but the auto-instrumentation performed by ADOT was adding seconds to the cold start time. @mhausenblas, please let us know if you have any updates on the timeline for this issue.

sangalli avatar May 12 '23 20:05 sangalli

Hi, in our tests we are seeing slow invocation start due to Collector extension registration (~800-2000 ms), plus latency on the emit (POST) of telemetry from the function invocation to the Collector extension (~200-450 ms).

The initialisation duration will, of course, drop on subsequent invocations, but the POST latency (the ~200 ms) will remain for all invocations.

[screenshot: invocation latency measurements, 2023-07-01]

Is there any news/update on remedies for this, @mhausenblas?

(Is there any suggestion from AWS on the best course of action here with Lambda? Is emitting via the OTel SDK [no local agent] to a central Collector seen as a better way forward?)
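For illustration, the no-local-agent alternative I mean would look roughly like this (the endpoint URL is hypothetical):

```typescript
// Rough illustration: export directly from the function to a central
// Collector, with no local extension. The endpoint URL is hypothetical.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'https://central-collector.example.com/v1/traces', // hypothetical
  }),
});

sdk.start();
```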

Thanks

ithompson-gp avatar Jul 01 '23 14:07 ithompson-gp

Hi, how are you measuring the latency for the subsequent invocations after the initialization? Is POST the HTTP verb, or is it something else?

Since you have a test setup in place, what is the latency when you don't use a layer?

rapphil avatar Jul 06 '23 05:07 rapphil

Hi @rapphil, I didn't see your message in July. On the screenshot, there's a red line. Above the line is when I added the OTel layer, and the cold start increased from around 100ms to 300ms.

a-h avatar Oct 31 '23 17:10 a-h

Hi there, we use Lambda serverless workloads in Financial Services with tight execution-time SLAs, which makes the overhead caused by introducing AWS ADOT or a custom extension layer for the OTel SDK or OTel Collector unacceptable. We are trying to cut down the cold start time by minimizing layers and using just the SDK without the Collector, but it looks like we won't be able to cut the overhead down to an acceptable level.

If others have run into similar challenges, I'd be interested in learning how you are able to work around this and still collect distributed traces for such workloads. Thanks

silpamittapalli avatar Apr 02 '24 02:04 silpamittapalli

@silpamittapalli the Baselime folks have already tried to strip this down as much as possible and package it as two dependencies. It still has dependencies on these libraries, but it's worth looking at: https://github.com/baselime/node-opentelemetry/blob/b3331d5040bf35ca633c3634c186a2a5304a201d/package.json#L61-L68

I think a full rewrite is in order. It should be a concise JS library optimized for ESM bundling.

sam-goodwin avatar Apr 02 '24 02:04 sam-goodwin

Thanks for the shout-out @sam-goodwin

We can make it smaller but opted not to make some changes we knew we could not upstream, to keep things maintainable. As it stands, our OTel setup, including the extension (so we have 0 runtime latency overhead), adds around 180ms of cold start.

I think it's possible to get sub-100ms cold starts whilst still being based on OpenTelemetry.

There are a few dependencies that can be patched or cut without changing behavior too much for most use cases, and I'm sure some other bits could be slimmed down a bit.

@silpamittapalli if you want to chat through your use case I'd be happy to help with this :)

On doing a complete optimized rewrite - it's easy to underestimate how much work has gone into OTel and how much it provides. It is a general solution, though, so it's not optimized for Lambda or other environments that prioritize a quick startup.

Here is our bundle - it's easy to see how much we have done versus how much we rely on the work by the OpenTelemetry team.

[screenshot: bundle size breakdown]

There are some quick wins in there: semver could be replaced with something purpose-built that is just a few KB, and semantic attributes could be tree-shaken better. I suspect resources and the resource detectors can also be improved, but the rest will be quite hard.
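For example, a purpose-built version check covering only plain x.y.z versions (an entirely hypothetical sketch, not anything we ship) is only a few lines:

```typescript
// Hypothetical sketch of a purpose-built semver replacement: a >= check
// that handles only plain x.y.z versions, unlike the full semver package.
function gte(a: string, b: string): boolean {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const x = pa[i] ?? 0;
    const y = pb[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true; // versions are equal
}

// e.g. gte('1.10.0', '1.9.3') === true
```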

Ankcorn avatar Apr 02 '24 08:04 Ankcorn

Has any memory profiling been done for this Lambda layer? Any recommendations for Node v12, v16, v18, or v20? How much is the memory overhead of using this layer?

bhaskarbanerjee avatar Apr 02 '24 15:04 bhaskarbanerjee

Thank you @sam-goodwin @Ankcorn. @bhaskarbanerjee from my team tried out Baselime, but we haven't had any success with it yet, probably because it is customized for their proprietary software. We are trying out a few other approaches: 1) manual instrumentation to eliminate layers altogether, and 2) minimizing the SDK and/or layer by stripping unused code/dependencies. A sketch of the first approach follows.
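A minimal sketch of approach (1), using only @opentelemetry/api inside the handler (the tracer and span names are hypothetical):

```typescript
// Sketch of manual instrumentation with the OTel API alone - no layer,
// no auto-instrumentation. Names are hypothetical.
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service'); // hypothetical tracer name

export const handler = async (event: unknown) => {
  return tracer.startActiveSpan('handle-request', async (span) => {
    try {
      // ... business logic ...
      return { statusCode: 200 };
    } finally {
      span.end();
    }
  });
};
```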

silpamittapalli avatar Apr 04 '24 01:04 silpamittapalli

Has anyone here used the protobuf/HTTP exporter and compared its performance with that of the gRPC exporter? Both for Lambda cold start time and response time?

Ref https://github.com/open-telemetry/opentelemetry-lambda/blob/main/nodejs/packages/layer/src/wrapper.ts#L24 - import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto' seems to be very fast, but if we do import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc', that seems to take at least 100ms more. Seeking your advice.
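For reference, a rough harness for comparing the load time of the two exporter modules in isolation (the harness itself is illustrative; run as an ES module so top-level await works):

```typescript
// Rough harness for comparing module load time of the two exporters.
// Run as an ES module (top-level await). Package names are real.
import { performance } from 'node:perf_hooks';

const t0 = performance.now();
await import('@opentelemetry/exporter-trace-otlp-proto');
const t1 = performance.now();
await import('@opentelemetry/exporter-trace-otlp-grpc');
const t2 = performance.now();

console.log(`proto exporter load: ${(t1 - t0).toFixed(1)} ms`);
console.log(`grpc exporter load:  ${(t2 - t1).toFixed(1)} ms`);
```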

bhaskarbanerjee avatar Apr 08 '24 02:04 bhaskarbanerjee

This is a blocking issue for us to use OpenTelemetry in Lambda.

stevemao avatar May 21 '24 01:05 stevemao