opentelemetry-erlang

Performance issue / holding reference to binaries

Open elvanja opened this issue 1 year ago • 40 comments

Hi,

A Phoenix app I am working on is experiencing a binaries-related memory leak. It is tied to jobs triggered by https://github.com/quantum-elixir, but after some investigation I found that it is actually an OpenTelemetry-related process that holds on to those binaries. I'm not sure whether this is a Quantum or Telemetry issue, though (or our usage!).

The write-up "Binaries memory leak with quantum and (open)telemetry" contains all the details.

Any ideas on what might be the cause?

elvanja avatar Sep 12 '22 07:09 elvanja

Thanks, I haven't had time to do a detailed dive into your post, but from a quick look: are you seeing the issue in h2_connection? If so, maybe you can switch to using http_protobuf instead of grpc for exporting? If it is the http2 connection causing the leak I'll get it fixed, but switching will give immediate relief if that is the issue, and it will also tell you for sure whether that is the cause.

tsloughter avatar Sep 12 '22 11:09 tsloughter

Unfortunately it is not possible to move to http_protobuf at this time. What I did find out in the meantime is that another project had a similar leak, but without the Quantum scheduler involved. What they did in the end was a bit of a brute-force approach: they implemented a watchdog which killed h2_stream processes if there were too many of them (an arbitrary threshold). That apparently helped reduce memory usage and was safe for them to handle (losing some traces was OK). I will see if there is anything I can do about protobuf. Thanks for looking into this.
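
For illustration, the watchdog was something along these lines (a sketch only; the threshold, the interval, and the assumption that h2_stream processes can be identified via their :"$initial_call" are assumptions here, not that project's actual code):

defmodule H2StreamWatchdog do
  use GenServer

  # Sketch only: assumes chatterbox stream processes record {:h2_stream, :init, 1}
  # as their :"$initial_call"; threshold and interval are arbitrary. Killing
  # streams means losing the traces they carry.
  @check_interval :timer.minutes(1)
  @max_streams 1_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:check, state) do
    # Collect all processes that look like h2_stream gen_statems.
    streams =
      for pid <- Process.list(),
          {:dictionary, dict} <- [Process.info(pid, :dictionary)],
          dict[:"$initial_call"] == {:h2_stream, :init, 1},
          do: pid

    # Kill whatever exceeds the threshold.
    excess = length(streams) - @max_streams
    if excess > 0, do: streams |> Enum.take(excess) |> Enum.each(&Process.exit(&1, :kill))

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :check, @check_interval)
end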

elvanja avatar Sep 12 '22 14:09 elvanja

Ok, this issue actually sounds familiar; there may already be a fix for it that I can merge into chatterbox and release as a new version. Will get back to you.

tsloughter avatar Sep 12 '22 17:09 tsloughter

See https://github.com/joedevivo/chatterbox/issues/120

tsloughter avatar Sep 12 '22 22:09 tsloughter

@elvanja I've released grpcbox 0.16.0 and chatterbox 0.13.0. Are you able to bump to those versions and see if anything improves?

tsloughter avatar Sep 13 '22 23:09 tsloughter

Will do, stay tuned for results. It will take at least a few days to determine whether this helps.

elvanja avatar Sep 14 '22 06:09 elvanja

Update: the app version with grpcbox 0.16.0 and chatterbox 0.13.0 has been in production for 5 days now, and I am still seeing the same pattern of binaries being held. I upgraded only those two dependencies; all the other related dependencies stayed the same. Do you feel it would make sense to upgrade all related dependencies before giving :http_protobuf a chance? I confirmed in the meantime that I can deploy :http_protobuf as well. It's a bit more work, but doable.

Here's the list of related dependencies that use telemetry (I removed the others to reduce the noise):

app
├── broadway ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4.3 or ~> 1.0 (Hex package)
├── chatterbox ~> 0.13 (Hex package)
├── ecto ~> 3.6 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── ecto_sql ~> 3.6 (Hex package)
│   ├── db_connection ~> 2.2 (Hex package)
│   │   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4.0 (Hex package)
├── ex_aws ~> 2.2 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── finch ~> 0.12 (Hex package) *override*
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── grpcbox ~> 0.16 (Hex package)
│   ├── chatterbox ~>0.13.0 (Hex package)
│   └── gproc ~>0.8.0 (Hex package)
├── mentat ~> 0.7.1 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── opentelemetry ~> 1.0 (Hex package)
│   └── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_ecto ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   ├── opentelemetry_process_propagator ~> 0.1.0 (Hex package)
│   │   └── opentelemetry_api ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── opentelemetry_exporter ~> 1.0 (Hex package)
│   ├── grpcbox >= 0.0.0 (Hex package)
│   ├── opentelemetry ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_phoenix ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   ├── opentelemetry_telemetry ~> 1.0 (Hex package)
│   │   ├── opentelemetry_api ~> 1.0 (Hex package)
│   │   ├── telemetry ~> 1.0 (Hex package)
│   │   └── telemetry_registry ~> 0.3.0 (Hex package)
│   │       └── telemetry ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── phoenix ~> 1.5.8 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── plug ~> 1.11.0 (Hex package) *override*
│   └── telemetry ~> 0.4 (Hex package)
├── plug_cowboy ~> 2.5 (Hex package)
│   ├── cowboy_telemetry ~> 0.3 (Hex package)
│   │   └── telemetry ~> 0.4 (Hex package)
├── prom_ex ~> 1.4.0 (Hex package)
│   ├── telemetry >= 0.4.0 (Hex package)
│   ├── telemetry_metrics ~> 0.6.0 (Hex package)
│   ├── telemetry_metrics_prometheus_core ~> 1.0.1 (Hex package)
│   │   ├── telemetry ~> 0.4 (Hex package)
│   │   └── telemetry_metrics ~> 0.6 (Hex package)
│   └── telemetry_poller ~> 0.5.1 (Hex package)
├── quantum ~> 3.0 (Hex package)
│   └── telemetry ~> 0.4.3 or ~> 1.0.0 (Hex package)
├── telemetry ~> 1.1 (Hex package) *override*
├── telemetry_metrics ~> 0.6 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── telemetry_poller ~> 0.5 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── tesla ~> 1.4.3 (Hex package) *override*
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)

If it helps, here are the details for the largest one (the process holding the most binary memory):

%{
  binaries_size: 1226523189,
  memory: 239552,
  pid: #PID<0.5407.0>,
  process_info: [
    current_function: {:gen_statem, :loop_receive, 3},
    initial_call: {:proc_lib, :init_p, 5},
    status: :waiting,
    message_queue_len: 0,
    links: [#PID<0.16047.24>, #PID<0.24620.2>, #PID<0.29652.25>,
     #PID<0.31253.1>, #PID<0.32115.25>, #PID<0.32508.24>, #Port<0.121>,
     #PID<0.32245.48>, #PID<0.31690.48>, #PID<0.31829.70>, #PID<0.31763.49>,
     #PID<0.31675.24>, #PID<0.30646.48>, #PID<0.30675.69>, #PID<0.30813.69>,
     #PID<0.30898.70>, #PID<0.30721.49>, #PID<0.30649.24>, #PID<0.29999.24>,
     #PID<0.30020.91>, #PID<0.30083.25>, #PID<0.30000.49>, #PID<0.29777.25>,
     #PID<0.29987.70>, #PID<0.29750.92>, #PID<0.27522.24>, #PID<0.28244.24>,
     #PID<0.29183.24>, #PID<0.29457.25>, #PID<0.29525.92>, #PID<0.29493.48>,
     #PID<0.29415.49>, #PID<0.29246.91>, #PID<0.28777.49>, #PID<0.29171.25>,
     #PID<0.29067.69>, #PID<0.28591.49>, #PID<0.28598.49>, #PID<0.28079.1>,
     #PID<0.28124.24>, #PID<0.27628.70>, ...],
    dictionary: [
      "$ancestors": [#PID<0.4553.0>, #PID<0.4547.0>, :otel_batch_processor,
       :opentelemetry_sup, #PID<0.4498.0>],
      "$initial_call": {:h2_connection, :init, 1}
    ],
    trap_exit: false,
    error_handler: :error_handler,
    priority: :normal,
    group_leader: #PID<0.4497.0>,
    total_heap_size: 28690,
    heap_size: 28690,
    stack_size: 11,
    reductions: 568170927,
    garbage_collection: [
      max_heap_size: %{error_logger: true, kill: true, size: 0},
      min_bin_vheap_size: 46422,
      min_heap_size: 233,
      fullsweep_after: 65535,
      minor_gcs: 0
    ],
    suspending: []
  ]
}
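
For reference, a snapshot like the one above can be gathered from a remote console with something along these lines (a sketch, not necessarily the exact code used):

# Sum the off-heap binaries referenced by each process; :binary returns a list
# of {binary_id, size, refc_count} tuples.
{pid, binaries_size} =
  Process.list()
  |> Enum.map(fn pid ->
    case Process.info(pid, :binary) do
      {:binary, bins} -> {pid, bins |> Enum.map(&elem(&1, 1)) |> Enum.sum()}
      nil -> {pid, 0}
    end
  end)
  |> Enum.max_by(fn {_pid, size} -> size end)

# Then dump the offender, similar to the snapshot above.
{binaries_size, Process.info(pid, [:current_function, :initial_call, :links, :dictionary, :memory, :reductions])}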

elvanja avatar Sep 19 '22 11:09 elvanja

I'd go with using http_protobuf if that can work for you.

There should be a simple fix here; it is simply wrong that those stream processes are still alive, so something isn't closing them properly.

tsloughter avatar Sep 20 '22 21:09 tsloughter

I was hoping it would be as simple as checking the number of links a connection process still has after running some unary requests, so I could write a test for it in grpcbox, but that doesn't appear to be the case.
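
(The check meant here is roughly the following, with conn_pid assumed to be the exporter's h2_connection process:)

# After a few unary exports have completed, leaked streams show up as pids that
# stay linked to the connection (the port in :links is the underlying socket).
{:links, links} = Process.info(conn_pid, :links)
Enum.count(links, &is_pid/1)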

tsloughter avatar Sep 20 '22 22:09 tsloughter

Are you sending to the OTel Collector? If so, what version? I'm now wondering whether the server is not closing its end, and that's why those processes stay alive.

I'm testing against 0.60.0 of the collector-contrib docker image.

tsloughter avatar Sep 20 '22 22:09 tsloughter

We are actually sending to Honeycomb, via Refinery as a proxy. Related links: https://github.com/honeycombio/refinery/tree/v1.17.0 and https://docs.honeycomb.io/manage-data-volume/refinery/.

And I have started the switch to protobuf, but again it will take a bit of time to release it and see whether it solves the problem. Stay tuned.

elvanja avatar Sep 22 '22 13:09 elvanja

Update: switching protocols solved the problem. Apologies for the slightly late reply, but it took some time to make sure we are in the clear. Not sure why the :grpc flow causes this, but at least in our case the switch to :http_protobuf saved the day. Thank you for your time @tsloughter! 🙇🏼 Not sure if you want to keep the issue open because of :grpc though.

elvanja avatar Oct 11 '22 08:10 elvanja

@elvanja good to hear.

We should keep the issue open. I know it is a bug in the grpc client and I need to get it fixed.

tsloughter avatar Oct 11 '22 10:10 tsloughter

Hey, we are also seeing the same issue. Unfortunately, setting the protocol to :http_protobuf did not solve it. We tested the app with the exporter disabled and the issue goes away. Any tips on how we can debug this further to provide more details or context?

erlang 23.3.4
elixir 1.11.4-otp-23

app
├── ecto_sql ~> 3.4 (Hex package)
│   ├── db_connection ~> 2.2 (Hex package)
│   │   ├── connection ~> 1.0 (Hex package)
│   │   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   ├── ecto ~> 3.7.0 (Hex package)
│   ├── postgrex ~> 0.15.0 or ~> 0.16.0 or ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4.0 or ~> 1.0 (Hex package)
├── ex_aws ~> 2.0 (Hex package)
│   ├── hackney ~> 1.16 (Hex package)
│   │   ├── certifi ~>2.9.0 (Hex package)
│   │   ├── idna ~>6.1.0 (Hex package)
│   │   │   └── unicode_util_compat ~>0.7.0 (Hex package)
│   │   ├── metrics ~>1.0.0 (Hex package)
│   │   ├── mimerl ~>1.1 (Hex package)
│   │   ├── parse_trans 3.3.1 (Hex package)
│   │   ├── ssl_verify_fun ~>1.1.0 (Hex package)
│   │   └── unicode_util_compat ~>0.7.0 (Hex package)
│   ├── jason ~> 1.1 (Hex package)
│   ├── mime ~> 1.2 or ~> 2.0 (Hex package)
│   └── telemetry ~> 0.4.3 or ~> 1.0 (Hex package)
├── ex_aws_sns ~> 2.0 (Hex package)
│   └── ex_aws ~> 2.0 (Hex package)
├── opentelemetry ~> 1.0 (Hex package)
│   └── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_ecto ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   ├── opentelemetry_process_propagator ~> 0.1.0 (Hex package)
│   │   └── opentelemetry_api ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── opentelemetry_exporter ~> 1.0 (Hex package)
│   ├── grpcbox >= 0.0.0 (Hex package)
│   │   ├── acceptor_pool ~>1.0.0 (Hex package)
│   │   ├── chatterbox ~>0.11.0 (Hex package)
│   │   │   └── hpack ~>0.2.3 (Hex package)
│   │   ├── ctx ~>0.6.0 (Hex package)
│   │   └── gproc ~>0.8.0 (Hex package)
│   ├── opentelemetry ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   └── tls_certificate_check ~> 1.11 (Hex package)
│       └── ssl_verify_fun 1.1.6 (Hex package)
├── opentelemetry_logger_metadata ~> 0.1.0-rc (Hex package)
│   └── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_phoenix ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   ├── opentelemetry_telemetry ~> 1.0 (Hex package)
│   │   ├── opentelemetry_api ~> 1.0 (Hex package)
│   │   ├── telemetry ~> 1.0 (Hex package)
│   │   └── telemetry_registry ~> 0.3.0 (Hex package)
│   │       └── telemetry ~> 1.0 (Hex package)
│   ├── plug >= 1.11.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── opentelemetry_tesla ~> 1.3.2-rc.4 (Hex package)
│   ├── opentelemetry_api ~> 1.0.0 (Hex package)
│   ├── opentelemetry_telemetry ~> 1.0.0-beta (Hex package)
│   ├── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   └── tesla ~> 1.4 (Hex package)
├── phoenix ~> 1.5.9 (Hex package)
│   ├── jason ~> 1.0 (Hex package)
│   ├── phoenix_html ~> 2.13 or ~> 3.0 (Hex package)
│   │   └── plug ~> 1.5 (Hex package)
│   ├── phoenix_pubsub ~> 2.0 (Hex package)
│   ├── plug ~> 1.10 (Hex package)
│   ├── plug_cowboy ~> 1.0 or ~> 2.2 (Hex package)
│   ├── plug_crypto ~> 1.1.2 or ~> 1.2 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── phoenix_ecto ~> 4.1 (Hex package)
│   ├── ecto ~> 3.3 (Hex package)
│   ├── phoenix_html ~> 2.14.2 or ~> 3.0 (Hex package)
│   └── plug ~> 1.9 (Hex package)
├── telemetry_metrics ~> 0.6 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── telemetry_poller ~> 1.0 (Hex package)
│   └── telemetry ~> 1.0 (Hex package)
├── tesla ~> 1.4 (Hex package)
│   ├── hackney ~> 1.6 (Hex package)
│   ├── jason >= 1.0.0 (Hex package)
│   ├── mime ~> 1.0 or ~> 2.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)

davan-nl avatar Oct 11 '22 22:10 davan-nl

It is most likely not the same issue if it happens even with http_protobuf. Can you open a new issue with any more details you can provide?

tsloughter avatar Oct 11 '22 22:10 tsloughter

Sorry for the question, but is this the correct way to configure the exporter:

    otel_batch_processor: %{
      exporter:
        {:opentelemetry_exporter,
         %{endpoints: [{:http, otel_collector_host, otel_collector_port, []}], protocol: :http_protobuf}}
    }

For some reason we are seeing grpcbox_client:recv_end/2 being called (in one of the most memory-hungry processes), but the only library in our app that depends on grpcbox is the exporter. Is it used for something else even when the protocol is http_protobuf?
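
One way to confirm whether grpcbox client code is still running after the switch is to look for processes currently executing it from a remote console, e.g. (sketch):

# List processes currently executing grpcbox_client code; with the HTTP
# exporter in use this should come back empty.
for pid <- Process.list(),
    {:current_function, {mod, _fun, _arity}} <- [Process.info(pid, :current_function)],
    mod == :grpcbox_client,
    do: {pid, Process.info(pid, :memory)}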

davan-nl avatar Oct 12 '22 15:10 davan-nl

@davan-nl technically it can be configured that way, but since that form is so deeply nested and hard to get right, we have a simpler way to configure the processor and exporter:

# config/runtime.exs
config :opentelemetry,
  span_processor: :batch,
  traces_exporter: :otlp

config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "http://localhost:4318"

These are also the defaults.
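
If in doubt, you can check what the applications actually picked up at runtime from a remote console, e.g.:

# What the SDK and the exporter see at runtime
Application.get_env(:opentelemetry, :traces_exporter)
Application.get_all_env(:opentelemetry_exporter)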

tsloughter avatar Oct 12 '22 16:10 tsloughter

and no, grpc isn't used for anything else, so you should not see any use of grpcbox_client when the processor is configured to use the http exporter.

tsloughter avatar Oct 12 '22 16:10 tsloughter

Thanks for the prompt responses @tsloughter. I am wondering what the priority is for getting grpc working. As you mention, using http_protobuf fixes the leak, but it's not really an option for us. Wish I could help, but I don't know Erlang 😅

davan-nl avatar Oct 12 '22 19:10 davan-nl

@davan-nl are you sending to the otel collector or a vendor directly?

tsloughter avatar Oct 12 '22 20:10 tsloughter

I tried a simple example against the collector and I'm not seeing the ever-increasing list of links (meaning a stream leak) that you show in the process info for the grpc connection process.

tsloughter avatar Oct 12 '22 21:10 tsloughter

We use Instana, which has a local agent running. We can try investigating whether putting the OTel Collector in between helps. There was a clear difference in memory usage when we switched from grpc to http_protobuf.

davan-nl avatar Oct 12 '22 22:10 davan-nl

Ok, yeah, it'd be great if you can reproduce with the OTel Collector. I'm running an app locally publishing to the collector and not reproducing it. It could be something with how the vendor agents close, or don't close, streams. It's still likely a bug on our end (not ending the stream when we should), but it would help narrow down where the bug is.

tsloughter avatar Oct 12 '22 22:10 tsloughter

Just wanted to add that this was the output from recon's :recon.bin_leak(5):

[
  {#PID<0.30485.5>, -45019,
   [
     current_function: {:grpcbox_client, :recv_end, 2},
     initial_call: {:erlang, :apply, 2}
   ]},
  {#PID<0.2685.0>, -1120,
   [
     :ssl_manager,
     {:current_function, {:gen_server, :loop, 7}},
     {:initial_call, {:proc_lib, :init_p, 5}}
   ]},
  {#PID<0.3093.0>, -622,
   [
     SomeService,
     {:current_function, {:gen, :do_call, 4}},
     {:initial_call, {:proc_lib, :init_p, 5}}
   ]},
  {#PID<0.2711.0>, -537,
   [
     Logger,
     {:current_function, {:re, :postprocess, 5}},
     {:initial_call, {:proc_lib, :init_p, 5}}
   ]},
  {#PID<0.28540.5>, -407,
   [
     current_function: {:gen_statem, :loop_receive, 3},
     initial_call: {:proc_lib, :init_p, 5}
   ]}
]

davan-nl avatar Oct 12 '22 22:10 davan-nl

Thanks. recv_end may, at least partially, mean my guess is right: the stream gets into an end state but for some reason never finishes closing.

tsloughter avatar Oct 12 '22 22:10 tsloughter

Just wanted to give one last update: we tried Elixir -> OTel Collector -> Instana, all over gRPC, and everything worked smoothly. If Instana were mishandling things on their end, we would see the collector leak too, no? Just more details in case you suspect the grpcbox lib.

davan-nl avatar Oct 13 '22 21:10 davan-nl

@davan-nl good to know. And yes, it is most likely an issue in grpcbox, but not 100%. It could be that both Instana and the Collector use Go's grpc lib and so have the same quirk, but more likely something about talking to a grpc server over the internet (it could be an intermediary box causing this) results in grpcbox not closing the stream.

But now I know I should be able to reproduce by signing up for a service and giving it a try.

tsloughter avatar Oct 13 '22 22:10 tsloughter

I've got an example exporting to Honeycomb; it's creating a span every 500 ms (it's just a free account, so I'm trying not to get cut off :)) and I'm still unable to see this happening. Going to keep it running and hopefully it eventually reproduces.
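
The load generator is nothing special, essentially a loop like the following sketch (span name and rate are placeholders, not the exact code):

require OpenTelemetry.Tracer, as: Tracer

# Emit one short-lived span every 500 ms; "leak-test" is a placeholder name.
Stream.interval(500)
|> Stream.each(fn i ->
  Tracer.with_span "leak-test-#{i}" do
    :ok
  end
end)
|> Stream.run()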

tsloughter avatar Oct 14 '22 11:10 tsloughter

Oh, oops, I forgot that for Honeycomb the traffic was first going through Refinery when @elvanja was seeing this... I guess I need to try running with that to be able to reproduce.

tsloughter avatar Oct 14 '22 11:10 tsloughter

Hi, any update on the issue above? We also use Instana and are seeing the memory leak issue.

pprogrammingg avatar Oct 25 '22 17:10 pprogrammingg