opentelemetry-erlang icon indicating copy to clipboard operation
opentelemetry-erlang copied to clipboard

add span monitor process to optionally monitor the spans in a process

Open tsloughter opened this issue 5 years ago • 13 comments

A span can optionally set monitor to true at the time it is created. This results in a monitor on the process and if that process exits for any reason all spans started in that process will be ended.

tsloughter avatar Jul 09 '20 14:07 tsloughter

This is partly wrong. I realized right before opening the PR that this shouldn't be done for the current process when start_inactive_span is used. This has to be done for the process it becomes active in.

Not sure yet how I want to fix this.

tsloughter avatar Jul 09 '20 14:07 tsloughter

Two questions I have are:

  • Should this be only per-span. Right now it will end all spans in the process no matter which span includes the monitor => true option. I could instead have the monitor only end spans that explicitly include the monitor option and leave the others open.
  • Maybe if the solution of having it end all spans (as it is in this PR currently) in the monitored process is agreed on it should then be a call like ot_tracer:monitor_spans instead of being an option on span start?

tsloughter avatar Jul 09 '20 14:07 tsloughter

It makes sense that you'd want to close all spans in the monitored process since the process dying is fatal to all of them.

bryannaegele avatar Jul 09 '20 16:07 bryannaegele

@bryannaegele my thought too. But people do/want weird things.

I should probably change it to be a function called to monitor all spans in a process instead of adding it to the span start opts then...

tsloughter avatar Jul 09 '20 17:07 tsloughter

I think I'd want to add this to my tracer config to cover every span it starts. I'd also want a way to turn the exit details into extra span attributes, though the latter'd also apply to any with_child_span construct taking a function argument and catching exceptions.

garthk avatar Jul 09 '20 23:07 garthk

@tsloughter is there any remaining work left on this PR?

I'm seeing this manifest when using Task.async_stream

Screen Shot 2020-08-31 at 11 22 58 AM

One of the tasks times out with no child span reported

davydog187 avatar Aug 31 '20 15:08 davydog187

@davydog187 oh, hm, I think I just wanted to offer a function to start monitoring in a process (instead of just being an option to start_span).

Maybe a config value so it could be turned on globally but we probably will want a pool of monitoring processes before supporting that so it'll come in a separate PR.

tsloughter avatar Sep 01 '20 23:09 tsloughter

So I want to finish this up and can't decide what the best interface is.

An issue is how to handle inactive spans. I think this means that when attaching a context the active span in that context has to have its pid updated in the ets table...

tsloughter avatar Sep 25 '20 16:09 tsloughter

I implemented (by peeking at this PR) this in a project I'm working on. In the end I went with the following interface: OpenTelemetry.Span.monitor/1. This is nice because it allows us to do something like:

OpenTelemetry.Tracer.with_span "my_span" do
  OpenTelemetry.Span.monitor(OpenTelemetry.Tracer.current_span_ctx())
end

Here's the code I ended up with, for posterity:

defmodule OT.Monitor do                                                                        
  use GenServer                                                                                
                                                                                               
  def start_link(_arg) do                                                                      
    GenServer.start_link(__MODULE__, nil, name: __MODULE__)                                    
  end                                                                                          
                                                                                               
  def init(nil) do                                                                             
    _table_id = :ets.new(__MODULE__, [:bag, :public, {:write_concurrency, true}, :named_table])
    {:ok, nil}                                                                                 
  end                                                                                          
                                                                                               
  def handle_call({:monitor, pid}, _from, state) do                                            
    Process.monitor(pid)                                                                       
    {:reply, :ok, state}                                                                       
  end                                                                                          
                                                                                               
  def handle_info({:DOWN, _ref, :process, pid, :normal}, state) do                             
    :ets.take(__MODULE__, pid)                                                                 
    |> Enum.each(fn {_pid, ctx} ->                                                             
      _span_ctx = OpenTelemetry.Tracer.set_current_span(ctx)                                   
      _ = OpenTelemetry.Tracer.end_span()                                                      
    end)                                                                                       
                                                                                               
    {:noreply, state}                                                                          
  end                                                                                          
                                                                                               
  def handle_info({:DOWN, _ref, :process, pid, {:shutdown, _}}, state) do                      
    :ets.take(__MODULE__, pid)                                                                 
    |> Enum.each(fn {_pid, ctx} ->                                                             
      _span_ctx = OpenTelemetry.Tracer.set_current_span(ctx)                                   
      _ = OpenTelemetry.Tracer.end_span()                                                      
    end)                                                                                       
                                                                                               
    {:noreply, state}                                                                          
  end                                                                                          
                                                                                               
  def handle_info({:DOWN, _ref, :process, pid, reason}, state) do                              
    :ets.take(__MODULE__, pid)                                                                 
    |> Enum.each(fn {_pid, ctx} ->                                                             
      _span_ctx = OpenTelemetry.Tracer.set_current_span(ctx)                                   
      _ = OpenTelemetry.Tracer.add_event("Process died", [{"reason", inspect(reason)}])        
      _ = OpenTelemetry.Tracer.end_span()                                                      
    end)                                                                                       
                                                                                               
    {:noreply, state}                                                                          
  end                                                                                          
                                                                                               
  def monitor(span_ctx) do                                                                     
    if Application.fetch_env!(:opentelemetry, :enabled) do                                     
      # monitor first, because the monitor is necessary to clean the ets table.                
      :ok = GenServer.call(__MODULE__, {:monitor, self()})                                     
      true = :ets.insert(__MODULE__, {self(), span_ctx})                                       
    end                                                                                        
  end                                                                                          
end                                                                                            

derekkraan avatar Oct 28 '21 12:10 derekkraan

I think this functionality is pretty critical. Personally I would want to have this for as many spans as possible, and definitely for active spans. So that means I would prefer not to set it as an option for every span, but rather as a config value, but I'll take any version of this feature I can get 🙂

hkrutzer avatar May 03 '22 12:05 hkrutzer

If we either make span processors not a pain or extend them to allow additional functionality attached to them then I think we could easily allow it to be configured for all spans instead of an option per-span.

tsloughter avatar May 03 '22 16:05 tsloughter

@tsloughter i was thinking if spans actually shoudn't live in a external process, that monitor the process starting the spans, in case of the process dying or raises, it closes all remaining open spans with error status. I know it would be a big change on otel sdk architecture, should i start a discussion about that?

fcevado avatar Sep 02 '22 13:09 fcevado

@fcevado that could work. There are other tracing libs that do just that, trying to remember which one... Ah, it is Tapper. And I think New Relic or AppSignal actually use (or used) a separate process for each trace -- which I wouldn't think would scale well, at least compared to Tapper or our use of ETS.

What I'd hope to be able to experiment with is alternate SDKs. Technically it should be possible for an SDK to be created that does make all span operations messages to another process. No user code should have to change, just which SDK they include in their project.

tsloughter avatar Sep 02 '22 16:09 tsloughter

Closing for https://github.com/open-telemetry/opentelemetry-erlang/pull/602

tsloughter avatar Aug 10 '23 11:08 tsloughter