
Record exception for ecto query error

spencerdcarlson opened this issue Jun 04 '22 · 36 comments

Record the exception for ecto query errors.

spencerdcarlson avatar Jun 04 '22 00:06 spencerdcarlson


@spencerdcarlson are all error results in the telemetry event exceptions?

bryannaegele avatar Jun 06 '22 16:06 bryannaegele

are all error results in the telemetry event exceptions?

Passing the error through Exception.normalize/2 will take care of any cases where it is not an exception.

iex(11)> Exception.normalize(:error, %UndefinedFunctionError{})
%UndefinedFunctionError{
  arity: nil,
  function: nil,
  message: nil,
  module: nil,
  reason: nil
}
iex(12)> Exception.normalize(:error, nil)
%ErlangError{original: nil}
iex(13)> Exception.normalize(:error, "some string")
%ErlangError{original: "some string"}

spencerdcarlson avatar Jun 06 '22 16:06 spencerdcarlson

Yes, but your code isn't accounting for if it is an exception or not, it just records an exception no matter what in the current implementation. How can we distinguish between those cases?

bryannaegele avatar Jun 06 '22 16:06 bryannaegele

Yes, but your code isn't accounting for if it is an exception or not, it just records an exception no matter what in the current implementation. How can we distinguish between those cases?

I guess I don't follow your question. My understanding is that this will only take place when ecto tells us that the query returned an error ({:error, error}). Wouldn't we always want to capture the error details regardless of whether it is a formal Exception or not?

Do you think we need some extra filtering, like OpentelemetryEcto.format_error/1, so that even when ecto says the query returned an error we only capture error details if it is a formal exception?

spencerdcarlson avatar Jun 06 '22 16:06 spencerdcarlson

Yes, but your code isn't accounting for if it is an exception or not, it just records an exception no matter what in the current implementation. How can we distinguish between those cases?

Ok, I think I follow. Is the concern that if ecto returns {:error, "Something bad happened"} then we are essentially type-casting that string into an ErlangError when it isn't actually an exception? My thought is that we would still want to capture any details we can about the error and propagate them as an OpenTel exception. Maybe it would be cleaner to create a custom UnknownEctoException to cast it to instead of an ErlangError.

spencerdcarlson avatar Jun 06 '22 16:06 spencerdcarlson

Yeah, the point was we can't assume everything is an exception just because the adapter returned an error and we can only record something as an exception if it actually is. We already set an error and capture the error message so I think any change here would have to be additive only for real exceptions. Since it's up to the adapter to emit and populate this event, I'm not sure if any of them are going to return an exception struct as the second value in the error tuple.

bryannaegele avatar Jun 06 '22 20:06 bryannaegele

Right, if it isn't an exception then the error on the status should be enough.

tsloughter avatar Jun 06 '22 20:06 tsloughter

I'm not sure if any of them are going to return an exception struct as the second value in the error tuple.

You can see from the augmented test that it is already returning a %Postgrex.Error{} as the second element of the tuple. Also, we are currently throwing away the error message if it is not an exception. It seems like we could preserve it if it is any valid UTF-8 string. What do you think about something like this?

...
    # Record an exception only when the adapter returned an actual exception
    # struct; otherwise fall back to just setting the span status.
    case query_result do
      {:error, %{__exception__: true} = exception} ->
        OpenTelemetry.Span.record_exception(s, Exception.normalize(:error, exception))
        OpenTelemetry.Span.set_status(s, OpenTelemetry.status(:error, Exception.message(exception)))

      {:error, error} ->
        # Only use the raw error as the status message if it is a printable UTF-8 binary.
        status = if utf8?(error), do: error, else: ""
        OpenTelemetry.Span.set_status(s, OpenTelemetry.status(:error, status))

      {:ok, _} ->
        :ok
    end

    OpenTelemetry.Span.end_span(s)

    if parent_context != :undefined do
      OpenTelemetry.Ctx.detach(parent_context)
    end
  end

  # Walk the binary codepoint by codepoint; any invalid UTF-8 falls through to false.
  defp utf8?(<<_::utf8, rest::binary>>), do: utf8?(rest)
  defp utf8?(<<>>), do: true
  defp utf8?(_), do: false

The reason I am opening this MR is that an exception gets surfaced in Datadog and an error message does not. It seems that opentelemetry_phoenix is instrumented to provide those details.

spencerdcarlson avatar Jun 06 '22 21:06 spencerdcarlson

@spencerdcarlson to be clear, does that above code that matches out an exception case work for your use-case and you see the error in datadog?

I think it is reasonable to record the exception if it is an exception -- like in your example code -- but it'd be good to know also why datadog isn't surfacing the error in OpenTelemetry.Span.set_status(s, OpenTelemetry.status(:error, format_error(error)))

tsloughter avatar Jun 06 '22 21:06 tsloughter

I figured I'd see what others were doing, and the Python sqlalchemy instrumentation records errors like this the same way: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/3137c1fa413a2b530300bfed57f71dcc94412b6a/tests/opentelemetry-docker-tests/tests/sqlalchemy_tests/test_sqlite.py#L59

It maybe should be added to the database semantic conventions though.

tsloughter avatar Jun 06 '22 22:06 tsloughter

I assume you are using the otel collector with the datadog exporter?

I dug around in there a bit; it has some code that looks specific to figuring out error information from the span status message: https://github.com/open-telemetry/opentelemetry-collector-contrib/commit/a95cdc07ea0fe6551604053015916dc680e35937

tsloughter avatar Jun 06 '22 22:06 tsloughter

@spencerdcarlson to be clear, does that above code that matches out an exception case work for your use-case and you see the error in datadog?

Yes, I believe so. The exceptions emitted by opentelemetry_phoenix have great visibility in Datadog. The main difference I notice is the call to OpenTelemetry.Span.record_exception/4. In the email I sent you there are two screenshots of a DB error trace and an HTTP error trace, and you can see the difference. I don't want to include the screenshots here since they have some internal code references.

but it'd be good to know also why datadog isn't surfacing the error in OpenTelemetry.Span.set_status(s, OpenTelemetry.status(:error, format_error(error)))

I agree. This seems like a bug that should be opened with either the OTEL Collector or Datadog.

I will update this branch with my proposed solution unless either of you have any objections.

spencerdcarlson avatar Jun 07 '22 02:06 spencerdcarlson

Interesting, so that means it is indeed an exception. That is beside the point; I'm just thrown off by Ecto returning exceptions as errors.

Do you have anyone at Datadog to also raise this issue with? I can try to poke around to find someone.

tsloughter avatar Jun 07 '22 12:06 tsloughter

I'm just thrown off by Ecto returning exceptions as errors.

That's the default behavior of Ecto with Postgrex or MyXQL. Here are a few examples of common exceptions that would be nice to capture:

iex(1)> Repo.all(from f in "foo", select: %{name: f.name})
[debug] QUERY ERROR source="foo" db=0.0ms queue=3.5ms idle=307.2ms
SELECT f0."name" FROM "foo" AS f0 []
** (Postgrex.Error) ERROR 42P01 (undefined_table) relation "foo" does not exist

    query: SELECT f0."name" FROM "foo" AS f0
    (ecto_sql 3.8.2) lib/ecto/adapters/sql.ex:932: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.8.2) lib/ecto/adapters/sql.ex:847: Ecto.Adapters.SQL.execute/6
    (ecto 3.8.3) lib/ecto/repo/queryable.ex:221: Ecto.Repo.Queryable.execute/4
    (ecto 3.8.3) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3

# Hitting the driver directly 

iex(2)> Repo.query("SELECT * FROM foo")
[debug] QUERY ERROR db=2.3ms queue=8.6ms idle=225.2ms
SELECT * FROM foo []
{:error,
 %Postgrex.Error{
   connection_id: 19802,
   message: nil,
   postgres: %{
     code: :undefined_table,
     file: "parse_relation.c",
     line: "1180",
     message: "relation \"foo\" does not exist",
     pg_code: "42P01",
     position: "15",
     routine: "parserOpenTable",
     severity: "ERROR",
     unknown: "ERROR"
   },
   query: "SELECT * FROM foo"
 }}
 
iex(3)> Repo.transaction(fn -> :timer.sleep(200) end, timeout: 20)
[debug] QUERY OK db=1.6ms idle=1390.7ms
begin []
[error] Postgrex.Protocol (#PID<0.616.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.875.0> timed out because it queued and checked out the connection for longer than 20ms

#PID<0.875.0> was at location:

    (stdlib 3.17.2) timer.erl:152: :timer.sleep/1
    (ecto_sql 3.8.2) lib/ecto/adapters/sql.ex:1222: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.4.2) lib/db_connection.ex:1562: DBConnection.run_transaction/4
    (stdlib 3.17.2) erl_eval.erl:685: :erl_eval.do_apply/6
    (elixir 1.13.1) src/elixir.erl:289: :elixir.recur_eval/3
    (elixir 1.13.1) src/elixir.erl:274: :elixir.eval_forms/3
    (iex 1.13.1) lib/iex/evaluator.ex:310: IEx.Evaluator.handle_eval/3
    (iex 1.13.1) lib/iex/evaluator.ex:285: IEx.Evaluator.do_eval/3
    (iex 1.13.1) lib/iex/evaluator.ex:274: IEx.Evaluator.eval/3
    (iex 1.13.1) lib/iex/evaluator.ex:169: IEx.Evaluator.loop/1
    (iex 1.13.1) lib/iex/evaluator.ex:32: IEx.Evaluator.init/4
    (stdlib 3.17.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

[debug] QUERY ERROR db=0.0ms
rollback []
{:error, :rollback}

Do you have anyone at Datadog to also raise this issue with?

I will keep looking into why Datadog or the OTEL collector is not surfacing the error message.

Do you have any reservations about going forward with the proposed code change in the meantime?

Reporting the exception captures a lot of useful data. Here is the full stacktrace from the augmented test:

(opentelemetry_ecto 1.0.0) lib/opentelemetry_ecto.ex:120: OpentelemetryEcto.handle_event/4
    (telemetry 1.1.0) /Users/sdc/code/sdc/opentelemetry-erlang-contrib/instrumentation/opentelemetry_ecto/deps/telemetry/src/telemetry.erl:160: anonymous fn/4 in :telemetry.execute/3
    (stdlib 3.17.2) lists.erl:1342: :lists.foreach/2
    (ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:927: Ecto.Adapters.SQL.log/4
    (db_connection 2.4.2) lib/db_connection.ex:1510: DBConnection.log/5
    (ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:701: Ecto.Adapters.SQL.execute!/4
    (ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:693: Ecto.Adapters.SQL.execute/5
    (ecto 3.7.2) lib/ecto/repo/queryable.ex:219: Ecto.Repo.Queryable.execute/4
    (ecto 3.7.2) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
    test/opentelemetry_ecto_test.exs:114: OpentelemetryEctoTest.\"test sets error message on error\"/1
    (ex_unit 1.13.4) lib/ex_unit/runner.ex:500: ExUnit.Runner.exec_test/1
    (stdlib 3.17.2) timer.erl:166: :timer.tc/1
    (ex_unit 1.13.4) lib/ex_unit/runner.ex:451: anonymous fn/4 in ExUnit.Runner.spawn_test_monitor/4

This should help pinpoint exactly which query ran into issues: test/opentelemetry_ecto_test.exs:114

spencerdcarlson avatar Jun 07 '22 14:06 spencerdcarlson

Not positive yet. But that stacktrace doesn't look right. It is a trace to the telemetry library, not where the query was called from.

tsloughter avatar Jun 07 '22 15:06 tsloughter

👋 Hey, I work at Datadog, I can relay this to the APM team.

@tsloughter if it's not too much to ask for, it would be useful to have a bite-sized explanation of what exactly is wrong with how the Datadog exporter deals with this. Could you share this here or in an issue at the Collector contrib repo? That would be super helpful, thanks!

mx-psi avatar Jun 07 '22 15:06 mx-psi

@mx-psi thanks!

I'm not sure there is an issue with the exporter. I was noting that there is code in the exporter specific to handling the status error message, so it was possible that we might find an explanation there. But I hadn't noticed anything that stands out as a potential cause.

tsloughter avatar Jun 07 '22 16:06 tsloughter

@mx-psi here are some redacted (just being overly cautious) screenshots of how I see the issue.

All the error traces related to opentelemetry_ecto only display error: true. The error status message has the helpful details (e.g. "[ERROR 42703 (undefined_column) column u0.non_existant_field does not exist\n\n query: SELECT u0."non_existant_field" FROM "users" AS u0]") but it is not associated to the trace.

In contrast it appears that most opentelemetry_phoenix error events contain a stacktrace and even an error.message tag. (e.g. "expected at most one result but got 143 in query: from i0 in SelfServiceIntegration.Integration.Schema.Integration, preload: [[:tags, :tray_solution]]")

opentelemetry_ecto error trace (redacted screenshot)

opentelemetry_phoenix error trace (redacted screenshot)

spencerdcarlson avatar Jun 07 '22 16:06 spencerdcarlson

What do you mean by "but it is not associated to the trace."? That information is somewhere in the Datadog UI but just not shown in the span tags?

tsloughter avatar Jun 07 '22 16:06 tsloughter

What do you mean by "but it is not associated to the trace."? That information is somewhere in the Datadog UI but just not shown in the span tags?

I cannot find it anywhere in Datadog. It appears not to be in Datadog, but I don't know exactly where the error status is getting lost.

spencerdcarlson avatar Jun 07 '22 16:06 spencerdcarlson

Not positive yet. But that stacktrace doesn't look right. It is a trace to the telemetry library, not where the query was called from.

This stack trace points directly back to where the original query was executed. Notice the line test/opentelemetry_ecto_test.exs:114, which references Repo.all(from u in "users", select: u.non_existant_field) from the test.

The reason the first two lines in the stacktrace mention telemetry is that after Ecto executes the query, it calls telemetry to emit the event -- so it makes sense that telemetry calls are in the stacktrace.

App -> ecto -> db_connection -> ecto_sql -> telemetry -> App.event_handler
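
For reference, a minimal sketch of how such a handler is attached; the handler id and event name below are made up for illustration. :telemetry.execute/3 invokes attached handlers synchronously in the process that emits the event, which is why the handler runs inside the same call chain as the query itself.

:telemetry.attach(
  "illustrative-ecto-handler",    # hypothetical handler id
  [:my_app, :repo, :query],       # hypothetical event name
  fn event_name, _measurements, _metadata, _config ->
    # Runs in the caller's process, inside App -> ecto -> ... -> telemetry.
    IO.inspect(event_name, label: "handled in the caller's process")
  end,
  nil
)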

This thread is attacking two distinct issues:

  1. The error status message is getting lost. (@mx-psi is looking into it)
  2. Not associating exceptions to a given trace.

Can we agree to fix 2. by reporting the exception?

spencerdcarlson avatar Jun 07 '22 18:06 spencerdcarlson

@spencerdcarlson what I see is the stacktrace starts with (opentelemetry_ecto 1.0.0) lib/opentelemetry_ecto.ex:120: OpentelemetryEcto.handle_event/4 because it isn't a stacktrace to where the exception happened, but to where the library is recording the exception.

I didn't see where the stacktrace was coming from since I don't see it in the %Postgrex.Error{}, so I looked up the format function and it says it will use the process info to get the current stacktrace: https://hexdocs.pm/elixir/1.12/Exception.html#format_stacktrace/1

That would explain handle_event being the top function of the stack.

You have other examples that are exceptions, like the one that times out in the transaction and has a stacktrace to the call that was running when it timed out.

If all the information is available in the Status then I don't think we should add an exception, unless there is an exception that must be caught and thus also has a stacktrace.

tsloughter avatar Jun 07 '22 18:06 tsloughter

Note that the Phoenix integration only records an exception if the exception event is triggered, which includes a stacktrace.

tsloughter avatar Jun 07 '22 18:06 tsloughter

There are a few more nuances here that make this difficult, which we've discussed since the beginning of this lib.

  1. Are we expecting this library to eventually translate and implement deciphering error vs. returned exception for all adapters in the ecosystem? We'd really like the other libraries and drivers to have their own instrumentation, e.g. Cowboy is instrumented separately from Phoenix.
     a. Each of the built-in adapters has different values on their exceptions: Postgres 1, Postgres 2, MyXQL, TDS has none, SQLite has none, etc.
  2. These telemetry event fields are:
     a. Completely optional
     b. Left to each adapter to implement

I do think there is a clear issue with DD if it isn't marking these spans as errored, which, while frustrating, isn't going to influence any decision to mark everything as an exception without a holistic review of all of Ecto's use cases. They show up fine in Lightstep and other tooling, so that seems like a bug on their end.

bryannaegele avatar Jun 07 '22 19:06 bryannaegele

@tsloughter sorry, where is Exception.format_stacktrace/1 being called?

I believe the stacktrace is coming from %Postgrex.Error{}, and I do not think that the call to Exception.normalize/2 is even needed: if %{__exception__: true} = exception, then Exception.normalize/2 just passes the exception through. This was protection for a possible scenario in my first commit that we can pull out.
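
As a quick sanity check (using a generic RuntimeError rather than a driver error), Exception.normalize/2 returns an exception struct unchanged:

iex> err = %RuntimeError{message: "boom"}
iex> Exception.normalize(:error, err) == err
true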

I think handle_event is in the stack trace because telemetry uses dynamic dispatch to emit the event, so the call to the handler happens inside ecto_sql, but in the stacktrace of the process that originally executed the query.

ecto_sql v3.7.2 emits the event via telemetry, and telemetry emits the event via dynamic dispatching.

In ecto_sql v3.7.2 (the current mix.lock version for opentelemetry_ecto) it's line 927, which appears in the stacktrace as (ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:927: Ecto.Adapters.SQL.log/4

spencerdcarlson avatar Jun 07 '22 19:06 spencerdcarlson

Note that the Phoenix integration only records an exception if the exception event is triggered, which includes a stacktrace.

Right. And for reference, the spec is what we have to work from. record_exception is just a convenience for add_event with some formatting niceties. In the case of an exception, you have to use one way to record the event or the other, but not both.
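
Roughly, recording an exception amounts to adding an "exception" event with the semantic-convention attributes; a hedged sketch (the attribute values are made up, and this is not the library's exact implementation):

OpenTelemetry.Span.add_event(span_ctx, "exception", %{
  "exception.type" => "Elixir.Postgrex.Error",
  "exception.message" => "relation \"foo\" does not exist",
  "exception.stacktrace" => Exception.format_stacktrace()
})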

bryannaegele avatar Jun 07 '22 19:06 bryannaegele

@spencerdcarlson I don't see a stacktrace in the Postgrex error struct. format_stacktrace is called by record_exception. handle_event line 120 is at the top of the stack because it is getting the current stacktrace where you are calling record_exception, not a stacktrace to where the error occurred.
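
Concretely, when no stacktrace is supplied the current process's stacktrace is used, so the top frames are wherever the recording happens, e.g.:

{:current_stacktrace, frames} = Process.info(self(), :current_stacktrace)
IO.puts(Exception.format_stacktrace(frames))
# Exception.format_stacktrace/1 with no argument (or nil) fetches the same
# thing via Process.info/2, which is the fallback described above.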

tsloughter avatar Jun 07 '22 19:06 tsloughter

@tsloughter thanks for the catch! That explains why the stacktrace originates from handle_event.

spencerdcarlson avatar Jun 07 '22 19:06 spencerdcarlson

Hey there 👋🏻 I work on APM, and the datadogexporter. It is not entirely clear to me what the issue is from reading this. There is a lot of information sprinkled around the responses which is specific to another project so it's hard for me to make sense of it.

Can you please help me understand the specific issue that exists inside the datadogexporter? Perhaps we can open a new issue specific to that in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues ?

It looks to me like if the status code of the trace is set to error, and the error.msg tag is not specifically set by the user, it will be set to the status message here.

In order to help, it'd be helpful to me if I could understand the issue specifically related to the context of the exporter.

gbbr avatar Jun 08 '22 11:06 gbbr

@tsloughter I just looked and found that ecto_sql 3.8.0 now exports the stacktrace. ~~I don't think they have implemented the functionality of exporting the stacktrace in either postgrex or db_connection (my guess is that it will happen in db_connection in the future).~~ It looks like setting stacktrace: true gives us what we are looking for.

mix deps.update ecto_sql && \
  mix test test/opentelemetry_ecto_test.exs:110

With ^this I can see that :stacktrace is now in the metadata, but it is still coming in as nil.

I added some backwards-compatible changes to support using the proper stacktrace, defaulting to generating it from process info. If the :stacktrace key is not in the metadata then we will get the stacktrace generated from the process info, with the two extra lines tacked on. Once ecto fully supports propagating the stacktrace it will be more precise. Thoughts?
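
To make that concrete, a minimal sketch of the idea (hypothetical helper name; the real change lives in OpentelemetryEcto.handle_event/4 and the exact option handling may differ):

defp maybe_record_exception(span, %{__exception__: true} = exception, metadata) do
  # ecto_sql >= 3.8.0 with stacktrace: true puts the caller's stacktrace in the
  # telemetry metadata; on older versions (or stacktrace: false) it is nil, and
  # record_exception/3 then falls back to the process-info stacktrace.
  OpenTelemetry.Span.record_exception(span, exception, Map.get(metadata, :stacktrace))
end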

spencerdcarlson avatar Jun 10 '22 16:06 spencerdcarlson

@tsloughter || @bryannaegele any update on this? Could you please help me understand why you might be against merging this MR? With the stacktrace coming from the emitted event, this seems extremely similar to how the phoenix events are handled.

spencerdcarlson avatar Jun 22 '22 22:06 spencerdcarlson

So the stacktrace is included if the user adds a :stacktrace option on each query?

Maybe if the PR took into account whether it was an "accurate" stacktrace somehow and only included it as an exception in that case I could see it being added. Would that be possible?

Since the error status cannot have a stacktrace I see the desire for having an exception event, but if the stacktrace isn't guaranteed to point to the right place then we should only record the exception when it does.

tsloughter avatar Jun 22 '22 23:06 tsloughter

Actually, is the stacktrace actually useful in this scenario? Particularly if it is a failure of the query I wouldn't think so. It would simply point you to the query, which is already given in the attributes.

The information in red here https://user-images.githubusercontent.com/1206303/172432438-99dacd7a-5dec-478c-8ef2-738a7c72395a.png should be what the status is already (minus the query, since that is a separate attribute of the span).

tsloughter avatar Jun 22 '22 23:06 tsloughter

Sorry if this is frustrating :). We don't want to take adding a stacktrace for every user lightly. Stacktraces can be problematic (sensitive information and potentially large).

tsloughter avatar Jun 22 '22 23:06 tsloughter

So the stacktrace is included if the user adds a :stacktrace option on each query?

It is more common to see Repo configurations set at the application level, but for ecto they can be set on an individual query level as well.

# File: instrumentation/opentelemetry_ecto/config/test.exs
config :opentelemetry_ecto, OpentelemetryEcto.TestRepo,
  hostname: "localhost",
  username: "postgres",
  password: "postgres",
  database: "opentelemetry_ecto_test",
  pool: Ecto.Adapters.SQL.Sandbox,
  stacktrace: true  
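
As noted above, the option can also be passed per call; an illustrative example (assumes Ecto/ecto_sql 3.8+, where :stacktrace is a shared repo option):

# Per-query override of the repo-level setting.
Repo.all(from(u in "users", select: u.name), stacktrace: true)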

Maybe if the PR took into account whether it was an "accurate" stacktrace somehow and only included it as an exception in that case I could see it being added. Would that be possible?

I think technically we could look at the repo in the metadata:

# File: instrumentation/opentelemetry_ecto/lib/opentelemetry_ecto.ex
stacktrace? = repo.config()[:stacktrace]

But I think this is a bad idea, since it doesn't account for whether the option is set or overridden at the individual query level.

Actually, is the stacktrace actually useful in this scenario? Particularly if it is a failure of the query I wouldn't think so. It would simply point you to the query, which is already given in the attributes.

When you get a query timeout or a postgres error code you will only see those details and nothing to help point you to which query ran into an issue -- I would argue that the stacktrace is very valuable.

# File: test/opentelemetry_ecto_test.exs
# Timeout example 
      Repo.transaction(fn ->
        :timer.sleep(:timer.seconds(1))
        Repo.all((from u in "users", select: u.non_existant_field))
      end, timeout: 10)
# message = "connection is closed because of an error, disconnect or timeout"      

Here are my thoughts on why the MR should go in as currently proposed.

  • We are only adding the stack trace if the query result is an exception. We are delegating to the library on when to record an exception.
  • If an exception is raised we will attempt to include the stacktrace details in a backwards-compatible fashion via one of two approaches:
    1. If the stacktrace: true option is set (at the application level or the individual query level) we pull the pristine stacktrace from the event metadata and propagate it.
    2. If the pristine stacktrace is not available, then passing nil into OpenTelemetry.Span.record_exception/3 as the stacktrace will cause a new stacktrace to be generated, which is accurate but includes some extra context prefixed. This is the current default behavior of OpenTelemetry.Span.record_exception/4, which is part of the opentelemetry_api, implying that this is an acceptable default. It is also backwards compatible in the sense that the :stacktrace option has only been available since v3.8.0 (2022-04-26), so we only want to use it if it is currently set, so as not to exclude any prior ecto versions.
  • This is similar to the approach taken by the opentelemetry_phoenix instrumentation.

If you have any concerns with these please let me know.

Sorry if this is frustrating :).

No problem. I am happy to work toward the best decision wherever that ends up.

spencerdcarlson avatar Jul 20 '22 17:07 spencerdcarlson