apm-agent-python icon indicating copy to clipboard operation
apm-agent-python copied to clipboard

[EVENT OUTCOME] 4xx status codes are always reported as errors

Open sublime1809 opened this issue 2 years ago • 5 comments

Describe the bug: For any calls to services, all 4xx status codes are reporting as error. This includes expected 401 / 403 / 404 codes, which just indicate an issue with the requester (not an error in code).

This causes our alerts to be useless since we cannot tell if something is an error (to alert every time to fix) or just an invalid request (to alert for anomalies).

To Reproduce

  1. Set up monitoring for a project with GCS
  2. Have code call to GCS bucket to check for existence on a file that doesn't exist (GCS will return 404)
  3. Pull up dashboard (or alerts)
  4. See that it is reported as an error

Environment (please complete the following information)

  • OS: Linux
  • Python version: 3.8
  • Framework and version [e.g. Django 2.1]: Django 3.8
  • APM Server version: 7.15
  • Agent version: 6.14.0

Additional context

  • requirements.txt:

    Click to expand
    google-cloud-storage==1.42.0
    

Observations It seems to be handled differently on different parts of code:

Suggestions Could we have 4xx's return as invalid request or warning levels instead? We could still be able to monitor and alert for them, but differently than 5xx's

sublime1809 avatar Feb 17 '23 17:02 sublime1809

In general, this seems like expected behavior to me. When we're the server, as in the case of most traces, incoming requests should only be a "failure" for HTTP 500+. But when we're calling out to other services, a 404 is a problem, because we're the client in this case. If I'm a client reaching out to a service and receive a 4xx, that's a failed request, right? No code problems on the server, presumably, but something has gone wrong.

The GCS bucket use case is an edge case, and an interesting one. For S3, this is handled in the botocore instrumentation, which creates leaf spans (meaning underlying http spans will be discarded, as they don't add additional useful information). This is why you wouldn't see the same problem for S3 calls, because the botocore instrumentation handles success in a service-specific way.

Are you interacting with GCS directly via requests to their API, or are you using a library to interact with it? If it's a library, we can add a GCS-specific instrumentation to do "success"-handling in a GCS-specific way. You could also wrap your calls with elasticapm.capture_span with leaf=True to discard the underlying span and set the outcome yourself.

basepi avatar Feb 27 '23 18:02 basepi

Could we have 4xx's return as invalid request or warning levels instead? We could still be able to monitor and alert for them, but differently than 5xx's

Currently, spans have only two outcomes, success and failure. Having something in the middle is an interesting idea, though...

basepi avatar Feb 27 '23 18:02 basepi

So currently, we have two use cases that cause our error logs to go haywire.

  1. For checking authentication, we return 403 / 401 on each page if their session expired. This is expected behavior, and we want to alert if we see an anomalous increase (i.e. our authentication service is ALWAYS returning 403s)
  2. For GCS, we're using their distributed library, which calls a .exists and that call returns a 404 if it doesn't exist and the library returns False.

I will look into that wrapper to see if we can do that, but since it's so far in a library, i don't know if it will be feasible :(

sublime1809 avatar Feb 27 '23 19:02 sublime1809

For checking authentication, we return 403 / 401 on each page if their session expired. This is expected behavior, and we want to alert if we see an anomalous increase (i.e. our authentication service is ALWAYS returning 403s)

Interesting. Yeah, for the server handling code, that transaction should definitely not be an error. But again, as a client, if I'm getting a 4xx for authentication it still feels like that span is a "failure".

For GCS, we're using their distributed library, which calls a .exists and that call returns a 404 if it doesn't exist and the library returns False.

Yeah we should definitely add a proper instrumentation for that library: https://github.com/elastic/apm-agent-python/issues/1763

basepi avatar Feb 27 '23 19:02 basepi

I come across this issue when using Python Celery (Another library with APM instrumentation provided) and using S3 as the results back end. Celery will query S3 for the results as a HEAD request. If the celery task hasn't completed yet, S3 will return the 404. For longer running tasks, there's a fair number of failures on those 404s.

I don't have a clean suggestion as in some cases a 404 really is a failure. In others, it's not.

coolacid avatar Sep 22 '23 20:09 coolacid