[EVENT OUTCOME] 4xx status codes are always reported as errors
Describe the bug: For any calls to services, all 4xx status codes are reported as errors. This includes expected 401 / 403 / 404 codes, which just indicate an issue with the requester (not an error in the code).
This makes our alerts useless, since we cannot tell whether something is an error (alert every time, so it gets fixed) or just an invalid request (alert only on anomalies).
To Reproduce
- Set up monitoring for a project with GCS
- Have code call a GCS bucket to check for the existence of a file that doesn't exist (GCS will return a 404; see the sketch after these steps)
- Pull up dashboard (or alerts)
- See that it is reported as an error
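A minimal reproduction sketch of the second step, assuming an app already instrumented with the Elastic APM agent; the bucket and object names are placeholders:

```python
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("does-not-exist.txt")  # placeholders

# google-cloud-storage makes an HTTP request under the hood; GCS answers
# with a 404, the library swallows it and returns False, but the agent
# records the underlying HTTP span with outcome "failure".
print(blob.exists())  # -> False
```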
Environment
- OS: Linux
- Python version: 3.8
- Framework and version: Django 3.8
- APM Server version: 7.15
- Agent version: 6.14.0
Additional context
requirements.txt:
```
google-cloud-storage==1.42.0
```
Observations
It seems to be handled differently in different parts of the code:
- in `traces.py`, it treats only 5xx's as errors
- in the instrumentation, it treats 4xx's as errors too
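An illustrative sketch (not the agent's actual code) of the two rules described above:

```python
def transaction_outcome(status_code: int) -> str:
    # Server side (the traces.py rule): an incoming 4xx means the *client*
    # sent a bad request, so only 5xx counts as a failure.
    return "failure" if status_code >= 500 else "success"

def span_outcome(status_code: int) -> str:
    # Client side (the instrumentation rule): any 4xx/5xx on an outbound
    # call counts as a failure.
    return "failure" if status_code >= 400 else "success"
```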
Suggestions
Could we have 4xx's return as an invalid request or a warning level instead? We could still monitor and alert on them, just differently than 5xx's.
In general, this seems like expected behavior to me. When we're the server, as in the case of most traces, incoming requests should only be a "failure" for HTTP 500+. But when we're calling out to other services, a 404 is a problem, because we're the client in this case. If I'm a client reaching out to a service and receive a 4xx, that's a failed request, right? No code problems on the server, presumably, but something has gone wrong.
The GCS bucket use case is an edge case, and an interesting one. For S3, this is handled in the botocore instrumentation, which creates leaf spans (meaning underlying http spans will be discarded, as they don't add additional useful information). This is why you wouldn't see the same problem for S3 calls, because the botocore instrumentation handles success in a service-specific way.
Are you interacting with GCS directly via requests to their API, or are you using a library to interact with it? If it's a library, we can add a GCS-specific instrumentation to do "success"-handling in a GCS-specific way. You could also wrap your calls with elasticapm.capture_span with leaf=True to discard the underlying span and set the outcome yourself.
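A minimal sketch of that wrapping approach, assuming the context manager yields the span object and that spans expose a settable outcome attribute in this agent version; the blob_exists helper is hypothetical:

```python
import elasticapm

def blob_exists(blob):
    # leaf=True discards the underlying HTTP span, so the 404 it would
    # carry never gets reported as a failure.
    with elasticapm.capture_span("gcs.exists", span_type="storage", leaf=True) as span:
        exists = blob.exists()  # returns False on 404 instead of raising
        if span is not None:  # span is None when the transaction isn't sampled
            span.outcome = "success"  # a 404 here is an expected answer
        return exists
```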
> Could we have 4xx's return as an invalid request or a warning level instead? We could still monitor and alert on them, just differently than 5xx's.
Currently, spans have only two outcomes, success and failure. Having something in the middle is an interesting idea, though...
So currently, we have two use cases that cause our error logs to go haywire.
- For checking authentication, we return a 403 / 401 on each page if the user's session has expired. This is expected behavior, and we want to alert if we see an anomalous increase (e.g. our authentication service is ALWAYS returning 403s)
- For GCS, we're using their distributed library, which calls `.exists`, and that call returns a 404 if the file doesn't exist and the library returns `False`.
I will look into that wrapper to see if we can do that, but since it's so deep inside a library, I don't know if it will be feasible :(
> For checking authentication, we return a 403 / 401 on each page if the user's session has expired. This is expected behavior, and we want to alert if we see an anomalous increase (e.g. our authentication service is ALWAYS returning 403s)
Interesting. Yeah, for the server handling code, that transaction should definitely not be an error. But again, as a client, if I'm getting a 4xx for authentication it still feels like that span is a "failure".
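For the server-side half of this, the agent's public API already lets you override a transaction's outcome explicitly. A hedged sketch for a Django view (the view and session check are hypothetical):

```python
import elasticapm
from django.http import JsonResponse

def protected_view(request):
    if not request.session.get("user_id"):  # hypothetical session check
        # Expired session: an expected condition, not a code failure, so
        # mark the transaction as a success despite the 403 status.
        elasticapm.set_transaction_outcome(outcome="success", override=True)
        return JsonResponse({"detail": "session expired"}, status=403)
    return JsonResponse({"ok": True})
```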
> For GCS, we're using their distributed library, which calls `.exists`, and that call returns a 404 if the file doesn't exist and the library returns `False`.
Yeah, we should definitely add a proper instrumentation for that library: https://github.com/elastic/apm-agent-python/issues/1763
I came across this issue when using Python Celery (another library with APM instrumentation provided) and S3 as the result backend. Celery queries S3 for the result with a HEAD request, and if the Celery task hasn't completed yet, S3 returns a 404. For longer-running tasks, that produces a fair number of failures from those 404s.
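A hedged sketch of that setup, in case it helps reproduce it; the broker URL and bucket names are placeholders:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker
app.conf.update(
    result_backend="s3",
    s3_bucket="my-results-bucket",   # placeholder
    s3_base_path="celery-results/",  # placeholder
)

@app.task
def add(x, y):
    return x + y

# Polling result.get() makes Celery check S3 for the stored result; until
# the task finishes, S3 answers 404 and each poll shows up as a failed span.
result = add.delay(2, 2)
print(result.get(timeout=30))
```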
I don't have a clean suggestion as in some cases a 404 really is a failure. In others, it's not.