
5xx responses not tagged as error by proxy

Open Skalsi opened this issue 3 years ago • 12 comments

Is your feature request related to a problem? Please describe. 5xx responses from a backend do not get tagged as error:true on the OpenTracing span. We use a stream in Lightstep to track the availability of a service behind an API gateway. Because these operations are not tagged error:true, the stream reports an error rate of 0% despite the 5xx responses. (Screenshot of the Lightstep stream attached.)

Describe the solution you would like Give the possibility of tagging the span as an error via a config option.

Describe alternatives you've considered (optional) Currently, we are developing a custom filter that can be added to a route and will mark the span with error:true depending on the response from the backend.
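The custom filter itself is not shown in the issue. As a rough, self-contained illustration, here is the decision logic such a filter's response hook might apply; `markBackendError`, the minimal `Span` interface, and `tagRecorder` are hypothetical stand-ins, not Skipper's actual filter or OpenTracing API:

```go
package main

import "fmt"

// Span is a tiny stand-in for a tracing span; only SetTag is
// needed for this sketch.
type Span interface {
	SetTag(key string, value interface{})
}

// markBackendError tags the span as an error when the backend
// answered with a 5xx status, and reports whether it did so.
func markBackendError(span Span, statusCode int) bool {
	if statusCode >= 500 && statusCode <= 599 {
		span.SetTag("error", true)
		return true
	}
	return false
}

// tagRecorder records tags, playing the role of a span here.
type tagRecorder map[string]interface{}

func (t tagRecorder) SetTag(key string, value interface{}) { t[key] = value }

func main() {
	span := tagRecorder{}
	markBackendError(span, 502) // a backend 502 marks the span
	fmt.Println(span["error"])  // prints "true"
}
```

A real Skipper filter would run this logic in its response phase, reading the status code from the filter context rather than taking it as a parameter.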


Would you like to work on it? Yes

Skalsi avatar Feb 16 '21 16:02 Skalsi

I am strongly against this. Consider it this way: when a backend returns a 500 and the proxy successfully proxies this result, the proxy didn't fail, ergo it is not an error from the proxy's perspective. My understanding of OpenTracing is that it is enough to mark with error=true only those spans whose own operation resulted in an error, and to evaluate the trace as a whole only at analytics time. There is no need to propagate the errors of the child spans into the parent spans.

There are various reasons for this. In general, a span can have multiple child spans, and it is possible that when a non-critical child span fails, the parent span is still considered a success, either because the failing one is not critical or because a fallback was relied on.

But there is a more important reason here, and this one is more specific. Imagine that someone monitors Skipper errors via OpenTracing, and for this, they set up a 'stream' with a condition of component=skipper && error=true. Now if the backend 5xx responses start setting error=true in skipper, that will completely break this use case.

My recommendation is to mark with error=true only the span in the backend application.

aryszka avatar Feb 26 '21 13:02 aryszka

@aryszka While I agree with you, there is also the use case of running Skipper as a distributed load balancer, adding it to an application pod to automatically create visibility for an application that is not instrumented, say a 3rd-party binary that you can't easily instrument yourself. In that case you do want error=true to be set, so my vote would be for a flag that decides whether it is set as an error or not. The owner of the application can decide whether to set the flag. Wdyt?

szuecs avatar Feb 27 '21 19:02 szuecs

The use case is valid, but there are alternative approaches. I would still avoid going against the OpenTracing semantics. Instead of using the well-defined error tag, I would set a different tag, e.g. backend_error=true. This way we can equally search for the errors in the application, without contaminating Skipper's spans or breaking OpenTracing concepts. This approach also avoids having to distinguish between deployment types.

aryszka avatar Mar 01 '21 16:03 aryszka

@aryszka @szuecs To me both use cases are valid: some users will want Skipper to consider an operation failed based on HTTP semantics (given that Skipper is an HTTP reverse proxy), and some will not. Since it depends on the use case, I will think about adding config options to activate this type of instrumentation or not. We could probably even do it at route level (i.e. with a filter), and anyone who wants it as a default could add it to the default filters applied to every route.
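As a sketch of what such a route-level opt-in could look like in eskip syntax, assuming a hypothetical filter name `tagSpanOnBackend5xx` (the issue does not define an actual filter; route names and backends are illustrative):

```
// opt-in on a specific route (hypothetical filter name)
pets: Path("/pets")
  -> tagSpanOnBackend5xx()
  -> "http://pets-backend.example.org";

// other routes keep the default tracing behaviour
catchall: * -> "http://default-backend.example.org";
```

Adding the filter to Skipper's default filters would then apply it to every route without per-route changes.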

@aryszka About

I am strongly against this. Consider it this way: when a backend returns a 500 and the proxy successfully proxies this result, the proxy didn't fail, ergo it is not an error from the proxy's perspective. My understanding of OpenTracing is that it is enough to mark with error=true only those spans whose own operation resulted in an error, and to evaluate the trace as a whole only at analytics time.

I always have big doubts about this, given that we want measurements as close to the client as possible, to see availability from their perspective. Consider this analogy: you have a REST API whose main span is called get-pets, and this span has child spans, e.g. fetch-pets-from-database. If that operation fails, do you mark the get-pets operation as errored or not? I would, given that from the consumer's point of view the get-pets operation failed, and then check the trace to see which sub-operation caused the failure of the main one. I would do the same if, to get the pets, I connected to another web service which returned a 500. If you consider this valid, then Skipper used as a proxy or API gateway is the same case: the main operation should fail when a 5xx error is returned or triggered, whether by a sub-operation in Skipper or by the peer.service (backend) Skipper was trying to connect to.
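The get-pets analogy above can be sketched in a few lines of self-contained Go; the `span` type and its methods are a hypothetical stand-in for a real tracing API, used only to show the "parent inherits the child's failure" policy being argued for:

```go
package main

import "fmt"

// span is a minimal stand-in for a tracing span, just enough to
// illustrate the get-pets analogy; real OpenTracing APIs differ.
type span struct {
	operation string
	tags      map[string]interface{}
	children  []*span
}

func newSpan(op string) *span {
	return &span{operation: op, tags: map[string]interface{}{}}
}

func (s *span) startChild(op string) *span {
	c := newSpan(op)
	s.children = append(s.children, c)
	return c
}

// finish applies the consumer-centric policy: if any child span
// failed, the parent is also tagged error=true, because from the
// caller's perspective the whole operation failed.
func (s *span) finish() {
	for _, c := range s.children {
		if c.tags["error"] == true {
			s.tags["error"] = true
		}
	}
}

func main() {
	parent := newSpan("get-pets")
	child := parent.startChild("fetch-pets-from-database")
	child.tags["error"] = true // the database fetch failed

	parent.finish()
	fmt.Println(parent.tags["error"]) // prints "true"
}
```

The opposing view in this thread keeps `finish` a no-op: each span records only its own outcome, and the trace is evaluated as a whole at analytics time.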

Considering the generally available tooling for collecting traces and for measurement/analysis/reporting, take for example Lightstep and their stream concept: I would like to create a stream based on the first service and operation closest to the client (i.e. skipper-ingress, or Skipper used as an API gateway) to measure the availability and latency provided to my consumers. SLOs, and alerts on SLO breaches, will be based on that. Latency and errors are measured as close to the client as possible, considering that Skipper can add latency and can perform "business logic" operations such as rate limiting, authentication, authorization, etc., which can fail, and I don't want to "lose" the latency added or the errors that might be triggered by these operations.

For Adaptive Paging, we can use for example the peer.service tag to identify that the responsibility for the failure lies with the peer.service (backend) and not with Skipper itself (this would be needed only if the backend service is not instrumented, or was completely unreachable, so the backend never had the chance to start tracing and mark its sub-operation as failed).

About using backend_error=true: this tag is not recognized as an error by tooling such as Lightstep, so error-rate metrics will not be available.

rbarilani avatar Mar 10 '21 09:03 rbarilani

ICYMI, Zalando's implementation of Adaptive Paging is aware of infrastructure components such as Skipper and handles them differently. It does not need Skipper's spans to be marked as errors to continue its root-cause analysis.

I do not think that each service's individual ingress is "as close to the client as possible". The closest is the actual client.

This type of functionality is really seductive because it offloads a lot of the concerns to a common piece of infrastructure and may give the feeling that a lot can be achieved with less effort. While I agree this is true, experience has shown that this reduction to HTTP semantics reduces the signal to noise ratio. Applications tend to put less and less effort in their own instrumentation "because the ingress provides". The ingress will never be able to have the same level of domain knowledge that the actual target application holds. A concrete example where this doesn't work is that an HTTP 200 can be considered to be a failed operation (e.g. providing degraded results). The target application will be able to do so, the ingress won't. The functionality being proposed incentivizes a behavior where no one will consider putting that effort anymore.

The vendor-specific concerns that @rbarilani shared sound logical. IMHO, they should be addressed in the vendor-specific layers. If one wants to consider HTTP 5xx a good measurement of availability, that can be achieved locally by filtering for HTTP 5xx spans. The separation of the tracing error semantics (the error tag) from transport semantics such as HTTP is a blessing that should be exploited instead of eliminated.

lmineiro avatar Mar 10 '21 10:03 lmineiro

@lmineiro All valid points; we also discussed this offline. Still, since Skipper is a multi-purpose tool (it can be used as an ingress, an API gateway, or a simple reverse proxy in front of a service I can't control), I think we should not force concepts specific to our domain or our beliefs into it, but let the user decide how each operation should be instrumented, without touching the core code or forking the repository to change the OpenTracing instrumentation behavior. E.g. I see that at the moment 404s are marked as an error because the "route lookup failed"; this behavior might be right for certain use cases but not for others. For example, a public service will receive a lot of 404 requests (from bots, users mistyping a URL, etc.), and IMHO those do not represent an error.

The HTTP status code is also used as a discriminator to define availability SLOs in the SLO example of Google's SRE Workbook. If users want to use OpenTracing to monitor the availability of their service/operation, because it is convenient for them, why shouldn't they be able to? And why can't we make that easy for them with generally available vendors?

rbarilani avatar Mar 18 '21 13:03 rbarilani

@rbarilani You can also think from the position of the owner of the infrastructure service, who wants to see "route lookup failed" as an error. Right now I have no strong opinion; I basically never had the problem that someone asked me to investigate a routing issue like this.

szuecs avatar Mar 18 '21 19:03 szuecs

@szuecs I agree with you: you should still be able to detect this problem, and you can with logs or tags. As the infrastructure owner you should still be able to decide whether to mark that as an error, depending on your use case and how you are using Skipper (as a REST API gateway vs. as an ingress).

rbarilani avatar Mar 22 '21 10:03 rbarilani

@szuecs Off-topic for this issue but related to the "route lookup failed" error and OpenTracing: what about opening a child span for the match_route operation and marking it as an error when there is no match, without marking the parent (main) span as an error?

rbarilani avatar Mar 27 '21 11:03 rbarilani

@rbarilani Considering that I had to investigate a Lightstep issue because we had too many spans collected in our buffer to be able to send them, I would say it's not a great idea. I would rather reduce the number of spans. @aryszka we should not have one span per Redis operation.

@rbarilani Additionally, the overhead of a span is not 0, and the more we have, the more garbage we create that needs to be GCed again. On the other hand, if everyone thinks it would be a great solution to enable your use case, I won't say no.

szuecs avatar Mar 27 '21 22:03 szuecs

@szuecs Apart from the overhead of creating spans that need to be GCed, about

"investigate a Lightstep issue because we have too many spans collected in our buffer to be able to send these"

Is it a vendor-specific limitation, or a generic problem applicable to any vendor? At which load (ops) does it start to become a problem?

rbarilani avatar Mar 29 '21 07:03 rbarilani

@rbarilani Span size is more the problem than load. In case you worry about it, please update the Lightstep library to the latest version, because I created a fix for it.

szuecs avatar Apr 15 '21 18:04 szuecs