[receiver/otlpreceiver] Support Rate Limiting
Is your feature request related to a problem? Please describe.
The OpenTelemetry Specification outlines throttling for both gRPC and HTTP; however, the OTLP receiver does not currently support this (optional) specification.
Right now, if a processor is under pressure, its only option is to return an error that causes the receiver to tell the client that the request failed and is not retryable.
Describe the solution you'd like
It would be neat if the receiver exported an error implementation that pipeline components could return to signal that the receiver should send an appropriately formatted rate-limited response to the client. The format of the response should follow the semantic convention (e.g. the HTTP receiver should return a status code of 429 and set the "Retry-After" header).
For example, the OTLP receiver could export the following error implementation:
package errors

import "time"

// ErrorRateLimited signals that the request was rejected because the pipeline
// is rate limiting, optionally with a recommended backoff before retrying.
type ErrorRateLimited struct {
	Backoff time.Duration
}

func (e *ErrorRateLimited) Error() string {
	return "Too Many Requests"
}

// ErrRateLimited can be returned directly when there is no backoff recommendation.
var ErrRateLimited error = &ErrorRateLimited{}

// NewErrRateLimited returns the error with a recommended backoff duration.
func NewErrRateLimited(backoff time.Duration) error {
	return &ErrorRateLimited{
		Backoff: backoff,
	}
}
Any processor or exporter in the pipeline could return (optionally wrapping) this error:
import "go.opentelemetry.io/collector/receiver/otlpreceiver"
func (p *processor) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
return otlpreceiver.NewErrRateLimited(time.Minute)
}
Then when handling errors from the pipeline, the receiver could check for this error:
// Assumed imports: "errors", "net/http", "strconv", "time".
var (
	err error               // error returned by the pipeline
	w   http.ResponseWriter // writer for the in-flight request
)

errRateLimited := &ErrorRateLimited{}
if errors.As(err, &errRateLimited) {
	// Retry-After is expressed in whole seconds.
	w.Header().Set("Retry-After", strconv.FormatInt(int64(errRateLimited.Backoff/time.Second), 10))
	w.WriteHeader(http.StatusTooManyRequests)
}
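The gRPC side could map the same error to the spec's gRPC throttling convention (an Unavailable status that may carry google.rpc.RetryInfo). A minimal sketch, assuming the ErrorRateLimited type above; the helper name is made up:

import (
	"errors"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/durationpb"
)

// grpcStatusFromError converts a rate-limit error from the pipeline into a
// gRPC status the OTLP/gRPC receiver could return to the client.
func grpcStatusFromError(err error) error {
	errRateLimited := &ErrorRateLimited{}
	if !errors.As(err, &errRateLimited) {
		return err
	}
	st := status.New(codes.Unavailable, "rate limited")
	if detailed, detailErr := st.WithDetails(&errdetails.RetryInfo{
		RetryDelay: durationpb.New(errRateLimited.Backoff),
	}); detailErr == nil {
		st = detailed
	}
	return st.Err()
}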
Describe alternatives you've considered
To accomplish rate limiting, a fork of the OTLP receiver will be used. Here are the changes: https://github.com/open-telemetry/opentelemetry-collector/compare/main...blakeroberts-wk:opentelemetry-collector:otlpreceiver-rate-limiting.
Additional context
The above example changes include the addition of an internal histogram metric which records server latency (http.server.duration or rpc.server.duration) to allow monitoring of the collector's latency, throughput, and error rate. This portion of the changes is not necessary to support rate limiting.
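For reference, recording such a server-latency histogram with the OpenTelemetry Go metrics API could look roughly like this; the instrument name follows the semantic conventions mentioned above, while the meter name and helper are illustrative:

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// recordServerDuration records request latency in milliseconds under
// http.server.duration (rpc.server.duration would be the gRPC equivalent).
// In practice the instrument would be created once, not per request.
func recordServerDuration(ctx context.Context, start time.Time) {
	meter := otel.Meter("otlpreceiver")
	hist, err := meter.Float64Histogram("http.server.duration", metric.WithUnit("ms"))
	if err != nil {
		return
	}
	hist.Record(ctx, float64(time.Since(start))/float64(time.Millisecond))
}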
There is an open issue regarding rate limiting (https://github.com/open-telemetry/opentelemetry-collector/issues/3509); however, the approach suggested there involves Redis, which goes beyond what I believe is necessary for the OTLP receiver to support rate limiting.
Can you clarify that this is not only for OTLP but would be applicable to any pipeline with an exporter able to return this error?
Yeah, that's a good point. The collector could have a general errors or receiver/errors package whose errors any receiver (or possibly even scrapers?) could look for in the return value from its next consumer. One point to keep in mind, though, is that the shape of the response to the request in this case follows the OTel specification, but any non-OTLP receiver looking for these errors could handle them in accordance with its own specification, if any.
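A rough sketch of the protocol-neutral shape this could take; the package path and names below are hypothetical, not an existing collector API:

package receivererror // hypothetical location, e.g. under go.opentelemetry.io/collector/receiver

import (
	"errors"
	"time"
)

// RateLimited mirrors the ErrorRateLimited type above, but lives in a
// protocol-agnostic package so any receiver can depend on it.
type RateLimited struct {
	Backoff time.Duration
}

func (e *RateLimited) Error() string { return "rate limited" }

// FromError reports whether err signals rate limiting and, if so, the
// recommended backoff, regardless of which component in the pipeline set it.
func FromError(err error) (time.Duration, bool) {
	rl := &RateLimited{}
	if errors.As(err, &rl) {
		return rl.Backoff, true
	}
	return 0, false
}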
https://github.com/open-telemetry/opentelemetry-collector/pull/9357 does not fully implement the OTel specification for OTLP/HTTP throttling: there is no way to set the Retry-After header.
@TylerHelmuth can you take a look?
The Retry-After header is optional. If the server has a recommendation for how the client should retry, it can be set, but the server is not required to provide this recommendation (and often may not be able to give a good one).
If the client receives an HTTP 429 or an HTTP 503 response and the “Retry-After” header is not present in the response, then the client SHOULD implement an exponential backoff strategy between retries.
The 429/503 response codes are enough to get an OTLP client to start enacting a retry strategy.
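As a rough sketch of the client-side behavior the quoted text describes (exponential backoff between retries when no Retry-After header is present); the base, cap, and function name are illustrative, and real clients would typically add jitter:

// Assumed import: "time".
func retryDelay(attempt int) time.Duration {
	const base = time.Second
	const maxDelay = 2 * time.Minute
	d := base << attempt // 1s, 2s, 4s, ...
	if d <= 0 || d > maxDelay {
		d = maxDelay // guard against overflow and cap the delay
	}
	return d
}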
Reading through the issue again I agree that we could introduce more to the collector to allow components to explicitly define how they want clients to retry in known, controlled scenarios. For that use case, this issue is not completed yet.
@TylerHelmuth Thank you for your response.
To support your analysis with personal experience: I have a custom processor that limits the number of unique trace IDs per service per minute. In this case, it is possible to determine the appropriate duration after which it should be permissible for the service to resubmit its trace data.
Allowing an OTLP client to use exponential backoff is sufficient but not optimal. By optimal I mean a solution that, within system limits, minimizes both the delay between when an operation of a service creates a span or trace and when that span or trace can be queried from a backend storage system, and the resources (CPU, memory, network, I/O) required to report the span or trace from the originating service to that backend. In most cases, however, the benefit from this optimization will be small if not negligible.
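For what it's worth, a rough sketch of how such a per-service, per-minute unique-trace-ID limit could compute a concrete backoff; this is illustrative (the type, fields, and admit helper are made up), not the actual processor:

// Assumed imports: "time", "go.opentelemetry.io/collector/pdata/pcommon".
type traceIDLimiter struct {
	limit       int
	windowStart time.Time
	seen        map[string]map[pcommon.TraceID]struct{} // service -> unique trace IDs this window
}

// admit reports whether a trace ID may pass and, if not, how long the client
// should wait: the time remaining in the current one-minute window.
func (l *traceIDLimiter) admit(service string, id pcommon.TraceID, now time.Time) (bool, time.Duration) {
	if now.Sub(l.windowStart) >= time.Minute {
		l.windowStart = now
		l.seen = map[string]map[pcommon.TraceID]struct{}{}
	}
	ids := l.seen[service]
	if ids == nil {
		ids = map[pcommon.TraceID]struct{}{}
		l.seen[service] = ids
	}
	if _, ok := ids[id]; !ok && len(ids) >= l.limit {
		return false, l.windowStart.Add(time.Minute).Sub(now)
	}
	ids[id] = struct{}{}
	return true, 0
}

A processor's ConsumeTraces could call admit for each trace it sees and, on rejection, return NewErrRateLimited with the remaining window duration.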
@blakeroberts-wk Can you open source this custom processor that limits the number of unique trace IDs per service per minute?
Remark: sampling based on rate limits is available here: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor https://pkg.go.dev/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor#section-readme
What is the latest update on this issue?
I wonder if a header or something should be sent from services that are throttled so that complete traces get sampled, similar to the Completeness Property.