veneur Introduce a new metrics client that supports dependency injection.

I'd like to propose a metrics client approach that avoids top level functions and allows for dependency injection.

Motivation We have two prescribed ways of adding metrics today:

a.) creating a span, adding metrics, and finishing it. b.) creating a SSF.Sample, adding metrics to it, and calling metrics.ReportBatch to submit it.

These approaches are tightly coupled with implementation and impose strict dependencies on clients. It's not possible to replace ReportBatch with a mock object, for example, or a different stats backend. Consider this use case:

type MyTask struct{
    metricsClient metrics
}

type metrics interface {
    Count(string, int, map[string]string)
}

func (t MyTask) DoJob(ctx context.Context) {
    // do things ...
    t.metricsClient.Count("finished", 1)
}

The above code has a couple of nice qualities about it:

a.) It doesn't depend on the metrics implementation in any way. It doesn't even import anything metrics related! We can change out the metrics implementation to a mock for testing, for example. This also facilitates simpler migrations if we need to change out the metrics implementation.

b.) It doesn't need to know anything about tracing. This is important since it's not always practical to start a span in every function. For example, I may want to use runtime.trace instrumentation instead of distributed tracing for in-process traces. Moreover, extracting spans from context is very fragile. If DoJob's caller isn't using veneur's tracing library, for example, we can't use that approach to insert metrics.

c.) We don't lose any of the functionality we had before. If we want to add metrics to a span, whoever constructs MyTask needs to inject a SSFAddingClient configured with the trace we want to attach to. This is entirely opaque to MyTask's code.

This approach is much more idiomatic golang and affords us greater flexibility.

By contrast, this is what today's code looks like:

type MyTask struct{}

type metrics interface {
    Count(string, int, map[string]string)
}

// either this, which can't be mocked
func (t MyTask) DoJob(ctx context.Context) {
    // do things ...
    sample := &ssf.SSFSample{}
    sample.Add(...)
    metrics.BatchReport(...)
}

// or this
func (t MyTask) DoJob(ctx context.Context) {
    span, ok := opentracing.SpanFromContext(ctx).(*trace.Trace)
    if !ok {
        // no access to a span so... do we create a new one? but what if there's good reasons not to here?
    }
}

For now, I've hidden the implementation inside internal/ because I don't want to make API commitments to external users until we have some experience and can finalize the API.

For a bit more context on why this is the preferred idiom in Go, I wrote about dependency injection and interface use here: https://docs.google.com/document/d/1phD0rjOtmOg2nwponuiXmM02YTpYUtW_KmTun_b_UTo/edit#heading=h.z2a5c0bfp590

r? @stripe/observability r? @ChimeraCoder @asf-stripe @cory-stripe

Nov 08 '18 00:11 clin-stripe

Gerald Rule: Copy Observability on Veneur, Unilog, Falconer pull requests

cc @stripe/observability cc @stripe/observability-stripe

Nov 08 '18 01:11 stripe-ci

I like the way you think, but I would much rather go a different direction. Some reasons:

I super disagree with the "not having to know about tracing" point - the trace span is the atomic unit of the SSF philosophy, and we should embrace it, or completely throw out SSF (though I think it's the right level of abstraction).
I feel pretty attached to this interface, partially because I wrote it (lol) but also because I genuinely think it's the one that gives users the most control what happens with their traces. That may again be because I wrote it, but hm.
We do lose some functionality we had before - with the SSF client, you can report a batch of metrics, which (at least as far as I know) reduces the amount of protobuf-encoding and syscalling we do to in order to submit these things. This is not a thing that we would need to worry about in tests, but we do have services currently in use that spew enough data that veneur's syscall volume itself consumes almost a whole cpu core. Adding more wouldn't be great.
We can do it without changing the interface out at all - you could just as well write a statsd backend for the currently-existing trace client, for use with NewBackendClient, which is a thing that we already use to construct trace clients for tests. IN FACT, that could be used to really really good effect in the span-processing internals, where we have previously had to be very careful so we don't write-amplify. Something like:

// NewStatsdOnlyBackend returns a statsd backend that can be used with
// veneur's trace.NewBackendClient to construct a trace client that
// reports only metrics, through the statsd protocol.
func NewStatsdOnlyBackend(c *statsd.Client) ClientBackend {
	return &statsdBackend{
		c: c,
	}
}

// Close is a no-op.
func (sb *statsdBackend) Close() error {
	return nil
}

// SendSync reports every sample on the span as a statsd metric, and
// discards the span data itself.
func (sb *statsdBackend) SendSync(ctx context.Context, span *ssf.SSFSpan) error {
	for _, s := range span.Metrics {
		srMultiplier := 1 / s.SampleRate

		tags := make([]string, 0, len(s.Tags))
		for tn, tv := range s.Tags {
			tags = append(tags, fmt.Sprintf("%s:%s", tn, tv))
		}

		switch s.Metric {
		case ssf.SSFSample_COUNTER:
			sb.c.Count(s.Name, int64(s.Value*srMultiplier), tags, 1)
		case ssf.SSFSample_GAUGE:
			sb.c.Gauge(s.Name, float64(s.Value), tags, 1)

			// ...
		}
	}
        return nil
}

...so I think the above would be kinda neat because then we can use trace clients all the way (no more of that statsd lurk) and not worry about write-amplifying.

I should add I'm also not super in love with the interface you wrote up... having to write all these methods seems not-great and adds a lot of overhead for prospective trace client writers, compared the two (with an optional third) that ClientBackend requires...

Nov 08 '18 11:11 asf-stripe

Let's set up some time to discuss, I think some signal is being lost over text.

Nov 08 '18 16:11 clin88

Coming along a bit late here! :ie: The gist of making this more ergonomic and testable is all marvelous!

If I'm reading this right — and forgive me, I am poor at this when traveling — @asf-stripe's suggestion feels more in line with how Go's HTTP libraries work in that you use the same top level client but swap out the Transport if you want to test. This way clients don't have to make an implementation of the client, only of the backend and we can easily create a testable one you can plop in!

But, to reiterate any improvement here is good!

Nov 08 '18 16:11 cory-stripe

@asf-stripe and I had a good conversation.

We'd like to keep talking about it but we should pace the conversation more deliberately. I don't feel the need to push this now... copying what I wrote in slack:

hey, regarding dependency injection and trace library asf and i had a good conversation it sounds like this is a big change for the team id like to keep having this conversation, but i dont feel the need to introduce this now. maybe after everyone has commented, we can get the people with the biggest stake into a room and hash it out the thing i think we need to agree on before hashing out implementation is what problems we’re solving for. what use cases can we anticipate? what does the client have to look like to facilitate this? ill follow up w people and figure out whats a good pace to have that talk... id like to wait for aditya to be back and i know kris had a proposal for the client lib as well

Nov 08 '18 19:11 clin-stripe

veneur veneur copied to clipboard

Introduce a new metrics client that supports dependency injection.

Gerald Rule: Copy Observability on Veneur, Unilog, Falconer pull requests

veneur
veneur copied to clipboard