
InfluxDB as trace storage backend

Open yurishkuro opened this issue 7 years ago • 19 comments

Meta-issue on storage backends: #638

There is some work happening here https://github.com/openzipkin/zipkin/issues/1628

My interest at this time is what features such implementation could provide, i.e.

  • what would be write throughput per node with RF=2
  • could the backend support indexing of arbitrary tags / log fields, or do they need to be pre-defined
  • what is the write amplification or perf impact as a function of # of tags/fields per span
    • in Cassandra backend every tag is an extra write
    • in ES it's extra indexing time on the server
  • will the backend support correct server-side joins and LIMIT (broken with Cassandra today)
  • how search with multiple tags would be handled
    • in Cassandra it's an AND across different spans from the same service name (weird)
    • in ES it's an AND across tags from the same span only (index document is one span)
  • could the backend support latency aggregates out of the box (by service/endpoint)? This one is something I'd expect InfluxDB to be able to do easily, since it's fundamentally a TS db
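The difference between the two multi-tag search semantics described above can be sketched in a few lines. This is only an in-memory illustration: the span layout (`service`, `trace_id`, `tags`) and both function names are hypothetical, and neither function reflects how the Cassandra or ES backends are actually implemented.

```python
from typing import Dict, List, Set

# Hypothetical minimal span records, for illustration only.
SPANS: List[dict] = [
    # Trace t1: the two tags live on two different spans of the same service.
    {"trace_id": "t1", "service": "frontend", "tags": {"http.method": "GET"}},
    {"trace_id": "t1", "service": "frontend", "tags": {"error": "true"}},
    # Trace t2: one span carries both tags.
    {"trace_id": "t2", "service": "frontend",
     "tags": {"http.method": "GET", "error": "true"}},
]
WANTED: Dict[str, str] = {"http.method": "GET", "error": "true"}


def match_per_span(spans: List[dict], service: str, wanted: Dict[str, str]) -> Set[str]:
    """AND across tags of a single span (the ES-style behaviour)."""
    return {
        s["trace_id"]
        for s in spans
        if s["service"] == service
        and all(s["tags"].get(k) == v for k, v in wanted.items())
    }


def match_per_service(spans: List[dict], service: str, wanted: Dict[str, str]) -> Set[str]:
    """AND across spans of the same service (the Cassandra-style behaviour)."""
    satisfied: Dict[str, Set[str]] = {}  # trace_id -> tag keys matched so far
    for s in spans:
        if s["service"] != service:
            continue
        for k, v in wanted.items():
            if s["tags"].get(k) == v:
                satisfied.setdefault(s["trace_id"], set()).add(k)
    return {t for t, keys in satisfied.items() if keys == set(wanted)}
```

With this data the per-span search finds only `t2`, while the per-service search also returns `t1`, even though no single span in `t1` has both tags.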

cc @gianarb @goller

yurishkuro avatar Jul 15 '17 18:07 yurishkuro

From personal experience, I've easily written 6 million points per minute to a single node with no issue, using the recommended 5000 points per request. Batching is key, though: as your batches get smaller, write performance drops drastically.
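A minimal sketch of that batching, assuming the points are already encoded as line-protocol strings. The `batches` helper and the sample point layout are made up for illustration; the only number taken from the comment is the 5000-points-per-request batch size.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(points: Iterable[str], size: int = 5000) -> Iterator[str]:
    """Group line-protocol points into newline-joined payloads of at most `size` points."""
    it = iter(points)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        # One payload is meant to be the body of one write request.
        yield "\n".join(chunk)


# Hypothetical points: 12,000 spans with an integer duration field.
BASE_NS = 1_500_000_000_000_000_000
points = (f"span,service=s{i % 3} duration_ns={i}i {BASE_NS + i}" for i in range(12_000))
payloads = list(batches(points))
sizes = [len(p.split("\n")) for p in payloads]
```

Here `sizes` comes out as two full batches of 5000 and a final partial batch of 2000, so 12,000 points cost three requests instead of 12,000.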

In terms of arbitrary tags / log fields, they do not need to be predefined, however fields cannot have a mixed type, so once you set fieldA=int64, fieldA always has to be an int64.

For indexing, tags are always indexed, fields are never indexed. This means that cardinality of tags is a big issue since Influx creates an in-memory index for all tags (might be okay with their new TSI) and any query against a field looking for a specific value causes a scan of the data - this is usually okay since you're generally querying by time span, but something to keep in mind.

Aggregations can be easily implemented with their built-in aggregation functions and a GROUP BY on service and endpoint.
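As a sketch of what such an aggregate could look like, the helper below builds an InfluxQL query string using the built-in `MEAN` or `PERCENTILE` functions. The measurement `span`, the field `duration_ns`, and the tag keys `service` and `endpoint` are assumed names, not an existing schema.

```python
from typing import Optional


def latency_query(measurement: str = "span", field: str = "duration_ns",
                  window: str = "1h", percentile: Optional[int] = None) -> str:
    """Build an InfluxQL latency aggregate grouped by service and endpoint tags."""
    if percentile is None:
        agg = f'MEAN("{field}")'
    else:
        agg = f'PERCENTILE("{field}", {percentile})'
    return (
        f'SELECT {agg} FROM "{measurement}" '
        f"WHERE time > now() - {window} "
        f'GROUP BY "service", "endpoint"'
    )


mean_q = latency_query()
p95_q = latency_query(percentile=95)
```

Grouping by tags works here because tags are indexed; grouping by a field would not be possible.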

xjerod avatar Jul 16 '17 03:07 xjerod

Hi @yurishkuro, we'd like to contribute InfluxDB as a trace support backend. Currently, we are gaining experience writing spans with telegraf into InfluxDB running with the new TSI engine.

@jrbury is absolutely correct on all points. The TSI engine is built to handle much higher cardinality. Here is how we define cardinality: https://docs.influxdata.com/influxdb/v1.3/concepts/glossary/#series-cardinality

I believe that the trace id will dominate the cardinality.
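A small sketch of why that would be, counting series as unique measurement plus tag-set combinations per the glossary definition linked above. The point layout and the `series_cardinality` helper are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[str, Dict[str, str]]  # (measurement, tags)


def series_cardinality(points: Iterable[Point]) -> int:
    """Count unique (measurement, tag set) combinations."""
    return len({(m, frozenset(tags.items())) for m, tags in points})


# 1000 spans spread over 3 services and 2 operations.
without_trace_tag = [
    ("span", {"service": f"svc{i % 3}", "operation": f"op{i % 2}"})
    for i in range(1000)
]
# The same spans, but with a unique trace ID stored as a tag.
with_trace_tag = [
    ("span", {"service": f"svc{i % 3}", "operation": f"op{i % 2}", "trace_id": f"t{i}"})
    for i in range(1000)
]

low = series_cardinality(without_trace_tag)
high = series_cardinality(with_trace_tag)
```

Without the trace ID tag there are only 6 series (3 services times 2 operations); with it, every span creates its own series and the count grows with the number of traces.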

Regarding your other questions:

will the backend support correct server-side joins and LIMIT (broken with Cassandra today)

Influx does not have server-side joins per se, but it is able to group by any number of tags. Additionally, Influx has several meta queries using the SHOW keyword that return information about tag sets. The SELECT and SHOW queries both support LIMIT.

how search with multiple tags would be handled

Multiple tags can be handled with a WHERE clause. The WHERE clause would not need to be restricted to a single service name or a single span. I believe it should "just work."

could the backend support latency aggregates out of the box (by service/endpoint)? This one is something I'd expect InfluxDB to be able to do easily, since it's fundamentally a TS db

Yes, I believe this should be in our wheelhouse for sure.

So, what do you think of us trying our hand at implementing the store?

goller avatar Jul 27 '17 22:07 goller

As if I could try to stop you!

Seriously though, if you have the cycles and the desire to do this, then by all means. I recommend doing it in some other repo so that you don't have to go through our code reviews until you have a working proof of concept and run some integration and stress tests. Note that we have some integration tests that (in theory) should work across different storage backends - ./plugin/storage/integration/...

yurishkuro avatar Jul 27 '17 22:07 yurishkuro

@goller just saw this https://github.com/influxdata/jaeger. Just curious: why are you going after zipkin's nomenclature ("binary annotations" etc.) instead of OpenTracing, given that you're already operating on Jaeger's domain model? It seems like extra work. Note that Jaeger backend can both produce and consume Zipkin model if necessary.

yurishkuro avatar Aug 22 '17 02:08 yurishkuro

Probably a lack of experience on my part!

goller avatar Aug 22 '17 03:08 goller

@yurishkuro To better understand zipkin's model, we implemented a telegraf plugin here: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/zipkin

Our goal is to support OpenTracing for sure, but we figured we would first support zipkin's data model for storing into InfluxDB via telegraf. That way both jaeger and zipkin could read data from it.

Do you think it would be better for the collection of spans to be stored using the OpenTracing naming?

goller avatar Aug 22 '17 04:08 goller

Since Chris is new to all this, he should know the jury is out on whether there's a data model for opentracing. I'd be careful about pre-emptively labeling anything as such, as it might mislead people or clash with an actual spec.

https://github.com/opentracing/specification/issues/64

In other words, jaeger definitely wrote their model around naming inside OpenTracing, but that doesn't imply there's any official or stable means to do that. If you model based on jaeger, you are just modeling based on jaeger.

codefromthecrypt avatar Aug 22 '17 04:08 codefromthecrypt

@goller zipkin model does not support all features of OpenTracing, such as KV-logs and span references. Because of that, the transformation from Jaeger to Zipkin data model can be lossy. If you're implementing a Jaeger backend with InfluxDB, it seems to make more sense for that backend to use the Jaeger data model and not be lossy.

yurishkuro avatar Aug 22 '17 04:08 yurishkuro

@goller btw jaeger-collector can accept Zipkin spans in various formats at :9411/api/v1/spans. It converts them to Jaeger internal data model that SpanWriter/SpanReader are operating on.
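For illustration, a sketch of preparing such a request with a span in Zipkin's v1 JSON shape. Only the port and path come from the comment above; the host `localhost`, the sample span values, and the `zipkin_v1_request` helper are assumptions, and the request is built but not sent.

```python
import json
import urllib.request
from typing import List


def zipkin_v1_request(spans: List[dict], host: str = "localhost") -> urllib.request.Request:
    """Build (but do not send) a POST of Zipkin v1 JSON spans to the collector."""
    return urllib.request.Request(
        f"http://{host}:9411/api/v1/spans",
        data=json.dumps(spans).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical span; timestamps and duration are in microseconds.
endpoint = {"serviceName": "frontend", "ipv4": "127.0.0.1", "port": 8080}
span = {
    "traceId": "463ac35c9f6413ad",
    "id": "463ac35c9f6413ad",
    "name": "get /api",
    "timestamp": 1503360000000000,
    "duration": 1500,
    "annotations": [
        {"timestamp": 1503360000000000, "value": "sr", "endpoint": endpoint},
        {"timestamp": 1503360000001500, "value": "ss", "endpoint": endpoint},
    ],
    "binaryAnnotations": [
        {"key": "http.status_code", "value": "200", "endpoint": endpoint},
    ],
}
req = zipkin_v1_request([span])
# To actually send it against a running collector: urllib.request.urlopen(req)
```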

yurishkuro avatar Aug 22 '17 05:08 yurishkuro

I'm going to stop commenting on this issue. Suffice it to say, please do not conflate this work with Zipkin, as lossiness is a point-of-view and point-in-time thing. Yuri's perspective of things is just that. He doesn't represent zipkin.

codefromthecrypt avatar Aug 22 '17 05:08 codefromthecrypt

FYI active work on this issue: https://github.com/influxdata/jaeger/tree/influxdb

This branch currently works with InfluxDB 2.0 alpha, but I won't open a PR until we've used it ourselves for a while.

jacobmarble avatar Feb 15 '19 22:02 jacobmarble

The plugin framework issue: https://github.com/jaegertracing/jaeger/issues/422

yurishkuro avatar Feb 22 '19 16:02 yurishkuro

FYI we have moved our active work to a new repo, which uses the gRPC framework: https://github.com/influxdata/jaeger-influxdb

jacobmarble avatar May 10 '19 21:05 jacobmarble

@jacobmarble is the repo available? I got 404

juanpabloaj avatar May 10 '19 22:05 juanpabloaj

Should be public now.

jacobmarble avatar May 15 '19 04:05 jacobmarble

@jacobmarble Did https://github.com/influxdata/jaeger-influxdb get moved to another location? The link from the docs 404s

MattBoatman avatar May 27 '21 15:05 MattBoatman

Looks like the repo is available :-)

jpkrohling avatar Jun 03 '21 08:06 jpkrohling

@MattBoatman I'm not sure why you got a 404.

Relatedly, that repository will be archived in the next few months, as its replacement stabilizes. A new InfluxDB storage engine is in development, which handles traces much better than the current engine. This new Jaeger plugin is designed around a schema which is friendly to both OpenTelemetry and the new storage engine: https://github.com/influxdata/influxdb-observability/tree/main/jaeger-query-plugin

jacobmarble avatar Jun 04 '21 15:06 jacobmarble

@jacobmarble I emailed the influx team and they restored the repo ;) Good to know, I was just following the links from the jaeger docs

MattBoatman avatar Jun 04 '21 15:06 MattBoatman

There is a newer version of this which works with IOx, the new engine: https://github.com/influxdata/influxdb-observability/tree/main/jaeger-query-plugin. The older repo is only for v1 and v2 of InfluxDB.

jkowall avatar Nov 04 '22 15:11 jkowall

Last repo link, I promise: https://github.com/influxdata/influxdb-observability

More specifically: https://github.com/influxdata/influxdb-observability/tree/main/jaeger-influxdb

jacobmarble avatar Mar 31 '23 17:03 jacobmarble