Do not create ES indices too far into the past/future
Requirement - what kind of business use case are you trying to solve?
Jaeger performance should not degrade due to bugs in services reporting spans.
Problem - what in Jaeger blocks you from solving the requirement?
One of our services (due to some bug) reports spans with start timestamps far in the future (years ahead). This causes a lot of indices to be created in Elasticsearch for strange dates, because ES indices are created per day. For example:
$ curl $ESADDR/_cat/shards -s | wc -l
9986
$ curl $ESADDR/_cat/indices -s | head
green open jaeger-service-8160-12-02 E759tRv5TeyZD0cB2xubCw 48 0 1 0 9.4kb 9.4kb
green open jaeger-span-8154-09-01 tDhwyd9sQEqIgDYxmd8xXw 48 0 0 0 6kb 6kb
green open jaeger-span-8157-02-28 4Yam481RTF-jS7NpGKBS1g 48 0 0 0 6kb 6kb
green open jaeger-service-8154-03-05 YGICrbBkTNaFu-tiaoFYbw 48 0 1 0 9.1kb 9.1kb
green open jaeger-service-8148-12-15 97E695J7R3uyFSImODLx7g 48 1 1 0 19kb 9.5kb
green open jaeger-span-8151-11-09 U1owZ6ieQ6CrPMHlmcNc1g 48 0 0 0 6kb 6kb
green open jaeger-span-8160-02-18 FFVy6vQ1RamzyauHT_Njow 48 0 0 0 6kb 6kb
green open jaeger-span-8156-02-26 -hNsr6rtS6CCxAUH4HrJmw 48 0 0 0 6kb 6kb
green open jaeger-service-208917-08-31 Qx7Xm9jbQe-3Lf1ryv0paQ 48 0 1 0 9.1kb 9.1kb
green open jaeger-service-8163-10-27 g5mLfk9IQQCvYlCooWWqWQ 48 0 1 0 9.4kb 9.4kb
This impacts the ES cluster because, for example, each index's shards hold their own file handles. Additionally, our curator script does not remove those indices, as they are considered to be in the future (only past indices are removed).
Proposal - what do you suggest to solve the problem or improve the existing situation?
Restrict in the collector which timestamps are allowed and reject spans that are too old or too far into the future, e.g. not older than 14 days and at most 1 day into the future. Drop spans outside of this range or save them into the "current" index. The range could be configurable.
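For illustration, a minimal Go sketch of the kind of bounds check the collector could apply; the window values and the function name are made up for this example and do not correspond to an existing Jaeger option:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative window; in a real collector these values would come from
// configuration rather than constants.
const (
	maxPast   = 14 * 24 * time.Hour // reject spans older than 14 days
	maxFuture = 24 * time.Hour      // reject spans more than 1 day ahead
)

// withinAcceptedRange reports whether a span start time falls inside the
// accepted window around the collector's current time.
func withinAcceptedRange(start, now time.Time) bool {
	return !start.Before(now.Add(-maxPast)) && !start.After(now.Add(maxFuture))
}

func main() {
	now := time.Now()
	for _, start := range []time.Time{
		now,                           // normal span
		now.AddDate(6000, 0, 0),       // far-future timestamp, like the year-8160 indices above
		now.Add(-30 * 24 * time.Hour), // a month old
	} {
		fmt.Printf("year %d accepted: %v\n", start.Year(), withinAcceptedRange(start, now))
	}
}
```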
I think this is a good idea.
Internally, we've run into a problem where services were setting timestamps far enough in the future to cause overflows.
Drop spans outside of this range or save them into the "current" index.
I don't think dropping spans for incorrect timestamps is reasonable; instead, we could overwrite the timestamp with the ingestion time and log a warning on the span. (Ideally, we'd like users to be able to retrieve these spans as part of a trace even if the timestamps are invalid.) I'm not sure whether saving them into the current index accomplishes the same thing.
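A small sketch of the overwrite approach, assuming the span carries a start time and a list of warnings; the struct below is a stand-in for the real span model, not Jaeger's actual model.Span:

```go
package main

import (
	"fmt"
	"time"
)

// span stands in for the real Jaeger span model; the fields are illustrative.
type span struct {
	StartTime time.Time
	Warnings  []string
}

// adjustTimestamp replaces an out-of-range start time with the collector's
// ingestion time and records a warning on the span, so the span can still be
// retrieved as part of its trace instead of being dropped.
func adjustTimestamp(s *span, now time.Time, maxPast, maxFuture time.Duration) {
	if s.StartTime.Before(now.Add(-maxPast)) || s.StartTime.After(now.Add(maxFuture)) {
		s.Warnings = append(s.Warnings, fmt.Sprintf(
			"start time %s outside accepted range, replaced with ingestion time",
			s.StartTime.Format(time.RFC3339)))
		s.StartTime = now
	}
}

func main() {
	s := &span{StartTime: time.Date(8160, 12, 2, 0, 0, 0, 0, time.UTC)}
	adjustTimestamp(s, time.Now().UTC(), 14*24*time.Hour, 24*time.Hour)
	fmt.Println(s.StartTime.Format(time.RFC3339), s.Warnings)
}
```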
Overwriting sounds good.
It turns out that it isn't a bug in the service - I've added extra logging wrapped around the Sender and it didn't catch anything. I suspect that once in a while UDP packets sent to the agent are corrupted.
Based on the number of extra indices in ES, it happens a few times per 10^9 spans.
Just adding a note: this behavior is not present when using the rollover aliases flag --es.use-aliases, as it uses a single index to write data.
A solution to this problem would really help. Currently we run into trouble with Elasticsearch having too many indices.
I agree that we should rewrite timestamps if they are in the future, but ideally there would be a flag where the user can decide whether they want out-of-order spans (aka future spans) or a rewritten timestamp, similar to what Vector has: https://vector.dev/docs/reference/configuration/sinks/loki/#out_of_order_action
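A hedged sketch of what such a switch could look like on the collector side, in the spirit of Vector's out_of_order_action; the option name and its values are hypothetical, not an existing Jaeger flag:

```go
package main

import (
	"fmt"
	"time"
)

// invalidTimestampAction mirrors the kind of choice Vector exposes as
// out_of_order_action; the name and values are hypothetical.
type invalidTimestampAction string

const (
	actionDrop    invalidTimestampAction = "drop"              // discard the span
	actionRewrite invalidTimestampAction = "rewrite_timestamp" // replace with ingestion time
)

// handleInvalidSpan applies the configured action to a start time that fell
// outside the accepted window. It returns false if the span should be dropped.
func handleInvalidSpan(start *time.Time, now time.Time, action invalidTimestampAction) bool {
	switch action {
	case actionDrop:
		return false
	case actionRewrite:
		*start = now
		return true
	default:
		// Unknown setting: keep the span untouched rather than lose data.
		return true
	}
}

func main() {
	start := time.Date(8160, 12, 2, 0, 0, 0, 0, time.UTC)
	kept := handleInvalidSpan(&start, time.Now().UTC(), actionRewrite)
	fmt.Println("kept:", kept, "start:", start.Format(time.RFC3339))
}
```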