Do not create ES indices too far into the past/future
Requirement - what kind of business use case are you trying to solve?
Jaeger performance should not degrade due to bugs in services reporting spans.
Problem - what in Jaeger blocks you from solving the requirement?
One of our services (due to some bug) reports spans with start timestamps far in the future (years ahead). This causes a lot of indices to be created in Elasticsearch for strange dates, because ES indices are created per day. For example:
$ curl $ESADDR/_cat/shards -s | wc -l
9986
$ curl $ESADDR/_cat/indices -s | head
green open jaeger-service-8160-12-02 E759tRv5TeyZD0cB2xubCw 48 0 1 0 9.4kb 9.4kb
green open jaeger-span-8154-09-01 tDhwyd9sQEqIgDYxmd8xXw 48 0 0 0 6kb 6kb
green open jaeger-span-8157-02-28 4Yam481RTF-jS7NpGKBS1g 48 0 0 0 6kb 6kb
green open jaeger-service-8154-03-05 YGICrbBkTNaFu-tiaoFYbw 48 0 1 0 9.1kb 9.1kb
green open jaeger-service-8148-12-15 97E695J7R3uyFSImODLx7g 48 1 1 0 19kb 9.5kb
green open jaeger-span-8151-11-09 U1owZ6ieQ6CrPMHlmcNc1g 48 0 0 0 6kb 6kb
green open jaeger-span-8160-02-18 FFVy6vQ1RamzyauHT_Njow 48 0 0 0 6kb 6kb
green open jaeger-span-8156-02-26 -hNsr6rtS6CCxAUH4HrJmw 48 0 0 0 6kb 6kb
green open jaeger-service-208917-08-31 Qx7Xm9jbQe-3Lf1ryv0paQ 48 0 1 0 9.1kb 9.1kb
green open jaeger-service-8163-10-27 g5mLfk9IQQCvYlCooWWqWQ 48 0 1 0 9.4kb 9.4kb
This impacts the ES cluster because, for example, each index's shards hold their own file handles. Additionally, our curator script does not remove those indices, as they are considered to be in the future (only past indices are removed).
Proposal - what do you suggest to solve the problem or improve the existing situation?
Restrict in the collector which timestamps are allowed and reject spans that are too old or too far into the future, e.g. not older than 14 days and at most 1 day into the future. Drop spans outside of this range or save them into the "current" index. The range could be configurable.
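For illustration, a minimal Go sketch of the kind of bounds check the collector could apply; the window values and the function name are made up for this example and do not correspond to an existing Jaeger option:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative window; in a real collector these values would come from
// configuration rather than constants.
const (
	maxPast   = 14 * 24 * time.Hour // reject spans older than 14 days
	maxFuture = 24 * time.Hour      // reject spans more than 1 day ahead
)

// withinAcceptedRange reports whether a span start time falls inside the
// accepted window around the collector's current time.
func withinAcceptedRange(start, now time.Time) bool {
	return !start.Before(now.Add(-maxPast)) && !start.After(now.Add(maxFuture))
}

func main() {
	now := time.Now()
	for _, start := range []time.Time{
		now,                           // normal span
		now.AddDate(6000, 0, 0),       // far-future timestamp, like the year-8160 indices above
		now.Add(-30 * 24 * time.Hour), // a month old
	} {
		fmt.Printf("year %d accepted: %v\n", start.Year(), withinAcceptedRange(start, now))
	}
}
```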
I think this is a good idea.
Internally, we've run into a problem where services were setting timestamps far enough in the future to cause overflows.
Drop spans outside of this range or save them into the "current" index.
I don't think dropping spans for incorrect timestamps is reasonable; instead, we could overwrite the timestamp with the ingestion time and log a warning on the span. (Ideally, we'd like users to be able to retrieve these spans as part of a trace even if the timestamps are invalid.) I'm not sure whether saving them into the current index accomplishes the same thing.
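A small sketch of the overwrite approach, assuming the span carries a start time and a list of warnings; the struct below is a stand-in for the real span model, not Jaeger's actual model.Span:

```go
package main

import (
	"fmt"
	"time"
)

// span stands in for the real Jaeger span model; the fields are illustrative.
type span struct {
	StartTime time.Time
	Warnings  []string
}

// adjustTimestamp replaces an out-of-range start time with the collector's
// ingestion time and records a warning on the span, so the span can still be
// retrieved as part of its trace instead of being dropped.
func adjustTimestamp(s *span, now time.Time, maxPast, maxFuture time.Duration) {
	if s.StartTime.Before(now.Add(-maxPast)) || s.StartTime.After(now.Add(maxFuture)) {
		s.Warnings = append(s.Warnings, fmt.Sprintf(
			"start time %s outside accepted range, replaced with ingestion time",
			s.StartTime.Format(time.RFC3339)))
		s.StartTime = now
	}
}

func main() {
	s := &span{StartTime: time.Date(8160, 12, 2, 0, 0, 0, 0, time.UTC)}
	adjustTimestamp(s, time.Now().UTC(), 14*24*time.Hour, 24*time.Hour)
	fmt.Println(s.StartTime.Format(time.RFC3339), s.Warnings)
}
```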
Overwriting sounds good.
It turns out that it isn't a bug in the service - I've added extra logging wrapped around the Sender and it didn't catch anything. I suspect that once in a while UDP packets sent to the agent are corrupted.
Based on the number of extra indices in ES, it happens a few times per 10^9 spans.
Just adding a note: this behavior is not present when using the rollover aliases flag --es.use-aliases, as it uses a single index to write data.
A solution to this problem would really help. Currently we run into trouble with Elasticsearch having too many indices.
I agree that we should rewrite timestamps if they are in the future, but ideally there would be a flag where the user can decide whether they want out-of-order spans (aka future spans) or a rewritten timestamp, similar to what Vector has: https://vector.dev/docs/reference/configuration/sinks/loki/#out_of_order_action
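A hedged sketch of what such a switch could look like on the collector side, in the spirit of Vector's out_of_order_action; the option name and its values are hypothetical, not an existing Jaeger flag:

```go
package main

import (
	"fmt"
	"time"
)

// invalidTimestampAction mirrors the kind of choice Vector exposes as
// out_of_order_action; the name and values are hypothetical.
type invalidTimestampAction string

const (
	actionDrop    invalidTimestampAction = "drop"              // discard the span
	actionRewrite invalidTimestampAction = "rewrite_timestamp" // replace with ingestion time
)

// handleInvalidSpan applies the configured action to a start time that fell
// outside the accepted window. It returns false if the span should be dropped.
func handleInvalidSpan(start *time.Time, now time.Time, action invalidTimestampAction) bool {
	switch action {
	case actionDrop:
		return false
	case actionRewrite:
		*start = now
		return true
	default:
		// Unknown setting: keep the span untouched rather than lose data.
		return true
	}
}

func main() {
	start := time.Date(8160, 12, 2, 0, 0, 0, 0, time.UTC)
	kept := handleInvalidSpan(&start, time.Now().UTC(), actionRewrite)
	fmt.Println("kept:", kept, "start:", start.Format(time.RFC3339))
}
```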