[BUG] Tracing index is not re-created in OpenSearch. Does Data Prepper need a restart?
Describe the bug When events are sent to OpenSearch, the index is usually created if it doesn't exist. This happens for all data except when Data Prepper receives traces. When Data Prepper starts up, it creates the necessary tracing indices for spans and service maps once, but never again unless it is restarted.
If the index is removed while Data Prepper is running, an "index is missing" error appears very frequently, possibly filling up the buffer and eventually causing packet drops.
To Reproduce Send traces to Data Prepper as usual. You will see the trace index on the index management page of the OpenSearch Dashboards GUI.
However, if you delete the index, it is never recreated, even if new traces keep arriving at Data Prepper. Only restarting Data Prepper seems to recreate the index. This could presumably be fixed so that indices are recreated whenever they don't exist in OpenSearch.
Expected behavior 1: the span index should be re-created if it doesn't exist when new events arrive at Data Prepper (and are sent to OpenSearch).
2: the service map index should be re-created if it doesn't exist.
Environment (please complete the following information): I tested this with Data Prepper on Kubernetes using the Data Prepper Helm chart.
Additional context I tested this using the OTel demo apps. The behavior is consistent across all their traces: if the index is removed, it is never re-created unless Data Prepper is restarted. Neither the service map index nor the span index gets recreated.
This might be related to #3342 and maybe #3506. The index setup used for spans is a little complicated: it usually uses a write alias that points to a concrete span index.
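To see this alias setup for yourself, you can query OpenSearch's cat aliases API. This is a sketch assuming an OpenSearch endpoint at `localhost:9200`; the output shown is illustrative, not captured from this cluster.

```sh
# Show which concrete index the span write alias currently points to.
curl "http://localhost:9200/_cat/aliases/otel-v1-apm-span?v"
# alias             index                    ... is_write_index
# otel-v1-apm-span  otel-v1-apm-span-000001  ... true
```

If this returns no rows after the index deletion, the alias itself was removed along with its only backing index, which matters for the question below.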
@AdaptiveStep can you elaborate on your setup? Do you use the default index configuration, or do you provide a custom one? When you delete the current index, do you keep the write alias, if you have one? Can you provide the Data Prepper error log that contains the "index missing" message?
About the alias: I didn't touch the alias, only removed the index.
Log message: I think it said that the index is missing. I'll reproduce the bug again later when I have time and paste the exact log message here.
My config:
- kind cluster (v0.23.0)
- OpenSearch started with the operator (v2.16, latest)
- Data Prepper started with the Helm chart (simple deployment, 1 replica; chart v0.1.0, latest)
- Data Prepper configured according to the documentation (otel_metrics_source + otel_traces_source + otel_logs_source). Basic, vanilla OTel config.
gRPC is sent from the OpenTelemetry Collector pod to the Data Prepper pod.
Just normal, basic OTel stuff. Basically everything is default, latest version as of writing, and everything works (except that one thing).
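For reference, the trace path of such a default setup roughly corresponds to the pipeline sketch below. This is a hedged reconstruction from the Data Prepper documentation, not the exact config from this cluster: the host endpoint is illustrative, and plugin names (`otel_traces`, `service_map`) vary between Data Prepper versions.

```yaml
# Hypothetical minimal pipelines.yaml for the trace path.
otel-trace-pipeline:
  source:
    otel_trace_source:
      ssl: false
  sink:
    - pipeline:
        name: "raw-trace-pipeline"
    - pipeline:
        name: "service-map-pipeline"

raw-trace-pipeline:
  source:
    pipeline:
      name: "otel-trace-pipeline"
  processor:
    - otel_traces:
  sink:
    - opensearch:
        hosts: ["https://opensearch-cluster:9200"]
        # Writes spans through the otel-v1-apm-span write alias.
        index_type: trace-analytics-raw

service-map-pipeline:
  source:
    pipeline:
      name: "otel-trace-pipeline"
  processor:
    - service_map:
  sink:
    - opensearch:
        hosts: ["https://opensearch-cluster:9200"]
        index_type: trace-analytics-service-map
```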
Everything works, and if you go into the OpenSearch GUI you will see the "otel-v1-apm-span-000001" index. Delete this index and it will never be recreated; only restarting Data Prepper brings it back.
The service map index seems buggy too if the "otel-v1-apm-span-000001" index gets removed. If both are removed, neither of them comes back. This might explain why the rollover for that other person didn't work.
If you remove the metrics index, it gets recreated. If you remove the logs index, it gets recreated.
My investigation so far: How come it can re-create the logs and metrics indices but not the spans? It makes no sense. (Jaeger and Prometheus successfully received the same spans, so the traces are good!) Also, I completely failed to send trace data directly from the OTel Collector to OpenSearch, which was strange too. Maybe the errors are at the OpenSearch level? It cannot be the OTel Collector, because it cooperates well with other apps. Has anyone managed to use the OpenTelemetry Collector with OpenSearch directly?

A final OTel irritation is that sometimes the service maps are sent via metrics (this is an industry standard within the Grafana stack using Tempo). That service-map problem is, however, a separate issue.

The tracing-index-recreation failure is a serious risk of severe long-term data loss if someone accidentally removes this single index. It might even be a severe security issue if XDR and other agents rely on OpenSearch data: alerts and anomaly detections that depend on this index will not be triggered unless Data Prepper is restarted! An attacker then only has to remove this index to disable the entire security pipeline and hope nobody restarts Data Prepper. I have not tested the Data Prepper OTel features with higher replica counts.
Summary: Steps to reproduce the bug: just send traces to OpenSearch and try removing the span index via the GUI. The index never gets re-created.
The difference between OTEL logs/metrics and traces comes from the index setup as mentioned by @KarstenSchnitter.
- Logs/metrics: Data Prepper ingests into an index. If the index does not exist, it is simply created (due to the behavior of the OpenSearch bulk API).
- Trace spans: Initially, Data Prepper creates the `otel-v1-apm-span-000001` index and maps it to an index alias `otel-v1-apm-span`. When data is ingested, Data Prepper ingests into the index alias, which points to the underlying index. If all `otel-v1-apm-span-*` indices (or maybe just the current write index) get deleted, the alias can no longer be resolved to an index when data is ingested. I assume that custom logic would be needed in order to recreate the current write index, since it depends on which `otel-v1-apm-span-*` indices already exist inside the cluster.
- Trace service map: Since this is only a single index (without an alias), recreation during runtime would probably be easy to achieve.
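The asymmetry described above can be sketched with a toy model (not OpenSearch client code): writes targeting a plain index auto-create it, while writes targeting an alias only follow the pointer and fail when the backing index is gone. The class and names below are purely illustrative.

```python
class ToyCluster:
    """Toy stand-in for the relevant part of the bulk API behavior."""

    def __init__(self):
        self.indices = {}   # index name -> list of docs
        self.aliases = {}   # alias name -> backing index name

    def write(self, target, doc):
        # Resolve aliases first, mirroring how a write target is handled.
        if target in self.aliases:
            backing = self.aliases[target]
            if backing not in self.indices:
                # A dangling alias: the write is rejected instead of a
                # new backing index being invented ("index missing").
                raise KeyError(f"index missing behind alias {target!r}")
            self.indices[backing].append(doc)
        else:
            # Plain index target: auto-created on first write.
            self.indices.setdefault(target, []).append(doc)


cluster = ToyCluster()

# Logs/metrics path: direct index writes survive deletion.
cluster.write("otel-logs", {"msg": "a"})
del cluster.indices["otel-logs"]
cluster.write("otel-logs", {"msg": "b"})      # silently recreated

# Span path: index + alias were created once at startup.
cluster.indices["otel-v1-apm-span-000001"] = []
cluster.aliases["otel-v1-apm-span"] = "otel-v1-apm-span-000001"
cluster.write("otel-v1-apm-span", {"span": 1})

del cluster.indices["otel-v1-apm-span-000001"]
try:
    cluster.write("otel-v1-apm-span", {"span": 2})
except KeyError as e:
    print("write failed:", e)
```

This also shows why the fix is non-trivial: recreating the right `otel-v1-apm-span-NNNNNN` index requires knowing the rollover state, not just the alias name.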
The question is: why are you deleting the current write index (`otel-v1-apm-span-XXXXXX`)?
If you want to delete the data in this index, perform a manual index rollover, after which you can safely delete the original index.
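The rollover-then-delete sequence can be done with the OpenSearch rollover API. This is a sketch assuming an endpoint at `localhost:9200` and the default alias and index names from this thread.

```sh
# Roll the alias over so it points at a fresh write index first ...
curl -XPOST "http://localhost:9200/otel-v1-apm-span/_rollover"

# ... then the old index can be deleted without breaking the alias.
curl -XDELETE "http://localhost:9200/otel-v1-apm-span-000001"
```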
There is ongoing work to move towards the index alias/rollover approach for logs/metrics as well with https://github.com/opensearch-project/data-prepper/pull/3929.