Cannot index OTel log fields such as `SeverityText` as labels
Describe the bug
When using the OTLP HTTP listener to ingest OTel logs, fields that are not log attributes, such as "SeverityText", cannot be indexed as labels, yet Grafana does identify the log level correctly.
To Reproduce
Steps to reproduce the behavior:
- Ingest logs via the OTLP HTTP endpoint
- The log record has no level attribute, but it does have SeverityText
- Configure:

otlp_config:
  resource_attributes:
    attributes_config:
      - action: index_label
        attributes:
          - level

which has no effect.
Expected behavior
An option to map OTel log fields to labels as well.

Environment:
- Infrastructure: Kubernetes
- Deployment tool: helm version 6.6.5
- Loki version 3.1.0
Screenshots, Promtail config, or terminal output
Example log object:
2024-07-09T08:27:57.681Z info ResourceLog #0
Resource SchemaURL:
Resource attributes:
-> telemetry.sdk.language: Str(python)
-> telemetry.sdk.name: Str(opentelemetry)
-> telemetry.sdk.version: Str(1.24.0)
-> service.name: Str(armoz-app)
-> telemetry.auto.version: Str(0.45b0)
ScopeLogs #0
ScopeLogs SchemaURL:
InstrumentationScope opentelemetry.sdk._logs._internal
LogRecord #0
ObservedTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-07-09 08:27:53.102534656 +0000 UTC
SeverityText: ERROR
SeverityNumber: Error(17)
Body: Str(Failed to export metrics to opentelemetry-collector.opentelemetry.svc.cluster.local:4317, error code: StatusCode.UNIMPLEMENTED)
Attributes:
-> otelSpanID: Str(0)
-> otelTraceID: Str(0)
-> otelTraceSampled: Bool(false)
-> otelServiceName: Str(armoz-app)
-> code.filepath: Str(/home/appuser/.local/lib/python3.10/site-packages/opentelemetry/exporter/otlp/proto/grpc/exporter.py)
-> code.function: Str(_export)
-> code.lineno: Int(306)
Trace ID:
Span ID:
Flags: 0
And in Grafana
I can confirm this as well if we take an example payload from OTel:
{
  "resource": {
    "attributes": {
      "service.name": "example-service",
      "service.namespace": "example-namespace",
      "service.instance.id": "instance-12345",
      "service.version": "1.0.0",
      "host.name": "example-host",
      "host.id": "host-12345",
      "cloud.provider": "aws",
      "cloud.region": "us-west-2"
    }
  },
  "instrumentationLibrary": {
    "name": "example-logger",
    "version": "0.1.0"
  },
  "logRecords": [
    {
      "timeUnixNano": "1625241600000000000",
      "severityText": "INFO",
      "severityNumber": 9,
      "name": "example-log",
      "body": {
        "stringValue": "This is an example log message"
      },
      "attributes": {
        "http.method": "GET",
        "http.url": "https://example.com/api/resource",
        "http.status_code": 200,
        "db.system": "mysql",
        "db.statement": "SELECT * FROM users WHERE id = ?"
      },
      "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
      "spanId": "00f067aa0ba902b7",
      "flags": 1,
      "droppedAttributesCount": 0
    }
  ]
}
We only allow users to promote logRecords data to structured metadata:
log_attributes:
  - action: structured_metadata
    attributes:
      - severityText
Changing the above to

log_attributes:
  - action: index_label
    attributes:
      - severityText

causes Loki to fail:
level=info ts=2024-07-09T09:05:33.245743339Z caller=loki.go:506 msg="Loki stopped" running_time=21m38.185866051s
level=error ts=2024-07-09T09:05:33.509294922Z caller=main.go:70 msg="validating config" err="CONFIG ERROR: invalid limits_config config: index_label action is only supported for resource_attributes"
We should allow users to promote log attributes like severity to labels, to provide better flexibility over what is indexed.
I'm also supporting this. I just recently completed a migration from ingesting JSON log files to native OTLP export from a service through Alloy to Loki's OTLP endpoint.
The inability to index based on log attributes was the only issue I encountered. It's prevented me from indexing a log attribute that would be very useful to have as a label.
I spent a good amount of time looking for an alternative fix but didn't find anything useful in the end. Due to the way OTLP messages are structured, there isn't a simple way to promote a log attribute to a resource attribute; otherwise, I would have tried this with Alloy or in the service itself. Given that index labels are a Loki concept, it makes sense that Loki should be where index labels can be selected from any part of the OTLP payload.
Thanks @omri-shilton for reaching out. Even if some old documentation suggested promoting the log severity field to a Loki label, it is no longer a best practice, as documented in Label best practices / Static labels are good:
From early on, we have set a label dynamically using Promtail pipelines for `level`. This seemed intuitive for us as we often wanted to only show logs for `level="error"`; however, we are re-evaluating this now as writing a query `{app="loki"} |= "level=error"` is proving to be just as fast for many of our applications as `{app="loki",level="error"}`.
Can you and all the users who are asking for this capability please explain to us your use case and why capturing the OpenTelemetry log field SeverityText as Loki structured metadata is a challenge for you? cc @JoelDavidLang
Note that with Loki metadata, the query will look like:
{service_name="frontend"} | severity_text="INFO"
@cyrille-leclerc maybe it's not the best of reasons, but when someone is writing this
{app="loki"}
it is very natural to try and write
{app="loki",level="error"}
Also, Grafana currently autocompletes the values when you write it this way.
But since level is no longer an index, you need to figure out that this totally different syntax is what works:
{app="loki"} |= "level=error"
One can ask why it has to be like this and not:
{level="error"} |= "app=loki"
Both are just attributes and values being matched; why one form works and the other doesn't is a mystery for people who just want to do some searching and do not want to learn the details of how Loki stores what, and where.
Related to discoverability: level is well known, but for other kinds of attributes you may not know how to spell the key, much less the value you are looking for. Autocomplete is very useful for beginner users.
Can you and all the users who are asking for this capability please explain to us your use case and why capturing the OpenTelemetry log field SeverityText as Loki structured metadata is a challenge for you? cc @JoelDavidLang
This isn't about SeverityText in my case. I have a kind attribute that categorizes log messages in a way relevant to the service and how users would want to consume the logs from Grafana. It makes complete sense to have this be an index label as there are only two kinds at the moment and if you're interested in one, you're unlikely to be interested in the other for a given query. Before I migrated to OTLP, this index label had noticeably improved the performance of both the Grafana dashboard and manual searches through the logs using Grafana's Explore feature.
Unfortunately, due to this OTLP ingestion limitation, this can no longer be an index label. Loki queries take longer than they would have before. It's a frustrating downgrade from what was otherwise a smooth transition to OTLP.
There is also a question of log retention. Can log retention rules even use structured metadata, or only index labels? The Loki documentation only says label matchers can be used. I have reason to want to change the retention time based on this kind attribute.
I've also considered customizing retention based on severity. I wouldn't want to have one index label value per severity level, so I'd probably create a severity_group attribute where debug and trace messages have one value, and all other messages have another value. This would let me further reduce retention of noisier debug and trace logs while more valuable info, warning, and error logs are retained longer. But there is no point anymore since I wouldn't be able to index based on this.
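For reference, per-stream retention in Loki is driven by label selectors, which is why none of this works unless the attribute ends up as an index label. A minimal sketch of the kind of setup I have in mind, assuming compactor-based retention is enabled (the label names, values, and periods are only illustrative):

limits_config:
  retention_period: 744h                      # default retention for everything else
  retention_stream:
    # hypothetical: age out noisy debug/trace logs sooner
    - selector: '{severity_group="noise"}'
      priority: 1
      period: 72h
    # hypothetical: keep one of the two "kind" values longer
    - selector: '{kind="audit"}'
      priority: 2
      period: 2160h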
I've investigated adding attributes from the log to the resource either in the application or in Grafana Alloy, but this doesn't seem possible due to how the OTLP data is structured.
Unfortunately, I'm having to consider the possibility of reverting my OTLP migration. I'd go back to writing JSON log files and processing them with Alloy. It's unnecessary work, and I'll lose the automatic structured metadata, but I'd have control of index labels again. I don't want to release the OTLP change into production and find that it's become a blocking issue.
Other good reasons to let users use log attributes as index labels:
- You can't create Grafana dashboard variables based on structured metadata. Most developers don't want to write LogQL queries; they just want to use log dashboards (see the example after this list).
- You can't use structured metadata with ad-hoc query variables in Grafana.
- It makes migration from non-OTLP pipelines easier because you don't have to change your Grafana dashboards.
- Most likely, people will try to find workarounds to get their index labels back.
- Setting log attributes as resource attributes as a workaround is not a good option, since it looks like the OTel batch processor assumes resource attributes are shared for optimizations (which makes sense). This is an issue because log severity is not shared among logs in the batch.
- Labels such as log level are low-cardinality labels. OTel attribute keys are arbitrary; please let users decide what is worth using as an index label.
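For example, a Grafana dashboard variable backed by a Loki variable query such as label_values({service_name="frontend"}, kind) can only enumerate index labels, which is what the first two bullets are getting at (the service_name selector and the kind label are only placeholders here). Since there is no equivalent variable query over structured metadata, without label promotion such a dropdown simply cannot be populated.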
Thanks, that's very helpful. cc @edspace
+1
+1
otlp_config:
  resource_attributes:
    attributes_config:
      - action: index_label
        regex: severity_text
I used this configuration and successfully set severity_text as an index label.
Then, I used the OTel Collector transform processor for preprocessing, setting a resource attribute from the log record's severity fields:
processors:
  transform:
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["severity.text"], severity_text)
          - set(resource.attributes["severity.text"], "TRACE") where severity_number == 1
          - set(resource.attributes["severity.text"], "DEBUG") where severity_number == 5
          - set(resource.attributes["severity.text"], "INFO") where severity_number == 9
          - set(resource.attributes["severity.text"], "WARN") where severity_number == 13
          - set(resource.attributes["severity.text"], "ERROR") where severity_number == 17
          - set(resource.attributes["severity.text"], "FATAL") where severity_number == 21
EDIT: don't rely on this, it's bugged; see below.
It also appears from my research that indexing anything from OTel log records as labels is pretty much impossible. In other words, resource attributes can be indexed, but logRecord attributes can't. Loki says index_label action is only supported for resource_attributes when I try that.
@tedmax100's solution is clever and minimalistic. I'm gonna share mine for simpler cases too (a regex is overkill for a simple string match).
In OTel:

transform:
  log_statements:
    - context: log
      statements:
        - set(resource.attributes["app.facility"], log.attributes["facility"])
pipelines:
  logs:
    processors:
      - transform
In Loki:

limits_config:
  otlp_config:
    resource_attributes:
      attributes_config:
        - action: index_label
          attributes:
            - app.facility
This is an elegant enough workaround and doesn't require too many resources.
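For completeness, here is roughly how the two halves fit together in a single collector config; the pipeline wiring is the standard OTLP-receiver/transform/otlphttp-exporter pattern, but the endpoint address, hostnames, and the facility attribute are assumptions for illustration:

receivers:
  otlp:
    protocols:
      http:

processors:
  transform:
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["app.facility"], log.attributes["facility"])

exporters:
  otlphttp:
    # assumption: Loki's OTLP endpoint is exposed at this address
    endpoint: http://loki-gateway:3100/otlp

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [transform]
      exporters: [otlphttp]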
@john8329 @tedmax100 Thanks for this. I wasn't aware that resource attributes could be set from the log context.
I read the Grafana Alloy documentation and came away thinking resource level attributes could only be read, not set, from the log context. I'm happy to see that was wrong.
Well, it appears that copying attributes from the log context to the resource doesn't always pick up the value from the correct log record. I'm seeing values I don't expect, so maybe we shouldn't rely on this method.
@john8329 The reason I thought this wasn't possible is because I examined the serialization format for OTel logs when I was looking at solutions. I saw it used a nested format where there was only one set of resource attributes that applied to all log messages contained within the message, rather than a copy of these resource attributes per log message.
If Alloy is not intelligently flattening resource attributes before processing, then it would make sense that the log transform setting resource attributes means they are applied to every log in the message regardless of the log condition.
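Roughly, the shape is something like this (a schematic sketch in YAML, not an actual OTLP payload):

resourceLogs:
  - resource:
      attributes:
        service.name: example-service      # one resource block...
    scopeLogs:
      - logRecords:
          - severityText: ERROR            # ...shared by this record
          - severityText: DEBUG            # ...and by this one

So a statement in the log context that writes to resource.attributes is writing into a block shared by all the records underneath it.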
In that case, doing this through Alloy is not feasible.
Check this out https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32080 I'm reading it now
@john8329 So it looks like I was correct about the issue, and a solution might come in the future, but it doesn't exist yet.
That said, the complexity of that issue highlights why this is something that should be solved in Loki, rather than in an OTel processor.
Ultimately, Loki is going to flatten the data on its own, and it would be much simpler for everyone involved if we could simply index based on log attributes as they come into Loki, as opposed to taking the potential perf hit of flattening and unflattening the data in a processor.
@JoelDavidLang you're totally right on this, and after some more research I can confirm this is a deliberate choice by Loki to keep indexes to a minimum (and perhaps to avoid abuse where many high-cardinality indexes might completely break the system?). Loki is designed to index just the resource, and the rest gets a linear scan. Which means I can't even extract the field's values to populate a dropdown menu, but oh well. It's good anyway.
tl;dr: Loki probably won't fix this; it's a design choice. Index the least, and search the rest linearly in parallel.
@john8329 I do hope this isn't a deliberate design choice.
There is no reason for Loki to assume that low cardinality attributes can't exist on individual log messages. There are many cases where they do, and it should be up to the user to determine whether or not Loki should index them.
It would also be incorrect for Loki to assume that users have control of resource attributes after an application has started, which applies in my case because I'm using .NET. It's not possible for me to set resource-level attributes per log message when using the OTelCol exporter.
Finally, as I mentioned before, Loki retention rules can only use index labels and not log metadata.
Unfortunately, because of this whole issue, I've decided not to use OTelCol in production. The inability to control index labeling in a way that suits our needs represents a risk that I'm not willing to take. This leaves me with two options:
- Continue to output JSON lines files and parse them with Alloy to deliver them using the normal Loki API.
- Create a new OTel exporter for .NET that uses the Loki API format.
I don't really like the idea of engineering a custom solution to this, so I might just stick with JSON.
@JoelDavidLang considering how Loki is designed to index only the resource labels and brute-force the search across all other attributes, contrary to, for example, Elasticsearch, which indexes everything, it sounds reasonable. I may be wrong though.
Indeed, the assumption that the user can control what goes in the OTel resource versus the log record is arbitrary and often wrong. This is where the collector should be able to transform messages, IMO. I can't do that either, because SDKs set the resource during initialization, and that's conceptually correct.
I've personally adapted to their way of organizing things, but I don't like it too much; I've heard that ingesting logs without using OTel can offer more flexibility for these cases.
Hey all, as a quick update on this issue, we have listened to your feedback and decided to unlock these attributes for label promotion. Here is the PR: https://github.com/grafana/loki/pull/16673
I suspect this will be in the next release.
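If the PR does what the earlier validation error hints at, a configuration along these lines, which used to be rejected, should now pass; this is only a sketch, not checked against the released config schema, and kind stands in for whatever low-cardinality log attribute you want to promote:

limits_config:
  otlp_config:
    log_attributes:
      - action: index_label
        attributes:
          - kind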
@Jayclifford345: v3.5.0 just dropped and while it doesn't list #16673 in its changelog, I am seeing changes from that PR in the code such as https://github.com/grafana/loki/blob/v3.5.0/pkg/loghttp/push/otlp_config.go#L46 Missed entry in changelog?
I've been looking forward to see if the new release solves this issue, but I might not have the chance to try out v3.5.0 for a couple weeks.
Hello all! Yes, this should now be available in v3.5. Please be mindful of which attributes you promote to labels in your Loki setup. Setting high-cardinality log attributes as labels eventually results in fragmented chunks that directly affect query performance.
I think SeverityText counts as a low-cardinality label. Let's upgrade to v3.5 and see how it performs.
I confirm the new dedicated setting works with Loki 3.5.0:

limits_config:
  otlp_config:
    severity_text_as_label: true
For some reason I couldn't get that setting to work in the runtime per-tenant overrides ("field severity_text_as_label not found in type push.OTLPConfig"), only in limits_config, but the global setting is good enough for my setup.
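With that in place, the severity can presumably be matched as a plain label, e.g. {service_name="frontend", severity_text="INFO"}, rather than via the structured-metadata filter {service_name="frontend"} | severity_text="INFO" mentioned earlier in the thread.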
Thanks @shantanualsi!