feat(loki sink): add support for structured metadata
See issue: https://github.com/vectordotdev/vector/issues/18876
Initial attempt on this, but seeking some feedback to wrap up.
How can we deal with the change in protocol, as structured metadata is part of a newer protocol and this PR overrides the current proto for Loki.
More precisely, see the first two commits of this PR for the proto changes: https://github.com/vectordotdev/vector/pull/20576/files/20285e7544fe57312f4d170dfda9e2e6b5672e31
How extensive do we need tests to be, do we want ({ labels, no labels, dynamic labels} x { structured metadata, no structured metadata, dynamic structured metadata}) or is the addition of tests that only cover the new code sufficient?
As far as the question around changes to the Protocol Buffers definition -- and I haven't looked at the changes yet -- I think it should be fine to use them as-is, so long as the changes are all backwards-compatible.
The addition of new fields to existing message types should be fine, as older servers should just ignore them, as far as I understand. I'm not sure how most gRPC server implementations handle requests for unknown gRPC services, though, if we're dealing with new RPC calls.
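To illustrate why adding fields is safe, here is a minimal sketch of the protobuf wire format's unknown-field behavior (hand-rolled and simplified, not Vector's or `prost`'s actual codec; field numbers and payloads are hypothetical):

```rust
// Minimal sketch: an "old" decoder that only knows field 1 simply skips
// any field number it does not recognize, which is why adding a new
// length-delimited field to an existing message is backward-compatible.

fn encode_field(field: u8, payload: &[u8], out: &mut Vec<u8>) {
    // key = (field_number << 3) | wire_type; wire type 2 = length-delimited
    out.push((field << 3) | 2);
    out.push(payload.len() as u8); // assumes payload < 128 bytes (1-byte varint)
    out.extend_from_slice(payload);
}

// "Old" decoder: only knows field 1 (`line`), skips everything else.
fn decode_old(mut buf: &[u8]) -> Option<String> {
    let mut line = None;
    while !buf.is_empty() {
        let key = buf[0];
        let len = buf[1] as usize;
        let (field, payload) = (key >> 3, &buf[2..2 + len]);
        if field == 1 {
            line = Some(String::from_utf8(payload.to_vec()).ok()?);
        } // any other field number: skipped without error
        buf = &buf[2 + len..];
    }
    line
}

fn main() {
    let mut msg = Vec::new();
    encode_field(1, b"log line", &mut msg);      // existing field
    encode_field(4, b"trace_id=abc", &mut msg);  // hypothetical new field
    // the old decoder still recovers field 1 and ignores field 4
    assert_eq!(decode_old(&msg).as_deref(), Some("log line"));
}
```

New RPC methods are the different case: a server that has never heard of the method will reject the call outright (typically with `UNIMPLEMENTED` in gRPC), rather than silently ignoring it.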
Hi, is there any update in this PR? I'm waiting for this feature
My bad, I had a lot of other work to attend to. Will pick this up Thursday.
Hi there, it's great to see the progress on this PR for adding support for structured metadata in the Loki sink! The addition of structured metadata is a significant enhancement that will greatly benefit users.
I understand there are some concerns regarding the changes in protocol and the extent of testing required. While the changes in Protocol Buffers definitions seem to be safe and backwards-compatible, it is crucial to ensure thorough testing to cover various scenarios like labels, structured metadata, and dynamic structured metadata.
Addressing the feedback provided by tobz regarding documentation and code cleanup will further improve the overall quality of the implementation. Making the necessary adjustments in the code and documentation as suggested will help in achieving a more cohesive and maintainable solution.
Given the importance of this feature and the work put into it, I strongly encourage the team to review and merge this PR as soon as possible. This feature is eagerly awaited by users and will undoubtedly enhance the capabilities of the Loki sink in Vector. Let's work together to finalize the remaining tasks and get this valuable enhancement merged promptly. Thank you for your hard work on this PR!
@tobz thanks a ton for the feedback. I think I processed everything, but this is my first real Rust contribution, so apologies if it's messy here and there.
I kept the implementation as similar to the labels implementation as I could, but that probably breaks any DRY criteria since it's basically the same code repeated. Would you prefer I refactor these into a single function, e.g. convertTemplatedMaptoVec, and call that from the respective functions?
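For reference, such a shared helper might look like this (a hedged sketch only: `convert_templated_map_to_vec` is a hypothetical name, and plain `String`s stand in for Vector's `Template` type):

```rust
use std::collections::HashMap;

// Hypothetical shared helper: both the `labels` and `structured_metadata`
// paths could render their key/value maps through one function instead of
// duplicating the same loop in each place.
fn convert_templated_map_to_vec(map: &HashMap<String, String>) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = map
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    out.sort(); // deterministic order for stable encoding and tests
    out
}

fn main() {
    let mut labels = HashMap::new();
    labels.insert("env".to_string(), "prod".to_string());
    labels.insert("app".to_string(), "vector".to_string());
    let rendered = convert_templated_map_to_vec(&labels);
    assert_eq!(
        rendered,
        vec![
            ("app".to_string(), "vector".to_string()),
            ("env".to_string(), "prod".to_string()),
        ]
    );
}
```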
@tobz I'll pick up the things that turn up from the CI tomorrow morning, if there's anything else lmk!
With regards to some of the errors, does it help if I do a back merge / rebase?
@tobz any news here, please lmk if I can do anything!
@maxboone Apologies. I swore I had previously approved this, but it seems not!
LGTM overall, but it still needs a changelog entry. You can see the README for the changelog directory -- https://github.com/vectordotdev/vector/tree/master/changelog.d -- which explains what is required.
@tobz Will do, thanks a ton for the review!
@tobz added the changelog entry, should be good to go :)
@pront @tobz I see some CI failing, is there something I can change / update (or extend the tests and run integration tests on my end)?
I haven't tested it against a local instance of Loki recently, if I need to set that up to ensure it works properly lmk!
It looks like the integration tests are failing to compile. You should be able to get the same errors locally by running `make check-clippy`. You can also run the integration tests locally via `make test-integration-loki`.
Will do, will ping again if something pops up :)
I see some messages popping up from running clippy and the integration tests here. I really thought I had fixed those, but reality seems different. I'll fix this in the morning (CEST) and poke again.
It's mostly just the protobuf body that has changed and thus the unit test failing.
np! Happy to review again afterwards.
@jszwedko integration tests ran correctly with the latest changes. I mainly added `#[serde(default)]` to `structured_metadata` in the `LokiConfig`, as the tests were failing only on the configuration, and I'd like to avoid introducing a breaking change for existing configurations.
Anyways, as far as I can replicate, the CI should succeed now!
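Concretely, making the field default-able means existing configs that never mention it keep deserializing. A sketch of both shapes (the exact option syntax is an assumption here, mirroring how `labels` is configured; abbreviated, omitting unrelated required options):

```toml
# Existing config: no `structured_metadata` key, still valid because the
# field is optional (#[serde(default)]).
[sinks.loki]
type = "loki"
endpoint = "http://localhost:3100"
labels.app = "vector"

# Opting in to the new feature (syntax assumed to mirror `labels`):
[sinks.loki_with_metadata]
type = "loki"
endpoint = "http://localhost:3100"
labels.app = "vector"
structured_metadata.trace_id = "{{ trace_id }}"
```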
Great, thanks @maxboone ! Let me trigger CI here.
👍
@maxboone I think it just needs a docs regeneration using `make generate-component-docs` so that the tests pass
good call! will do in a minute
@maxboone Could you run `make generate-component-docs`?
I was in the area so I regenerated them.
Yeah, unfortunately WSL2 decided to drop out that afternoon after a non-downgradeable update. But it's working again - thanks @jszwedko for generating the docs <3
Regression Detector Results
Run ID: 0fe18c1f-c08e-4ab0-850a-53ffad0cbaad Metrics dashboard
Baseline: 06ad674ff4fbf2df82c50b787c45316c5b5c8d50 Comparison: d174d55fadaaee8e8a55223f43a6000e2814382d
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
No significant changes in experiment optimization goals
Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%
There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
Experiments ignored for regressions
Regressions in experiments with settings containing erratic: true are ignored.
| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|---|---|---|---|---|---|
| ➖ | file_to_blackhole | egress throughput | +1.09 | [-5.81, +7.98] | |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|---|---|---|---|---|---|
| ➖ | socket_to_socket_blackhole | ingress throughput | +3.15 | [+3.07, +3.24] | |
| ➖ | otlp_http_to_blackhole | ingress throughput | +2.21 | [+2.09, +2.33] | |
| ➖ | syslog_log2metric_tag_cardinality_limit_blackhole | ingress throughput | +1.97 | [+1.88, +2.06] | |
| ➖ | syslog_loki | ingress throughput | +1.16 | [+1.08, +1.24] | |
| ➖ | datadog_agent_remap_blackhole_acks | ingress throughput | +1.11 | [+0.99, +1.23] | |
| ➖ | file_to_blackhole | egress throughput | +1.09 | [-5.81, +7.98] | |
| ➖ | http_elasticsearch | ingress throughput | +0.94 | [+0.76, +1.12] | |
| ➖ | syslog_regex_logs2metric_ddmetrics | ingress throughput | +0.93 | [+0.77, +1.09] | |
| ➖ | fluent_elasticsearch | ingress throughput | +0.86 | [+0.37, +1.36] | |
| ➖ | datadog_agent_remap_blackhole | ingress throughput | +0.62 | [+0.53, +0.70] | |
| ➖ | syslog_log2metric_humio_metrics | ingress throughput | +0.60 | [+0.46, +0.73] | |
| ➖ | syslog_humio_logs | ingress throughput | +0.34 | [+0.23, +0.45] | |
| ➖ | syslog_log2metric_splunk_hec_metrics | ingress throughput | +0.22 | [+0.12, +0.31] | |
| ➖ | otlp_grpc_to_blackhole | ingress throughput | +0.21 | [+0.10, +0.32] | |
| ➖ | http_to_http_noack | ingress throughput | +0.19 | [+0.10, +0.29] | |
| ➖ | http_to_http_json | ingress throughput | +0.03 | [-0.01, +0.07] | |
| ➖ | splunk_hec_to_splunk_hec_logs_acks | ingress throughput | +0.02 | [-0.08, +0.12] | |
| ➖ | splunk_hec_to_splunk_hec_logs_noack | ingress throughput | +0.01 | [-0.08, +0.11] | |
| ➖ | splunk_hec_indexer_ack_blackhole | ingress throughput | +0.01 | [-0.07, +0.08] | |
| ➖ | datadog_agent_remap_datadog_logs_acks | ingress throughput | -0.04 | [-0.20, +0.12] | |
| ➖ | http_to_s3 | ingress throughput | -0.08 | [-0.36, +0.19] | |
| ➖ | datadog_agent_remap_datadog_logs | ingress throughput | -0.49 | [-0.71, -0.28] | |
| ➖ | http_to_http_acks | ingress throughput | -0.68 | [-1.95, +0.58] | |
| ➖ | splunk_hec_route_s3 | ingress throughput | -1.12 | [-1.42, -0.81] | |
| ➖ | http_text_to_http_json | ingress throughput | -1.25 | [-1.37, -1.14] | |
| ➖ | syslog_splunk_hec_logs | ingress throughput | -2.42 | [-2.57, -2.27] | |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
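The three criteria above combine into a simple predicate. An illustrative sketch (not the detector's actual implementation; thresholds hard-coded to the 5.00% / 90% values quoted in this report):

```rust
// Illustrative regression predicate: a change is flagged only if the
// effect size is large enough, the confidence interval excludes zero,
// and the experiment is not marked erratic.
fn is_regression(delta_mean_pct: f64, ci: (f64, f64), erratic: bool) -> bool {
    let big_enough = delta_mean_pct.abs() >= 5.00;   // |Δ mean %| ≥ 5.00%
    let ci_excludes_zero = ci.0 > 0.0 || ci.1 < 0.0; // 90% CI does not contain 0
    big_enough && ci_excludes_zero && !erratic
}

fn main() {
    // file_to_blackhole from the table above: +1.09, CI [-5.81, +7.98]
    assert!(!is_regression(1.09, (-5.81, 7.98), false));
    // a hypothetical -6.2% change with CI [-7.1, -5.3] would be flagged...
    assert!(is_regression(-6.2, (-7.1, -5.3), false));
    // ...unless the experiment's settings contain `erratic: true`
    assert!(!is_regression(-6.2, (-7.1, -5.3), true));
}
```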