vector icon indicating copy to clipboard operation
vector copied to clipboard

feat(loki sink): add support for structured metadata

Open maxboone opened this issue 1 year ago • 20 comments
trafficstars

See issue: https://github.com/vectordotdev/vector/issues/18876

Initial attempt on this, but seeking some feedback to wrap up.

How can we deal with the change in protocol, as structured metadata is part of a newer protocol and this PR overrides the current proto for Loki.

More precisely, see the first two commits of this PR for the proto changes: https://github.com/vectordotdev/vector/pull/20576/files/20285e7544fe57312f4d170dfda9e2e6b5672e31

How extensive do we need tests to be, do we want ({ labels, no labels, dynamic labels} x { structured metadata, no structured metadata, dynamic structured metadata}) or is the addition of tests that only cover the new code sufficient?

maxboone avatar May 29 '24 14:05 maxboone

CLA assistant check
All committers have signed the CLA.

bits-bot avatar May 29 '24 14:05 bits-bot

As far as the question around changes to the Protocol Buffers definition -- and I haven't look at the changes yet -- I think it should be fine to use them as-is, so long as the changes as all backwards-compatible.

The addition of new fields to existing message types should be fine, as older servers should just ignore them, as far as I understand. I'm not sure how most gRPC server implementations handle requests for unknown gRPC services, though, if we're dealing with new RPC calls.

tobz avatar May 31 '24 14:05 tobz

Hi, is there any update in this PR? I'm waiting for this feature

kimxogus avatar Jul 10 '24 04:07 kimxogus

Hi, is there any update in this PR? I'm waiting for this feature

My bad, I had a lot of other work to attend to. Will pick this up Thursday.

maxboone avatar Jul 10 '24 06:07 maxboone

Hi there, it's great to see the progress on this PR for adding support for structured metadata in the Loki sink! The addition of structured metadata is a significant enhancement that will greatly benefit users.

I understand there are some concerns regarding the changes in protocol and the extent of testing required. While the changes in Protocol Buffers definitions seem to be safe and backwards-compatible, it is crucial to ensure thorough testing to cover various scenarios like labels, structured metadata, and dynamic structured metadata.

Addressing the feedback provided by tobz regarding documentation and code cleanup will further improve the overall quality of the implementation. Making the necessary adjustments in the code and documentation as suggested will help in achieving a more cohesive and maintainable solution.

Given the importance of this feature and the work put into it, I strongly encourage the team to review and merge this PR as soon as possible. This feature is eagerly awaited by users and will undoubtedly enhance the capabilities of the Loki sink in Vector. Let's work together to finalize the remaining tasks and get this valuable enhancement merged promptly. Thank you for your hard work on this PR!

devkoriel avatar Jul 22 '24 04:07 devkoriel

@tobz thanks a ton for the feedback, I think I processed everything but this is a first real rust contribution for me so apologies for it being messy here and there.

I kept the implementation as similar to the labels as I saw possible but that probably breaks any DRY criteria as it's basically the same code done over. Is it preferred if I refactor these into a single function e.g. convertTemplatedMaptoVec and call that from the respective functions?

maxboone avatar Jul 22 '24 07:07 maxboone

@tobz I'll pick up the things that turn up from the CI tomorrow morning, if there's anything else lmk!

With regards to some of the errors, does it help if I do a back merge / rebase?

maxboone avatar Aug 01 '24 17:08 maxboone

@tobz any news here, please lmk if I can do anything!

maxboone avatar Aug 15 '24 14:08 maxboone

@maxboone Apologies. I swore I had previously approved this, but it seems not!

LGTM overall, but it still needs a changelog entry. You can see the README for the changelog directory -- https://github.com/vectordotdev/vector/tree/master/changelog.d -- which explains what is required.

tobz avatar Aug 15 '24 14:08 tobz

LGTM overall, but it still needs a changelog entry.

@tobz Will do, thanks a ton for the review!

maxboone avatar Aug 15 '24 14:08 maxboone

@tobz added the changelog entry, should be good to go :)

maxboone avatar Aug 16 '24 05:08 maxboone

@pront @tobz I see some CI failing, is there something I can change / update (or extend the tests and run integration tests on my end)?

I haven't tested it against a local instance of Loki recently, if I need to set that up to ensure it works properly lmk!

maxboone avatar Aug 16 '24 18:08 maxboone

@pront @tobz I see some CI failing, is there something I can change / update (or extend the tests and run integration tests on my end)?

I haven't tested it against a local instance of Loki recently, if I need to set that up to ensure it works properly lmk!

It looks like the integration tests are failing to compile. You should be able to get the same errors locally by running make check-clippy. You can also run the integration tests locally via make test-integration-loki.

jszwedko avatar Aug 16 '24 18:08 jszwedko

It looks like the integration tests are failing to compile. You should be able to get the same errors locally by running make check-clippy. You can also run the integration tests locally via make test-integration-loki.

Will do, will ping again if something pops up :)

maxboone avatar Aug 16 '24 19:08 maxboone

I see some messages popping up from running clippy and the integration tests here. I really thought I had fixed those but reality seems different. I'll fix this in the morning (CEST) and poke again.

It's mostly just the protobuf body that has changed and thus the unit test failing.

maxboone avatar Aug 16 '24 21:08 maxboone

I see some messages popping up from running clippy and the integration tests here. I really thought I had fixed those but reality seems different. I'll fix this in the morning (CEST) and poke again.

It's mostly just the protobuf body that has changed and thus the unit test failing.

np! Happy to review again afterwards.

jszwedko avatar Aug 16 '24 22:08 jszwedko

@jszwedko integration tests ran correctly with the latest changes, mainly added #[serde(default)] to structured_metadata in the LokiConfig, as the tests were failing on only the configuration and I personally would want to avoid adding a breaking change to existing configurations.

Anyways, as far as I can replicate, the CI should succeed now!

maxboone avatar Aug 17 '24 06:08 maxboone

Great, thanks @maxboone ! Let me trigger CI here.

jszwedko avatar Aug 19 '24 22:08 jszwedko

👍

kimxogus avatar Aug 28 '24 06:08 kimxogus

@maxboone I think it just needs a docs regeneration using make generate-component-docs so that the tests pass

miadabrin avatar Aug 28 '24 13:08 miadabrin

@maxboone I think it just needs a docs regeneration using make generate-component-docs so that the tests pass

good call! will do in a minute

maxboone avatar Aug 28 '24 14:08 maxboone

@maxboone Could you run make generate-component-docs?

kimxogus avatar Sep 02 '24 02:09 kimxogus

I was in the area so I regenerated them.

jszwedko avatar Sep 03 '24 19:09 jszwedko

good call! will do in a minute

Yeah, unfortunately WSL2 decided to drop out that afternoon after an non-downgradeable update. But it's working again - thanks @jszwedko for generating the docs <3

maxboone avatar Sep 03 '24 19:09 maxboone

Regression Detector Results

Run ID: 0fe18c1f-c08e-4ab0-850a-53ffad0cbaad Metrics dashboard

Baseline: 06ad674ff4fbf2df82c50b787c45316c5b5c8d50 Comparison: d174d55fadaaee8e8a55223f43a6000e2814382d

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI links
file_to_blackhole egress throughput +1.09 [-5.81, +7.98]

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI links
socket_to_socket_blackhole ingress throughput +3.15 [+3.07, +3.24]
otlp_http_to_blackhole ingress throughput +2.21 [+2.09, +2.33]
syslog_log2metric_tag_cardinality_limit_blackhole ingress throughput +1.97 [+1.88, +2.06]
syslog_loki ingress throughput +1.16 [+1.08, +1.24]
datadog_agent_remap_blackhole_acks ingress throughput +1.11 [+0.99, +1.23]
file_to_blackhole egress throughput +1.09 [-5.81, +7.98]
http_elasticsearch ingress throughput +0.94 [+0.76, +1.12]
syslog_regex_logs2metric_ddmetrics ingress throughput +0.93 [+0.77, +1.09]
fluent_elasticsearch ingress throughput +0.86 [+0.37, +1.36]
datadog_agent_remap_blackhole ingress throughput +0.62 [+0.53, +0.70]
syslog_log2metric_humio_metrics ingress throughput +0.60 [+0.46, +0.73]
syslog_humio_logs ingress throughput +0.34 [+0.23, +0.45]
syslog_log2metric_splunk_hec_metrics ingress throughput +0.22 [+0.12, +0.31]
otlp_grpc_to_blackhole ingress throughput +0.21 [+0.10, +0.32]
http_to_http_noack ingress throughput +0.19 [+0.10, +0.29]
http_to_http_json ingress throughput +0.03 [-0.01, +0.07]
splunk_hec_to_splunk_hec_logs_acks ingress throughput +0.02 [-0.08, +0.12]
splunk_hec_to_splunk_hec_logs_noack ingress throughput +0.01 [-0.08, +0.11]
splunk_hec_indexer_ack_blackhole ingress throughput +0.01 [-0.07, +0.08]
datadog_agent_remap_datadog_logs_acks ingress throughput -0.04 [-0.20, +0.12]
http_to_s3 ingress throughput -0.08 [-0.36, +0.19]
datadog_agent_remap_datadog_logs ingress throughput -0.49 [-0.71, -0.28]
http_to_http_acks ingress throughput -0.68 [-1.95, +0.58]
splunk_hec_route_s3 ingress throughput -1.12 [-1.42, -0.81]
http_text_to_http_json ingress throughput -1.25 [-1.37, -1.14]
syslog_splunk_hec_logs ingress throughput -2.42 [-2.57, -2.27]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

github-actions[bot] avatar Sep 03 '24 21:09 github-actions[bot]