Internal Metrics source does not consistently emit discarded/dropped events for s3 sink
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
The Internal Metrics source does not seem to emit the component_errors_total and component_discarded_events_total metrics consistently for the s3 sink.
We use s3 as a sink for app logs, and datadog metrics as a sink for internal metrics (s3 component info).
Setup:
I set the s3 data sink bucket to a random bucket the application does not have permissions to access. I then send requests to my service to generate some logs. After the logs reach the max batch size, I can see the following (expected):
2024-09-26T19:46:17.619174Z ERROR sink{component_kind="sink" component_id=event_log_s3 component_type=aws_s3}:request{request_id=1}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1000 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
Stdout log
I can also see that the vector.component_received_events_total and vector.component_sent_events_total line up with my expectations in datadog (~1k events, 0 events respectively) during the request period.
However, when looking at the component_errors_total and component_discarded_events_total values (in datadog), they do not behave as expected. They seem to have the following behaviors:
- On service start up, before requests are sent and while the service sits idle for a few minutes, they emit nothing. (vector.component_sent_events_total and vector.component_received_events_total both emit 0 counts during this idle period.)
- At the time of the first failure to sink to s3, both component_errors_total and component_discarded_events_total show up for the first time as 0 counts and miss the first failure event. On subsequent publishes/sinks to s3, they seem to emit values as expected. (The first metric appears in datadog at 2024-09-26T19:47:00Z for both; only 0s are emitted until the first non-zero value 10 minutes later, at the next sink.)
My expectation would be that the metrics show component_discarded_events_total=1000 and component_errors_total=1 when the first failure happens. I would also have expected a constant stream of absolute metrics (0, ...) from app startup.
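One way to narrow down whether the gap is in the internal_metrics source or in the datadog_metrics sink would be to tee the same internal metrics into a console sink and compare what Vector emits locally against what lands in datadog. A minimal sketch (the sink name is illustrative; it assumes the vector_metrics source from the configuration below):

sinks:
  debug_internal_metrics:
    # Print the raw internal metric events to stdout for comparison
    type: console
    inputs:
      - "vector_metrics"
    encoding:
      codec: "json"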
Configuration
sources:
  service_logs:
    type: exec
    command:
      - <APP_START COMMAND>
    mode: streaming
  vector_metrics:
    # Periodically capture some metrics from vector itself
    # to monitor events rec'd/dropped for s3/datadog sinks
    type: internal_metrics
    scrape_interval_secs: 60 # I've tried 30 seconds as well, which doesn't seem to do anything
transforms:
  augmented_vector_metrics:
    inputs:
      - "filtered_vector_metrics"
    type: remap
    source: |-
      .tags.service = get_env_var!("DD_SERVICE")
      .tags.env = get_env_var!("DD_ENV")
sinks:
  event_log_s3:
    # placeholder for validated event_logs --> s3
    type: aws_s3
    inputs:
      - "service_logs"
    bucket: "<s3_bucket_i_dont_have_perms_for>"
    encoding:
      codec: "json"
    framing:
      method: "newline_delimited"
    filename_append_uuid: true
    key_prefix: "...."
    batch:
      max_events: 1000
      timeout_secs: 600
  datadog_metrics:
    # metrics
    inputs:
      - "augmented_vector_metrics"
    type: "datadog_metrics"
    default_api_key: "${DATADOG_API_KEY}"
    default_namespace: "${DD_SERVICE}"
    batch:
      # set batch size to 1 as vector will do additional aggregation
      # within a batch
      max_events: 1
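Note that augmented_vector_metrics takes its input from a filtered_vector_metrics transform that is not included in the snippet above. Purely as an illustration of the shape such a step usually takes (this is not the reporter's actual transform), a filter that passes only the counters discussed here through to datadog could look like:

transforms:
  filtered_vector_metrics:
    # Hypothetical filter: keep only the internal metrics we care about
    type: filter
    inputs:
      - "vector_metrics"
    condition: |-
      includes([
        "component_received_events_total",
        "component_sent_events_total",
        "component_errors_total",
        "component_discarded_events_total"
      ], .name)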
Version
0.41.1
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response
Thanks for this report @sonjali2023. It does sound like there might be an issue with the initial emission that is causing you to see a 0 value rather than 1. Regarding publishing the metric with a 0 value initially, this would be covered by this feature request: https://github.com/vectordotdev/vector/issues/6532.
Got it, thanks @jszwedko
👋 hi @sonjali2023
Wondering if you could share more of the vector log output leading up to the first time that we logged the error/dropped event.
We only emit the error/dropped event when retries have been exhausted. I'm noticing in the AWS sink's retry logic that we have a few cases where we explicitly do not retry, but we have a catch-all case where we do retry.
What I'm wondering is whether this scenario is falling into the case where we are retrying the request.
Whether or not we should be retrying that particular scenario is an open question, but at least knowing this bit of info would help us rule out something unexpected going on.
Thanks!
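For reference, one way to check whether retries are masking the first emission would be to cap the sink's retries so the error/dropped events fire close to the first failed request. A rough sketch, assuming the aws_s3 sink's standard request.retry_attempts setting (the value here is arbitrary):

sinks:
  event_log_s3:
    type: aws_s3
    # ...existing options from the configuration above...
    request:
      # Give up quickly so component_errors_total / component_discarded_events_total
      # are emitted close to the first failed request
      retry_attempts: 2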
Hi @neuronull, thanks for the response! How much info/what info are you looking for from the vector logs? I can try to reproduce again once you let me know.
I'm actually wondering if this is more of an issue related to the datadog_metrics sink + vector internal_metrics integration implementation? I briefly mentioned in my original post that I can see the expected metric in the stdout log but not reflected in the datadog metrics.
Hey @sonjali2023 sorry for the delayed response! Thanks for calling that out, I missed that in my initial read on the issue. At a glance, this could be related to https://github.com/vectordotdev/vector/issues/13863.
But I think the next step here is to do a reproduction locally and see if we can get to the bottom of where that discrepancy is coming from.
I can say it is on our radar 🙏
Thanks for the response @neuronull! That totally makes sense. Missed this issue when I was searching.