Internal Metrics source does not consistently emit discarded/dropped events for s3 sink
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
The Internal Metrics source does not seem to emit the component_errors_total and component_discarded_events_total metrics consistently for the s3 sink.
We use s3 as a sink for app logs, and datadog metrics as a sink for internal metrics (s3 component info).
Setup:
I set the s3 data sink bucket to a random bucket the application does not have permissions to access. I then send requests to my service to generate some logs. After the logs reach the max batch size, I can see the following (expected):
2024-09-26T19:46:17.619174Z ERROR sink{component_kind="sink" component_id=event_log_s3 component_type=aws_s3}:request{request_id=1}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1000 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
Stdout log
I can also see that the vector.component_received_events_total and vector.component_sent_events_total line up with my expectations in datadog (~1k events, 0 events respectively) during the request period.
However, when looking at the component_errors_total and component_discarded_events_total values (in datadog), they do not behave as expected. They seem to have the following behaviors:
- On service start up, before requests are sent and while the service sits idle for a few minutes, they emit nothing. (vector.component_sent_events_total and vector.component_received_events_total both emit 0 counts during this idle period.)
- At the time of the first failure to sink to s3, both component_errors_total and component_discarded_events_total show up for the first time as 0 counts and miss the first failure event. On subsequent publishes/sinks to s3, they seem to emit values as expected. (The first metric appears in datadog at 2024-09-26T19:47:00Z for both; only 0s are emitted until the first non-zero value 10 minutes later, at the next sink.)
My expectation would be that the metrics show component_discarded_events_total=1000 and component_errors_total=1 when the first failure happens. I would also have expected a constant stream of absolute metrics (0, ...) from app startup.
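One way to narrow down whether the gap is in the internal_metrics source or in the datadog_metrics sink would be to tee the same internal metrics into a console sink and compare what Vector emits locally against what lands in datadog. A minimal sketch (the sink name is illustrative; it assumes the vector_metrics source from the configuration below):

sinks:
  debug_internal_metrics:
    # Print the raw internal metric events to stdout for comparison
    type: console
    inputs:
      - "vector_metrics"
    encoding:
      codec: "json"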
Configuration
sources:
  service_logs:
    type: exec
    command:
      - <APP_START COMMAND>
    mode: streaming
  vector_metrics:
    # Periodically capture some metrics from vector itself
    # to monitor events rec'd/dropped for s3/datadog sinks
    type: internal_metrics
    scrape_interval_secs: 60 # I've tried 30 seconds as well, which doesn't seem to do anything
transforms:
  augmented_vector_metrics:
    inputs:
      - "filtered_vector_metrics"
    type: remap
    source: |-
      .tags.service = get_env_var!("DD_SERVICE")
      .tags.env = get_env_var!("DD_ENV")
sinks:
  event_log_s3:
    # placeholder for validated event_logs --> s3
    type: aws_s3
    inputs:
      - "service_logs"
    bucket: "<s3_bucket_i_dont_have_perms_for>"
    encoding:
      codec: "json"
    framing:
      method: "newline_delimited"
    filename_append_uuid: true
    key_prefix: "...."
    batch:
      max_events: 1000
      timeout_secs: 600
  datadog_metrics:
    # metrics
    inputs:
      - "augmented_vector_metrics"
    type: "datadog_metrics"
    default_api_key: "${DATADOG_API_KEY}"
    default_namespace: "${DD_SERVICE}"
    batch:
      # set batch size to 1 as vector will do additional aggregation
      # within a batch
      max_events: 1
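Note that augmented_vector_metrics takes its input from a filtered_vector_metrics transform that is not included in the snippet above. Purely as an illustration of the shape such a step usually takes (this is not the reporter's actual transform), a filter that passes only the counters discussed here through to datadog could look like:

transforms:
  filtered_vector_metrics:
    # Hypothetical filter: keep only the internal metrics we care about
    type: filter
    inputs:
      - "vector_metrics"
    condition: |-
      includes([
        "component_received_events_total",
        "component_sent_events_total",
        "component_errors_total",
        "component_discarded_events_total"
      ], .name)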
Version
0.41.1
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response
Thanks for this report @sonjali2023. It does sound like there might be an issue with the initial emission that is causing you to see a 0 value rather than 1. Regarding publishing the metric with a 0 value initially, this would be covered by this feature request: https://github.com/vectordotdev/vector/issues/6532.
Got it, thanks @jszwedko
👋 hi @sonjali2023
Wondering if you could share more of the vector log output leading up to the first time that we logged the error/dropped event.
We only emit the error/dropped event when retries have been exhausted. I'm noticing in the AWS sink's retry logic that we have a few cases where we explicitly do not retry, but we have a catch-all case where we do retry.
What I'm wondering is whether this scenario is falling into the case where we are retrying the request.
Whether or not we should be retrying that particular scenario is an open question, but at least knowing this bit of info would help us rule out something unexpected going on.
Thanks!
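For reference, one way to check whether retries are masking the first emission would be to cap the sink's retries so the error/dropped events fire close to the first failed request. A rough sketch, assuming the aws_s3 sink's standard request.retry_attempts setting (the value here is arbitrary):

sinks:
  event_log_s3:
    type: aws_s3
    # ...existing options from the configuration above...
    request:
      # Give up quickly so component_errors_total / component_discarded_events_total
      # are emitted close to the first failed request
      retry_attempts: 2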
Hi @neuronull, thanks for the response! How much info/what info are you looking for from the vector logs? I can try to reproduce again once you let me know.
I'm actually wondering if this is more of an issue related to the datadog_metrics sink + vector internal_metrics integration implementation? I briefly mentioned in my original post that I can see the expected metric in the stdout log but not reflected in the datadog metrics.
Hey @sonjali2023 sorry for the delayed response! Thanks for calling that out, I missed that in my initial read on the issue. At a glance, this could be related to https://github.com/vectordotdev/vector/issues/13863.
But I think the next step here is to do a reproduction locally and see if we can get to the bottom of where that discrepancy is coming from.
I can say it is on our radar 🙏
Thanks for the response @neuronull! That totally makes sense. Missed this issue when I was searching.