vector icon indicating copy to clipboard operation
vector copied to clipboard

Metrics Stop Prematurely at Program Shutdown

Open dmuth opened this issue 1 year ago • 2 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When I run Vector from the command line with a single data source of demo data for a few seconds, then terminate the process with ctrl-C, the values for Metrics stop being updated prematurely while events are still written.

This is my first ever Issue with Vector, so if I missed something really obvious or more troubleshooting needs to be done on my end please do let me know! :-)

Configuration

[sources.in]
type = "internal_metrics"
scrape_interval_secs = 1

[sources.in_demo]
type = "demo_logs"
format = "apache_common"
interval = 1
#interval = 5
#interval = 10


[transforms.tf]
inputs = [ "in" ]
type = "metric_to_log"


[sinks.json]
inputs = ["tf"]
type = "file"
path = "output-metrics.json"
encoding.codec = "json"


[sinks.out_demo]
inputs = ["in_demo"]
type = "file"
path = "output-demo.json"
encoding.codec = "json"

Version

vector 0.23.3 (x86_64-unknown-linux-musl af8c9e1 2022-08-10)

Debug Output

https://gist.github.com/dmuth/736ce3e8df596de6eff41376a0896556

Example Data

N/A

Additional Context

From a sample run (vector -c ./internal-metrics.toml), with interval = 1:

$ wc -l output-demo.json 
8 output-demo.json

$ cat output-metrics.json |jq -C '. | select(.name == "events_in_total" and .tags.component_name == "in_demo")
[...]
{
  "counter": {
    "value": 7
  },
  "kind": "absolute",
  "name": "events_in_total",
  "namespace": "vector",
  "tags": {
    "component_id": "in_demo",
    "component_kind": "source",
    "component_name": "in_demo",
    "component_type": "demo_logs"
  },
  "timestamp": "2023-01-11T20:52:38.202574632Z"
}

Expected value for the metrics count would be 8, matching the number of events in the output file.

When I set interval = 5 and run Vector for 6 seconds, here's what I see:

$ wc -l output-demo.json 
3 output-demo.json


$ cat output-metrics.json |jq -C '. | select(.name == "events_in_total" and .tags.component_name == "in_demo")
[...]
{
  "counter": {
    "value": 2
  },
  "kind": "absolute",
  "name": "events_in_total",
  "namespace": "vector",
  "tags": {
    "component_id": "in_demo",
    "component_kind": "source",
    "component_name": "in_demo",
    "component_type": "demo_logs"
  },
  "timestamp": "2023-01-11T20:54:29.671992619Z"
}

Watching the execution in real-time provides a clue: when I hit ctrl-C about 6 seconds in, Vector runs for another 4 seconds or so, and the third event is written to the output file. This suggests that when ctrl-C is received, metrics immediately stop being generated, buffers are flushed/pending events are processed, and then the Vector process exits.

References

None that I could find after doing a few searches here.

Thank you for your time, and all the effort that's gone into Vector so far!

-- Doug

dmuth avatar Jan 11 '23 21:01 dmuth

Hey @dmuth - looks legit! Thanks for the thorough report. I suspect what's happening (basically what you suggested) is all of the components attempt to finish and shutdown at the same time, and the internal_metrics source happens to do it faster than the demo_logs source.

I can envision a tiered shutdown where the "actual event streams" are finished, then the "internal event streams", and then finally cleanly shutdown - to not lose any of the telemetry. It could get complicated identifying/separating components into tiers though.

spencergilbert avatar Jan 11 '23 21:01 spencergilbert

@spencergilbert as an ad-hoc solution instead of implementing "true" tiering, I could suggest just dropping "internal_metrics" source as the last one. Not a pretty solution but should work :) Is it worth it? I don't know, honestly.

zamazan4ik avatar Jan 12 '23 09:01 zamazan4ik

This likely affects the internal_logs source as well as both this and internal_metrics will shut down right away when receiving the signal.

jszwedko avatar Mar 13 '23 15:03 jszwedko

Closed by https://github.com/vectordotdev/vector/pull/17741

jszwedko avatar Aug 02 '23 17:08 jszwedko