
DNSTAP socket shows errors after operating for some time

[Open] johnhtodd opened this issue 8 months ago · 3 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

After a period of time, a heavily loaded DNSTAP ingestion system shows socket errors on the connected sockets that are transmitting DNSTAP data to it. It is unclear whether this is a reporting artifact or a real error.

Configuration

In this config, I use environment variables:

VECTORIPV4=10.10.3.100
VECTORDNSTAPPORT=59001


  main_dnstap_1:
    type: "dnstap"
    mode: tcp
    address: "${VECTORIPV4:?err}:${VECTORDNSTAPPORT:?err}"
    permit_origin: ["${VECTORIPV4NETWORK:?err}", "127.0.0.1/32"]
    lowercase_hostnames: true
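
For reference, here is the same source with the variables mechanically substituted. The value of VECTORIPV4NETWORK is not shown above, so a placeholder stands in for it:

  main_dnstap_1:
    type: "dnstap"
    mode: tcp
    # VECTORIPV4:VECTORDNSTAPPORT after substitution
    address: "10.10.3.100:59001"
    # <VECTORIPV4NETWORK> is a placeholder for the unshown CIDR value
    permit_origin: ["<VECTORIPV4NETWORK>", "127.0.0.1/32"]
    lowercase_hostnames: true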

Version

vector 0.39.0 (x86_64-unknown-linux-gnu)

Debug Output

This is not a crash, so no backtrace is included, though I can attempt to capture one if desired.

2024-06-24T15:18:42.687252Z ERROR source{component_kind="source" component_id=main_dnstap_1 component_type=dnstap}:connection{peer_addr=10.10.3.232:35402}:connection: vector::internal_events::tcp: TCP socket error. error=bytes remaining on stream peer_addr=10.10.3.232:35402 error_type="connection_failed" stage="processing" internal_log_rate_limit=true
2024-06-24T15:18:42.687482Z ERROR source{component_kind="source" component_id=main_dnstap_1 component_type=dnstap}:connection{peer_addr=10.10.3.232:35402}:connection: vector::internal_events::tcp: Internal log [TCP socket error.] is being suppressed to avoid flooding.

Example Data

No response

Additional Context

This particular system I am testing on has two high-volume ingestion streams (>14 kqps each) and two low-volume streams (~30 qps each), connected to two different dnsdist instances and feeding two different contexts. After some period of time, the error rate on the high-volume context importing from the dnstap source jumps from zero to around 180,000 per second, which doesn't make sense: how can there be more errors than ingested elements? I'm graphing this with "irate(vector_component_errors_total[5m])" in Prometheus/Grafana. I suspect this is one of the two server sockets showing signs of the problem. Then, after a random number of hours (often measured in days), the number of errors jumps to around double that amount. (See the graph below, which shows one stream as "bad" for several days, then a spike when the other large stream starts showing errors.)
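
To confirm which source is producing the errors, a per-component breakdown of the same counter may help. This is a sketch assuming the standard labels on Vector's internal metrics (component_id and error_type, matching the fields visible in the log lines above):

  # Error rate broken down by component and error class
  sum by (component_id, error_type) (
    irate(vector_component_errors_total[5m])
  )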

Strangely, I see no evidence of this increased error rate on any of the downstream components, neither in their graph data (I graph pretty much everything coming out of Vector) nor in the actual output generated at the end of the pipelines. Are these errors real? The debug messages certainly seem to indicate that there is a problem.

Other items of note: reloading Vector does not cure the problem. Even more unusually, "systemctl restart vector" also does not cure it. Only "systemctl stop vector; systemctl start vector" causes the error graph to drop to zero and the error messages to stop being generated.
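
In other words (assuming "reloading" here means systemctl reload or an equivalent SIGHUP):

  systemctl reload vector                          # errors persist
  systemctl restart vector                         # errors persist
  systemctl stop vector; systemctl start vector    # errors drop to zero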

This machine is fairly heavily loaded and performs only Vector tasks (>85% utilization across all cores in htop at some parts of the day).

There are other DNSTAP parsing errors seen sporadically, but they seem to be related to malformed DNS packets or elements in the DNSTAP message that are not yet parsed fully. I did not include those log lines.

I have other protobuf sockets operating on this system that are fairly busy (four at ~3 kqps each) but which are not DNSTAP. I also have many v6 Kafka streams as sinks.

(Screenshot: error-rate graph, 2024-06-24, 8:38 AM)

References

No response

johnhtodd · Jun 26 '24 03:06