Vector stops processing http source requests after 12-24 hours of production run

diegolbx opened this issue 2 years ago

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We are running Vector, the latest 0.20 release, on an Ubuntu 22.04 LXD container. Vector is configured with three different types of sources (syslog + nginx + http) and two sinks (New Relic + Humio).

After 12-24 hours of continuous operation, at an HTTP log ingestion rate of about 6 GB/day, the Vector http source stops responding. The process is still up, and the syslog and nginx sources/sinks keep running perfectly. We have about 20 clients concurrently connected to the Vector http source at any moment; factoring in logging concurrency, there may be more than 100 connections to the http source at any time.

We did a simple telnet test on the http source port and were disconnected immediately. Running strace on Vector while the telnet test is performed shows no activity within Vector, which suggests a mutex deadlock:

strace: Process 1172 attached
futex(0x7f9a4ffdf040, FUTEX_WAIT_PRIVATE, 1, NULL
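
The same check can be scripted; here is a minimal sketch of an equivalent probe (hypothetical, assuming the source listens locally on port 8080 as configured below): it opens a TCP connection to the source port and reports whether the connection is dropped before any response comes back.

import socket

# Hypothetical probe of the http source port (assumption: local instance on 8080).
HOST, PORT = "127.0.0.1", 8080

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        # A healthy source answers an HTTP request; a hung one resets the
        # connection immediately or never responds.
        sock.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        data = sock.recv(1024)
        print("source responded" if data else "connection closed with no data")
except OSError as exc:  # connection refused/reset, timeout, etc.
    print(f"probe failed: {exc!r}")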

Configuration

We are running several sources.

  1. Syslog (UDP)
  2. Nginx
  3. HTTP (ndjson)

When the http source stops responding, both the syslog and nginx sources keep working, along with their sinks. The only issue is that the http source's TCP acceptor appears to be deadlocked and does not respond to new requests.

Below is our http config; we stripped out the working syslog/nginx configs.

[api]
enabled = true

[sources.http]
type = "http"
address = "0.0.0.0:8080"
encoding = "ndjson"

[transforms.http_parse_json]
type = "remap"
inputs = [ "http" ]
source = '''
del(.path)
del(.source_type)
del(.app)
'''

[sinks.new_relic]
  type = "new_relic_logs"
  inputs = ["http_parse_json"]
  license_key = "xxx"

[sinks.humio]
inputs = ["http_parse_json"]
type = "humio_logs"
encoding.codec = "json"
host = "https://cloud.community.humio.com"
token = "xxx"

Version

0.20

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

diegolbx avatar Apr 04 '22 08:04 diegolbx

We have written a script that monitors the HTTP source port on the Vector instance every 1 minute and restarts Vector if a 200 status code is not returned for an empty JSON {} payload.
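
Roughly, the check works like this (a minimal sketch for illustration; the actual monitor script is not included here, and the systemd restart command is an assumption):

import subprocess
import time
import urllib.error
import urllib.request

URL = "http://127.0.0.1:8080/"  # assumption: local http source port

def http_source_status() -> int:
    """POST an empty JSON object and return the HTTP status, or 0 if no answer."""
    req = urllib.request.Request(URL, data=b"{}", method="POST")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0

while True:
    status = http_source_status()
    if status != 200:
        print(f"{time.asctime()} vector http died: {status:03d} restart vector")
        # Assumption: Vector runs as a systemd service named "vector".
        subprocess.run(["systemctl", "restart", "vector"], check=False)
    time.sleep(60)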

Notice the frequency at which the Vector HTTP source locks up. Based on our logs, the lockups are much more frequent than previously thought, at a rate of about one per hour on average.

@jszwedko We don't know what more info we can provide so the Vector team can fix this pretty serious bug. We are willing to compile custom debug builds if required; just let us know.

The Vector HTTP source is handling about 15 GB of uncompressed JSON daily, for reference. Again, the non-HTTP sources (syslog and others) are not affected; they keep working even when the HTTP source locks up.

EDIT: We just updated to the 0.21 release and are holding on to a thread of hope that this version may fix our issue.

Tue Apr 19 01:11:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 01:12:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 02:51:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 02:52:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 04:26:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:52:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 07:43:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 07:44:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:02:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:03:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:04:02 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:28:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:29:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:05:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:06:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:06:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 03:10:02 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 03:11:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:01:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:02:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:32:01 PM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 12:19:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 12:20:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:16:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:32:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:33:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 03:21:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 03:22:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 04:11:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 04:12:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:05:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:06:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:07:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 05:48:01 PM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 05:49:01 PM UTC 2022 vector http died: 000 restart vector
Thu Apr 21 12:41:01 AM UTC 2022 vector http died: 000 restart vector

Qubitium avatar Apr 21 '22 02:04 Qubitium

After updating to 0.21 and observing for a few hours, we can confirm no change; the HTTP source port hanging issue still persists at random intervals.

Qubitium avatar Apr 21 '22 10:04 Qubitium

Hi @diegomontoya !

Thanks for this report. We haven't yet been able to dig into it, but I appreciate your willingness to run a custom build for debugging. If we aren't able to reproduce, that would definitely be helpful.

jszwedko avatar Apr 22 '22 15:04 jszwedko

I believe I am running into this or a very similar bug as well, but with a kafka source. I'm running v0.23.3 with a few bugfix patches that I backported to the 0.23.x branch.

strace shows the same output:

strace: Process 18730 attached
futex(0x7f52b7a79248, FUTEX_WAIT_PRIVATE, 1, NULL

One vector thread is stuck at 100% CPU, the source is not polling kafka to consume new messages, and the client eventually times out. So it looks like the source future stops getting polled for some reason.

I have a similar remap transform as shown in the OP, and a kafka sink. Might there be a mutex somewhere along the topology that is somehow not getting released? Very happy to try patches or investigate a fix if someone can point me in a direction to look in the code!

jches avatar Sep 21 '22 14:09 jches

The problem described here seems similar to the behavior I see with a vector sink as well: https://github.com/vectordotdev/vector/issues/14932.

scirner22 avatar Oct 27 '22 02:10 scirner22

Excluding the latest comment, there appear to have been two distinct reported occurrences of this issue.

The common denominator between them is the remap transform.

neuronull avatar Feb 07 '23 00:02 neuronull

Same problem here, with a journald source and a Datadog sink. After a while we experience a "deadlock":

# strace -p 2451662
strace: Process 2451662 attached
futex(0x7fdaa90d84e0, FUTEX_WAIT_PRIVATE, 1, NULL

Restarting the service doesn't help.

arouene avatar Mar 24 '23 10:03 arouene

I was able to resolve my problem by reducing the batch size in the sink configuration. Here's the updated configuration that worked for me:

[sinks.datadog]
type = "datadog_logs"
...
batch.max_bytes = 42500
batch.max_events = 100

It's worth noting that I have other machines with similar configurations and more logs, but they are working perfectly fine without the batch configuration. It seems that the issue is specific to this particular machine's setup :man_shrugging:

arouene avatar Mar 24 '23 12:03 arouene

I also ran into this issue with 0.28.1. It appears that the default batch settings for the datadog sink can lead to vector hanging. I set mine to batch.max_events = 10000 and it appears to have fixed the issue.

@arouene As for why only certain machines are experiencing it: from my testing, it requires a spike in events. By default, a batch is triggered when either 2 seconds have passed or 100000 events have been collected. I'm guessing the machines that weren't working see spikes of events that trigger whatever batching bug we're hitting.

kinghrothgar avatar Mar 24 '23 20:03 kinghrothgar

An attempt to summarize the above: the http source causes a Vector hang (deadlock?) after receiving big spikes of events over a period of 12+ hours.

pront avatar Aug 14 '23 14:08 pront