
Disk buffers should close at shut-down

Open anil-db opened this issue 1 year ago • 11 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

If the S3 sink is batching and the batch is not full at shutdown, the sink keeps waiting for the batch to fill or for its timeout to expire. The grace period may run out first, in which case those events are not delivered. Depending on the type of buffer used, this causes data loss or delayed delivery, plus an ungraceful shutdown.

Configuration

data_dir = "/tmp/vector/data"
schema.log_namespace = false
acknowledgements.enabled = true
[api]
enabled = true
[sources.dummy_logs]
type = "http_server"
address = "0.0.0.0:443"
decoding.codec = "bytes"
tls.enabled = false
[sinks.s3]
type = "aws_s3"
bucket = "my-bucket"
batch.max_events = 100000
batch.timeout_secs = 600
inputs = ["dummy_logs"]
encoding.codec = "raw_message"
buffer.max_size = 1000000000
buffer.type = "disk"

Version

0.35.0

Debug Output

2024-01-11T20:41:58.799877Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info"
2024-01-11T20:41:58.800877Z  INFO vector::app: Loading configs. paths=["vector.toml"]
2024-01-11T20:41:59.365666Z  INFO vector::topology::running: Running healthchecks.
2024-01-11T20:41:59.365865Z  INFO vector: Vector has started. debug="true" version="0.35.0" arch="x86_64" revision=""
2024-01-11T20:41:59.366167Z  INFO source{component_kind="source" component_id=dummy_logs component_type=http_server}: vector::sources::util::http::prelude: Building HTTP server. address=0.0.0.0:443
2024-01-11T20:41:59.367863Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=http://127.0.0.1:8686/playground
2024-01-11T20:41:59.639459Z ERROR vector::topology::builder: msg="Healthcheck failed." error=Unknown status code: 301 Moved Permanently component_kind="sink" component_type="aws_s3" component_id=s3
^C2024-01-11T20:42:09.619090Z  INFO vector::signal: Signal received. signal="SIGINT"
2024-01-11T20:42:09.619180Z  INFO vector: Vector has stopped.
2024-01-11T20:42:09.620726Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="59 seconds left"
2024-01-11T20:42:14.620626Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="54 seconds left"
2024-01-11T20:42:19.621906Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="49 seconds left"
2024-01-11T20:42:24.621188Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="44 seconds left"
2024-01-11T20:42:29.621132Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="39 seconds left"
2024-01-11T20:42:34.621181Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="34 seconds left"
2024-01-11T20:42:39.621160Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="29 seconds left"
2024-01-11T20:42:44.620830Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="24 seconds left"
2024-01-11T20:42:49.620156Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="19 seconds left"
2024-01-11T20:42:54.621015Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="14 seconds left"
2024-01-11T20:42:59.620651Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="9 seconds left"
2024-01-11T20:43:04.620839Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="4 seconds left"
2024-01-11T20:43:09.620738Z ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="s3"

Example Data

No response

Additional Context

With this config, send a single HTTP request (curl -v http://localhost:443 -d "hi") and then send SIGTERM to Vector. The graceful shutdown gets stuck closing the sink.
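
A minimal sketch of those reproduction steps as a script, assuming Vector is already running with the configuration above; the pgrep-based PID lookup is an illustrative assumption:

#!/bin/bash
# Sketch of the reproduction steps above; assumes Vector is already running
# with the configuration from this issue and listening on port 443.

# Send a single event so a partial, never-full batch sits behind the disk buffer.
curl -v http://localhost:443 -d "hi"

# Trigger a graceful shutdown (PID lookup is illustrative; adjust as needed).
VECTOR_PID=$(pgrep -x vector)
kill -TERM "${VECTOR_PID}"

# Watch the shutdown hang: with the disk buffer, the sink keeps waiting for
# batch.timeout_secs (600 s in the config above) instead of flushing the partial batch.
while kill -0 "${VECTOR_PID}" 2>/dev/null; do
    echo "vector (pid ${VECTOR_PID}) still shutting down..."
    sleep 5
done
echo "vector exited"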

References

No response

anil-db avatar Jan 11 '24 20:01 anil-db

Thanks for the report! This is weird, because we actually validated this exact case recently as part of https://github.com/vectordotdev/vector/issues/11405.

jszwedko avatar Jan 11 '24 23:01 jszwedko

Is it possible the batches are being flushed but not being sent successfully? Running Vector with debug logs should make it more clear.

jszwedko avatar Jan 11 '24 23:01 jszwedko

Is it possible the batches are being flushed but not being sent successfully? Running Vector with debug logs should make it more clear.

Yes, the batches are flushed if we wait out batch.timeout_secs. If the grace period ends before that, they are not flushed. The ask is to interrupt that wait at termination. We cannot simply reduce batch.timeout_secs, because that would generate lots of small files, and we cannot use a very long grace period, because that slows down rollouts when upgrading the whole Vector cluster.

I added some print statements of my own to test. Here the batch timeout is 3 minutes and only 1 event is sent. The upload to S3 is unsuccessful, but that is not the cause: there is no retry in S3, and shutdown happens very quickly after the S3 request (https://github.com/vectordotdev/vector/issues/19601).

SIGINT happened at 2024-01-12T00:09:41.804809Z (INFO vector::signal: Signal received. signal="SIGINT"), but the batch was only handed over to the request builder at 2024-01-12T00:12:35.405263Z (INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: handing over closed batch).

After that, shutdown happened very quickly, regardless of the S3 upload failing.

2024-01-12T00:09:35.400610Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: batch an item
2024-01-12T00:09:35.400819Z  INFO vector_buffers::topology::channel::limited_queue: LimitedSender droping
^C2024-01-12T00:09:41.804809Z  INFO vector::signal: Signal received. signal="SIGINT"
2024-01-12T00:09:41.805043Z  INFO vector: Vector has stopped.
2024-01-12T00:09:41.805424Z  INFO source{component_kind="source" component_id=dummy_logs component_type=http_server}: vector_buffers::topology::channel::limited_queue: LimitedSender droping
2024-01-12T00:09:41.806014Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3, dummy_logs" time_remaining="no time limit"
2024-01-12T00:09:46.806849Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:09:51.806873Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:09:56.806922Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:01.808060Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:06.807093Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:11.807035Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:16.807043Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:21.807173Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:26.807283Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:31.807373Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:36.806724Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:41.807554Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:46.807464Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:51.807714Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:10:56.807798Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:01.807787Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:06.807061Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:11.808624Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:16.807667Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:21.808804Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:26.808178Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:31.807523Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:36.807014Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:41.807271Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:46.807199Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:51.808635Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:11:56.808766Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:01.807392Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:06.808010Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:11.808101Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:16.808366Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:21.808242Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:26.808299Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:31.807982Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"
2024-01-12T00:12:35.404961Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: timer expired for batch
2024-01-12T00:12:35.405263Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: handing over closed batch
2024-01-12T00:12:35.653302Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector::sinks::util::retries: Non-retriable error; dropping the request. error=service error internal_log_rate_limit=true
2024-01-12T00:12:35.653924Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(ServiceError(ServiceError { source: PutObjectError { kind: Unhandled(Unhandled { source: Error { code: Some("PermanentRedirect"), message: Some("The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."), request_id: Some("XKEWR97TNAT9TW59"), extras: {"s3_extended_request_id": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck="} } }), meta: Error { code: Some("PermanentRedirect"), message: Some("The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."), request_id: Some("XKEWR97TNAT9TW59"), extras: {"s3_extended_request_id": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck="} } }, raw: Response { inner: Response { status: 301, version: HTTP/1.1, headers: {"x-amz-bucket-region": "us-east-1", "x-amz-request-id": "XKEWR97TNAT9TW59", "x-amz-id-2": "Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Fri, 12 Jan 2024 00:12:35 GMT", "server": "AmazonS3", "connection": "close"}, body: SdkBody { inner: Once(Some(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>s3.amazonaws.com</Endpoint><Bucket>my-bucket</Bucket><RequestId>XKEWR97TNAT9TW59</RequestId><HostId>Ujv/gt/b7NTrIsMxcUk5JbklpDMSdCebsT55q59xlqC36uY2h1F7T04naarIrJLz0fe61E+h0Ck=</HostId></Error>")), retryable: true } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } })) request_id=1 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-01-12T00:12:35.654299Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
2024-01-12T00:12:35.654414Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: droping event finalizer
2024-01-12T00:12:35.654451Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: updating status AtomicCell { value: Delivered } Rejected
2024-01-12T00:12:35.654500Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: droping notifier
2024-01-12T00:12:35.654528Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}:request{request_id=1}: vector_common::finalization: sending drop status Rejected
2024-01-12T00:12:35.654847Z  INFO sink{component_kind="sink" component_id=s3 component_type=aws_s3}: vector_stream::partitioned_batcher: stream close see if there unfinished batch
2024-01-12T00:12:36.808033Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="s3" time_remaining="no time limit"

anil-db avatar Jan 12 '24 00:01 anil-db

Thanks for the additional detail. This seems like a bug then since it was working when we verified https://github.com/vectordotdev/vector/issues/11405

jszwedko avatar Jan 12 '24 00:01 jszwedko

Thanks for the additional detail. This seems like a bug then since it was working when we verified #11405

It seems to only happen when we use a disk buffer with the S3 sink. When using a memory buffer, the flush triggers immediately at shutdown.

anil-db avatar Jan 17 '24 18:01 anil-db

Thanks for the additional detail. This seems like a bug then since it was working when we verified #11405

It seems to only happen when we use a disk buffer with the S3 sink. When using a memory buffer, the flush triggers immediately at shutdown.

I was gonna ask about that actually. When Vector is shutting down it will attempt to flush the buffers, including disk buffers. In the future, we'd like to have the disk buffers just close during shutdown rather than trying to flush. Do you think, in your case, that it could be waiting to try to flush all of the data out of the disk buffer? Or do you think the disk buffer is empty but it is still waiting to shut down until the batch timeouts trigger?

jszwedko avatar Jan 17 '24 18:01 jszwedko

Thanks for the additional detail. This seems like a bug then since it was working when we verified #11405

It seems to only happen when we use a disk buffer with the S3 sink. When using a memory buffer, the flush triggers immediately at shutdown.

I was gonna ask about that actually. When Vector is shutting down it will attempt to flush the buffers, including disk buffers. In the future, we'd like to have the disk buffers just close during shutdown rather than trying to flush. Do you think, in your case, that it could be waiting to try to flush all of the data out of the disk buffer? Or do you think the disk buffer is empty but it is still waiting to shut down until the batch timeouts trigger?

We had just 1 event, so the reader side of the disk buffer was already up to date. It seems the reader side of the disk buffer does not get the signal, whereas with the memory buffer, when the LimitedSender object is dropped, the LimitedReceiver side's poll finishes. That signaling is not happening in the disk buffer, as per my understanding.
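
For anyone trying to tell the two cases apart, a small diagnostic sketch (assuming the API enabled on 127.0.0.1:8686 and the data_dir from the config at the top of this issue):

# Run in a second terminal while Vector is shutting down.
# If the sink's outgoing event count keeps rising, the disk buffer is still draining;
# if it stays flat until batch.timeout_secs elapses, the sink is waiting on the batch timer.
vector top

# Rough proxy on Linux: watch the size of the disk buffer files under data_dir.
watch -n 5 du -sh /tmp/vector/data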

anil-db avatar Jan 18 '24 02:01 anil-db

Interesting, thanks for the context. I updated the title since I think it affects more than just the AWS S3 sink.

jszwedko avatar Jan 19 '24 21:01 jszwedko

Did this get clarified? Kind of worrying that flushing logs is a potential pain point here.

danielloader avatar Sep 12 '24 10:09 danielloader

I'm also curious on the status of this. This potentially affects a use case for Vector that I have.

calebschoepp avatar Oct 07 '24 15:10 calebschoepp

I've also observed this behavior in the GCS sink when using the disk buffer:

2024-10-08T11:23:17.587387Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="24 seconds left"                                                                                                                 
2024-10-08T11:23:22.587394Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="19 seconds left"                                                                                                                 
2024-10-08T11:23:27.587294Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="14 seconds left"                                                                                                                 
2024-10-08T11:23:32.586562Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="9 seconds left"                                                                                                                  
2024-10-08T11:23:37.587402Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gcs" time_remaining="4 seconds left"                                                                                                                  
ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="gcs"

benjamin-awd avatar Oct 08 '24 11:10 benjamin-awd

Also see this behaviour using HTTP sink that has connection error with disk buffer

2024-12-09T20:33:02.050407Z  WARN sink{component_kind="sink" component_id=parser component_type=http}:request{request_id=1}: vector::sinks::util::retries: Retrying after error. error=Failed to make HTTP(S) request: error trying to connect: tcp connect error: Connection refused (os error 61) internal_log_rate_limit=true
2024-12-09T20:33:03.175521Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="parser" time_remaining="29 seconds left"
2024-12-09T20:33:08.175415Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="parser" time_remaining="24 seconds left"
2024-12-09T20:33:13.175147Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="parser" time_remaining="19 seconds left"
2024-12-09T20:33:18.175398Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="parser" time_remaining="14 seconds left"
2024-12-09T20:33:23.175487Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="parser" time_remaining="9 seconds left"

maxbolgarin avatar Dec 09 '24 20:12 maxbolgarin

Interesting, thanks for the context. I updated the title since I think it affects more than just the AWS S3 sink.

I'd like to ping status on this again. The issue isn't whether batches are still delivered during a graceful shutdown; the issue is that they aren't delivered immediately. So if you have a reasonable 300-second batch timeout, your graceful shutdown window needs to be longer than 300 seconds.

The change I'd like to see is, when a graceful shutdown is triggered, all outstanding batches and buffers should attempt to drain immediately.

timaelliott avatar Jul 04 '25 14:07 timaelliott

Hello! We're also facing a problem where Vector can't perform a graceful shutdown when using a disk buffer.

In our Kubernetes environment, logs weren't being sent to S3 correctly unless terminationGracePeriodSeconds was greater than batch.timeout_secs.

After some investigation, it seems that Vector might not be flushing the logs remaining in the buffer even after receiving a SIGTERM signal. Parameters like VECTOR_NO_GRACEFUL_SHUTDOWN_LIMIT didn't solve the issue; they just changed whether Vector or Kubernetes performed the forced shutdown.
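
As a stopgap, a mitigation sketch (an assumption on our side, not a confirmed fix, using the VECTOR_GRACEFUL_SHUTDOWN_LIMIT_SECS variable and the 600-second batch.timeout_secs from the config at the top of this issue): keep every shutdown window longer than the batch timeout so the timer can fire before the process is killed.

# Mitigation sketch, not a fix: make the shutdown window outlast batch.timeout_secs
# so the batch timer fires before Vector is killed.
export VECTOR_GRACEFUL_SHUTDOWN_LIMIT_SECS=660   # > batch.timeout_secs (600 s)
vector --config vector.toml

# In Kubernetes, terminationGracePeriodSeconds must also exceed that value,
# otherwise the kubelet sends SIGKILL before the batch timer expires.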

okazaki-kk avatar Oct 16 '25 08:10 okazaki-kk

I conducted tests to reproduce this issue and can confirm the behavior difference:

Test Setup:

  • 5 test events written to a file source
  • S3 sink with max_events: 100000 and timeout_secs: 300
  • Graceful shutdown initiated with SIGTERM after 3 seconds
  • Tested on macOS, but the same behavior likely occurs on Docker, Kubernetes, etc.

Expected Behavior:

  • The graceful shutdown is performed upon receiving a SIGTERM, and the logs remaining in the buffer are flushed correctly

Results:

  • Disk buffer (bash test-shutdown.sh disk): Failed to send logs to S3, data was lost
  • log: 2025-10-28T05:50:41.885991Z ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="s3" (Vector was killed when VECTOR_GRACEFUL_SHUTDOWN_LIMIT_SECS elapsed)
  • Memory buffer (bash test-shutdown.sh memory): Successfully sent all logs to S3

This confirms that disk buffers do not properly flush incomplete batches during graceful shutdown, while memory buffers handle the same scenario correctly.


Reproducible Example

test-shutdown.sh
#!/bin/bash
# Tests graceful shutdown behavior with disk vs memory buffers
# Usage: bash test-shutdown.sh [disk|memory]

set -e

# Configuration
BUFFER_TYPE=$1  # disk or memory
CONFIG_FILE="test-shutdown-${BUFFER_TYPE}.yaml"
LOG_FILE="shutdown-test-${BUFFER_TYPE}.log"
TEST_LOG_FILE="test.log"

echo "========================================"
echo "Testing with ${BUFFER_TYPE} buffer"
echo "========================================"

# Cleanup previous test artifacts
rm -f "${LOG_FILE}"
rm -f "${TEST_LOG_FILE}"

# Create empty log file for Vector to watch
touch "${TEST_LOG_FILE}"

# Start Vector in background with trace logging
echo "Starting Vector..."
vector --config "${CONFIG_FILE}" -vvv > "${LOG_FILE}" 2>&1 &
VECTOR_PID=$!

echo "Vector PID: ${VECTOR_PID}"
echo "Waiting for Vector to start..."
sleep 10

# Write test events after Vector has started file watching
# Using small number of events so batch won't fill (max_events=100000)
echo "Writing test events to ${TEST_LOG_FILE}..."
cat >> "${TEST_LOG_FILE}" <<EOF
{"message": "test event 1", "timestamp": "2025-01-01T00:00:00Z", "level": "info"}
{"message": "test event 2", "timestamp": "2025-01-01T00:00:01Z", "level": "info"}
{"message": "test event 3", "timestamp": "2025-01-01T00:00:02Z", "level": "debug"}
{"message": "test event 4", "timestamp": "2025-01-01T00:00:03Z", "level": "info"}
{"message": "test event 5", "timestamp": "2025-01-01T00:00:04Z", "level": "warn"}
EOF

sleep 3
# Send SIGTERM for graceful shutdown
echo ""
echo "Sending SIGTERM to Vector (PID: ${VECTOR_PID})..."
kill -TERM ${VECTOR_PID}

# Monitor shutdown progress
echo "Monitoring shutdown..."
WAIT_COUNT=0
START_TIME=$(date +%s)

while kill -0 ${VECTOR_PID} 2>/dev/null; do
    sleep 1
    WAIT_COUNT=$((WAIT_COUNT + 1))
    ELAPSED=$((WAIT_COUNT))

    if [ $WAIT_COUNT -eq 10 ]; then
        echo "  ${ELAPSED}s elapsed..."
    elif [ $WAIT_COUNT -eq 30 ]; then
        echo "  ${ELAPSED}s elapsed..."
    elif [ $WAIT_COUNT -eq 60 ]; then
        echo "  ${ELAPSED}s elapsed (graceful_shutdown_duration reached)..."
    elif [ $WAIT_COUNT -eq 90 ]; then
        echo "  ${ELAPSED}s elapsed - Force killing Vector..."
        kill -9 ${VECTOR_PID}
        break
    fi
done

END_TIME=$(date +%s)
TOTAL_TIME=$((END_TIME - START_TIME))
if ! kill -0 ${VECTOR_PID} 2>/dev/null; then
    echo "✓ Vector has shut down (${TOTAL_TIME} seconds)"
fi

# Display results
echo ""
echo "========================================"
echo "Results for ${BUFFER_TYPE} buffer:"
echo "========================================"
echo ""

echo ""
echo "Shutdown time: ${TOTAL_TIME} seconds"
echo ""
echo "Full log: ${LOG_FILE}"
echo "View with: less ${LOG_FILE}"
echo ""
test-shutdown-memory.yaml
# Vector Issue #19600 Comparison Test - Memory Buffer Configuration
# This config shows the expected behavior with memory buffers (works correctly)
sources:
  file_source:
    type: file
    include:
      - ./test.log
sinks:
  s3:
    type: aws_s3
    inputs:
      - file_source
    bucket: test-bucket
    region: us-east-1
    compression: gzip
    encoding:
      codec: json
    key_prefix: "vector-test/shutdown-test-memory/"
    batch:
      max_events: 100000
      timeout_secs: 300
    buffer:
    - type: memory
      max_events: 10000

test-shutdown-disk.yaml
# Vector Issue #19600 Reproduction Test - Disk Buffer Configuration
# This config reproduces the graceful shutdown hang issue with disk buffers
sources:
  file_source:
    type: file
    include:
      - ./test.log
sinks:
  s3:
    type: aws_s3
    inputs:
      - file_source
    bucket: test-bucket
    region: us-east-1
    compression: gzip
    encoding:
      codec: json
    key_prefix: vector-test/shutdown-test-disk/
    batch:
      max_events: 100000
      timeout_secs: 300
    buffer:
    - type: disk
      max_size: 268435488

okazaki-kk avatar Oct 28 '25 07:10 okazaki-kk