
CloudWatch logs get_log_events infinite loop with nextForwardToken for expired log groups

Open dxdc opened this issue 4 months ago • 10 comments

Describe the bug

Description

I'm experiencing the same infinite loop issue with nextForwardToken described in #4472, but I've identified a specific case that triggers this behavior: querying log groups with expired logs (logs older than the retention period).

Root Cause Identified

This infinite loop occurs when:

  1. Log group has a retention period set (e.g., 1 year)
  2. Querying logs that are older than the retention period (expired logs)
  3. get_log_events continues returning new nextForwardToken values indefinitely
  4. No events are returned in the response, but the API keeps providing new tokens

Comparison with AWS CLI

The AWS CLI handles this scenario correctly:

  • aws logs get-log-events for expired logs returns no events and terminates properly
  • Does not get stuck in infinite pagination loops

This suggests the issue may lie in how boto3/botocore handles the pagination logic compared to the AWS CLI implementation.

Impact

  • Applications get stuck in infinite loops when processing old log streams
  • Unnecessary API calls and costs
  • Resource consumption (CPU, memory) from endless pagination
  • Difficult to implement robust log processing without workarounds

Workaround

My current workaround is to implement additional checks around the pagination loop:

consecutive_empty_responses = 0
max_empty_responses = 3  # Adjust based on your needs

while next_token != response.get("nextForwardToken", ""):
    # ... existing pagination code (see Reproduction Steps below),
    # including the get_log_events call that refreshes `response` ...
    
    if not response["events"]:
        consecutive_empty_responses += 1
        if consecutive_empty_responses >= max_empty_responses:
            print("Breaking due to consecutive empty responses (likely expired logs)")
            break
    else:
        consecutive_empty_responses = 0  # Reset counter
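
For completeness, here is a self-contained sketch of that workaround folded into the pagination loop from the Reproduction Steps below; the log group and stream names are placeholders, and the empty-response threshold is arbitrary:

import boto3

client = boto3.client('logs', region_name='us-east-1')

params = {
    'logGroupName': '/your/log/group',      # placeholder
    'logStreamName': 'expired-log-stream',  # placeholder
    'startFromHead': True,
    'limit': 1000,
}

events = []
response = {}
next_token = None
consecutive_empty_responses = 0
max_empty_responses = 3  # arbitrary threshold; tune for your workload

while next_token != response.get("nextForwardToken", ""):
    next_token = response.get("nextForwardToken")
    if next_token:
        params["nextToken"] = next_token

    response = client.get_log_events(**params)

    if response["events"]:
        events.extend(response["events"])
        consecutive_empty_responses = 0
    else:
        consecutive_empty_responses += 1
        if consecutive_empty_responses >= max_empty_responses:
            # Several empty pages in a row: likely the expired/empty tail
            # of the stream, so stop paginating.
            break

print(f"Collected {len(events)} events")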

This issue affects production systems that process historical logs and need reliable pagination behavior. A fix or clear documentation would be greatly appreciated.

Regression Issue

  • [ ] Select this option if this issue appears to be a regression.

Expected Behavior

According to the AWS CloudWatch Logs documentation for GetLogEvents, reaching the end of the stream returns the same token you passed in. In other words, when no more events are available, the API should either:

  • Stop returning new nextForwardToken values, OR
  • Return the same nextForwardToken value (the one you passed in) to indicate pagination is complete

Expected (similar to AWS CLI):

Iteration: 1 - Next token: None
No events found in iteration 1
Total iterations: 1
Total events: 0
Duration: 0.15 seconds
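
For reference, here is a minimal sketch of a loop written against that documented contract (reusing the client and params from the Reproduction Steps below); in the buggy case described next, the final token comparison simply never matches:

response = client.get_log_events(**params)
events = list(response["events"])

while True:
    token = response["nextForwardToken"]
    response = client.get_log_events(nextToken=token, **params)
    events.extend(response["events"])
    # Per the documentation, reaching the end of the stream is signaled by
    # the service returning the same token that was passed in.
    if response["nextForwardToken"] == token:
        break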

Current Behavior

When querying expired logs (older than retention period):

  • get_log_events returns {'events': [], 'nextForwardToken': 'new_token_value'}
  • Each subsequent call returns a different nextForwardToken
  • This creates an infinite loop, because the loop condition next_token != response.get("nextForwardToken", "") never becomes false
  • No events are ever returned, but pagination continues indefinitely

Observed:
Iteration: 1 - Next token: None
No events found in iteration 1
Iteration: 2 - Next token: f/38843677690954359307332779238523611073553355858145312768/s
No events found in iteration 2
Iteration: 3 - Next token: f/38843871704337782941158364500062913595036750815702024192/s
No events found in iteration 3
... (continues indefinitely)

Reproduction Steps

  1. Create a CloudWatch log group with 1-year retention
  2. Wait for logs to expire (or use existing expired logs)
  3. Attempt to query expired logs using get_log_events with pagination:
import boto3
import time

client = boto3.client('logs', region_name='us-east-1')

params = {
    'logGroupName': '/your/log/group',
    'logStreamName': 'expired-log-stream',
    'startFromHead': True,
    'limit': 1000
}

events = []
response = {}
next_token = None
count = 0
start_time = time.time()

# This loop will run indefinitely for expired logs
while next_token != response.get("nextForwardToken", ""):
    count += 1
    next_token = response.get("nextForwardToken")
    
    print(f"Iteration: {count} - Next token: {next_token}")
    
    if next_token:
        params["nextToken"] = next_token
    
    response = client.get_log_events(**params)
    
    if response["events"]:
        print(f"Events found: {len(response['events'])}")
        events.extend(response["events"])
    else:
        print(f"No events found in iteration {count}")
    
    # Safety break for demonstration (remove to see infinite loop)
    if count > 20:
        print("Breaking to prevent infinite loop...")
        break

print(f"Total iterations: {count}")
print(f"Total events: {len(events)}")
print(f"Duration: {time.time() - start_time:.2f} seconds")

Possible Solution

  1. Fix in boto3/botocore: Modify pagination logic to detect when expired logs are being queried and handle accordingly
  2. AWS API fix: Address this at the CloudWatch Logs API level to match AWS CLI behavior
  3. Documentation update: Clearly document this edge case and provide recommended workarounds
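
As an interim alternative (a sketch, not an official recommendation), the same events can often be read via filter_log_events, which has a botocore paginator that handles nextToken termination itself; the log group and stream names below are placeholders:

import boto3

client = boto3.client('logs', region_name='us-east-1')

# The FilterLogEvents paginator stops on its own when the service
# reports no further token, sidestepping the get_log_events loop.
paginator = client.get_paginator('filter_log_events')
page_iterator = paginator.paginate(
    logGroupName='/your/log/group',          # placeholder
    logStreamNames=['expired-log-stream'],   # placeholder
)

events = []
for page in page_iterator:
    events.extend(page['events'])

print(f"Collected {len(events)} events")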

Additional Information/Context

Related Issues

  • Original issue: #4472
  • Similar pagination issues: #3791, #2411

Questions

  1. Is this considered a bug in the AWS CloudWatch Logs API itself?
  2. Should boto3 implement special handling for this scenario?
  3. Are there plans to align boto3 behavior with AWS CLI for this case?

SDK version used

boto3==1.40.4

Environment details (OS name and version, etc.)

macOS 15.6

dxdc avatar Aug 13 '25 14:08 dxdc

Hi @dxdc, thanks for reaching out. I've been trying to reproduce this issue with your exact code, but with no luck so far. A couple of questions to help me investigate:

  1. Does this happen to log groups that contain both unexpired and expired logs?
  2. Could you provide debug logs of this behavior? You can get debug logs by adding boto3.set_stream_logger('') to the top of your script and redacting any sensitive information.
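
For reference, enabling debug logging looks like this (redact anything sensitive before attaching the output):

import boto3

boto3.set_stream_logger('')  # root logger at DEBUG, including botocore request/response details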

RyanFitzSimmonsAK avatar Sep 09 '25 18:09 RyanFitzSimmonsAK

thanks for looking into this!

  1. I haven't tested with purely expired logs, so this is for a mix of expired and unexpired. My logs are set to expire after 1 year, but I'm not sure why that would necessarily be relevant.

  2. debug_logs.txt

One other thought: these CloudWatch logs are encrypted with a KMS key. Not sure if that's relevant or not.

dxdc avatar Sep 09 '25 19:09 dxdc

Thanks for providing that. I noticed a note on the documentation for this operation that seems to describe the behavior you're seeing.

If you set startFromHead to true and you don’t include endTime in your request, you can end up in a situation where the pagination doesn't terminate. This can happen when the new log events are being added to the target log streams faster than they are being read. This situation is a good use case for the CloudWatch Logs Live Tail feature.

Does adding endTime resolve the issue?

RyanFitzSimmonsAK avatar Sep 09 '25 20:09 RyanFitzSimmonsAK

Good thought. I tried that previously, but it didn't have any effect. Code snippet below. This issue also only appears with expired logs, so I don't even think the endTime param matters here.

now = time.time()
one_day = 24 * 60 * 60
today_midnight = int(now // one_day * one_day)  # midnight (UTC) of the current day, in epoch seconds

params = {
    'logGroupName': '/your/log/group',
    'logStreamName': 'expired-log-stream',
    'startFromHead': True,
    'endTime': today_midnight * 1000,  # GetLogEvents expects milliseconds
    'limit': 1000,
}

dxdc avatar Sep 09 '25 20:09 dxdc

Gotcha. In that case,

  1. Could you provide AWS CLI debug logs of it working as intended? I'd like to compare them to the Boto3 logs to see how the requests differ.
  2. How old are these log streams? Could you test if this behavior appears on relatively new log streams?
  3. Could you test with a log stream containing purely expired logs?

RyanFitzSimmonsAK avatar Sep 09 '25 20:09 RyanFitzSimmonsAK

Discovered something very interesting!

  • log e71fdadffa3c45ba901d59dbb52a56f0 is from today, and works normally.
  • log d4c101fd57a64fb7815c66b647faaef4 is from 2-3 weeks ago, and produces events, but still results in an infinite loop.

I will share debug logs shortly.

For log stream: {'logGroupName': '/aws/ecs/**********', 'startFromHead': True, 'limit': 1000, 'logStreamName': '**********/d4c101fd57a64fb7815c66b647faaef4'}
Iteration: 1 - Next token: None
Events found: 1000
Iteration: 2 - Next token: f/39149746037527408698891181399229890635442266865350804419/s
Events found: 980
Iteration: 3 - Next token: f/39149746060184965820598294511030180400453002155426908055/s
No events found in iteration 3
Iteration: 4 - Next token: f/39149937198745874791921132157236531283045249694428004351/s
No events found in iteration 4
Iteration: 5 - Next token: f/39150719954902343216793400060947901240534109607029604351/s
No events found in iteration 5
Iteration: 6 - Next token: f/39150980873621166025084156028851691226363729577896804351/s
No events found in iteration 6
Iteration: 7 - Next token: f/39151208341222191037440199693178072239651090578140004351/s
No events found in iteration 7
Iteration: 8 - Next token: f/39151409047928977813048473514642526074904644401884004351/s
No events found in iteration 8
Iteration: 9 - Next token: f/39151616444859324147843689796822461704666650019752804351/s
No events found in iteration 9
Iteration: 10 - Next token: f/39151817151566110923451963618286915539920203843496804351/s
No events found in iteration 10
Iteration: 11 - Next token: f/39151904124472385192882215607588178868530077167119204351/s
No events found in iteration 11
Iteration: 12 - Next token: f/39151930885366623429629985450450106046563884343618404351/s
No events found in iteration 12
Iteration: 13 - Next token: f/39151957646260861666377755293312033224597691520117604351/s
No events found in iteration 13
Iteration: 14 - Next token: f/39151984407155099903125525136173960402631498696616804351/s
No events found in iteration 14
Iteration: 15 - Next token: f/39152004477825778580686352518320405786156854078991204351/s
No events found in iteration 15
Iteration: 16 - Next token: f/39152031238720016817434122361182332964190661255490404351/s
No events found in iteration 16
Iteration: 17 - Next token: f/39152613288169698466698116443429249086425967344348004351/s
No events found in iteration 17
Iteration: 18 - Next token: f/39152640049063936703445886286291176264459774520847204351/s
No events found in iteration 18
Iteration: 19 - Next token: f/39152666809958174940193656129153103442493581697346404351/s
No events found in iteration 19
Iteration: 20 - Next token: f/39152686880628853617754483511299548826018937079720804351/s
No events found in iteration 20
Iteration: 21 - Next token: f/39152860826441402156614987489902075483238683726965604351/s
No events found in iteration 21
Breaking to prevent infinite loop...
Total iterations: 21
Total events: 1980
Duration: 21.03 seconds


For log stream: {'logGroupName': '/aws/ecs/**********', 'startFromHead': True, 'limit': 1000, 'logStreamName': '**********/e71fdadffa3c45ba901d59dbb52a56f0'}
Iteration: 1 - Next token: None
Events found: 1000
Iteration: 2 - Next token: f/39192458539935015883765240407607258618431645714316985243/s
Events found: 1000
Iteration: 3 - Next token: f/39192458551754410838986470672621189302935277312486606723/s
Events found: 106
Iteration: 4 - Next token: f/39192459054792320282241736906419175090612896779835342854/s
No events found in iteration 4
Iteration: 5 - Next token: f/39192459054803470652716374750805494625846529345040908287/s
No events found in iteration 5
Total iterations: 5
Total events: 2106
Duration: 1.12 seconds

dxdc avatar Sep 09 '25 21:09 dxdc

Happy to share the modified code if you want, but I just tweaked it to process multiple log streams at the same time.

aws logs get-log-events \
  --log-group-name "/aws/ecs/*****" \
  --log-stream-name "*******/e71fdadffa3c45ba901d59dbb52a56f0" \
  --start-from-head \
  --query 'events[].{ts:timestamp}' \
  --output json \
  --no-cli-pager \
  --debug

dxdc avatar Sep 09 '25 21:09 dxdc

That's really strange. So you have three different log streams with three different behaviors.

  • The first log stream (which you said DOES contain unexpired events?) never returned any, and went immediately into an infinite loop.
  • The second log stream (created 2-3 weeks ago) returned events, and then went into an infinite loop.
  • The third log stream (created today) worked as intended.

Is that all correct? Since the second and third log streams were less than a year old, do they contain expired events?

At this point, I am fairly convinced that the SDK isn't at fault here, since really the issue is with a new token being returned by the service each time. So, I'm going to cut the service team a ticket for their insights. For the sake of thoroughness, could you check what happens if you remove limit from the request?

RyanFitzSimmonsAK avatar Sep 09 '25 22:09 RyanFitzSimmonsAK

Is that all correct?

For second and third, yes.

The first log stream (which you said DOES contain unexpired events?) never returned any, and went immediately into an infinite loop.

The expired log stream never returns any events and goes into an infinite loop. There are essentially three behaviors I've been able to reproduce:

  • expired logs (0 events) --> infinite loop
  • 2-3 week old log (not expired, ~2000 events) --> infinite loop
  • recent log from today (not expired, ~2000 events) --> correct behavior

As I mentioned, this log is also encrypted with a KMS key; not sure if that's pertinent.

Since the second and third log streams were less than a year old, do they contain expired events?

No. The events within the log are all within a 5-minute window. Each log stream essentially corresponds to a single ECS run, so it's quite discrete.

For the sake of thoroughness, could you check what happens if you remove limit from the request?

Tried that too. No change.

At this point, I am fairly convinced that the SDK isn't at fault here, since really the issue is with a new token being returned by the service each time.

Thank you for following up on this. I'm wondering if the AWS CLI doesn't show this because it handles the no-events case differently.

dxdc avatar Sep 09 '25 22:09 dxdc

Gotcha, thank you for that. I've reached out to the service team for their insights.

Ticket # for internal use : P301022916

RyanFitzSimmonsAK avatar Sep 12 '25 21:09 RyanFitzSimmonsAK