powertools-lambda-python

Feature request: Buffer debug logs and only emit on exception

Open hiselitelordship opened this issue 1 year ago • 7 comments

Use case

CloudWatch Logs are vital, but expensive. To control costs you switch the log level from DEBUG to INFO. When an exception happens you have very little context to debug what happened. You switch back to DEBUG level and hope(!) the error happens again.

Solution/User Experience

By configuring a buffer, debug logs would not be written to CloudWatch Logs unless an exception is encountered. In this way, CWL is low cost and minimalist. But, like a stack trace, when an exception is encountered the buffer is flushed to CWL and all the context of the preceding events is captured.

The buffer should not create memory pressure and itself become the cause of exceptions. Using something like a ring buffer, the oldest entries would be lost in favour of the freshest. There is also likely a threshold at which a flush to CWL could take too long as an execution approaches the 900s execution limit. Therefore the memory allocation or number of buffered lines should be configurable.
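One way to picture the buffer is a fixed-size deque that silently evicts its oldest entry when full. A minimal sketch, assuming a hypothetical line-count limit (all names here are illustrative, not a proposed API):

from collections import deque

class DebugLogBuffer:
    def __init__(self, max_lines: int = 1000):
        # deque with maxlen drops the oldest entry once the buffer is full
        self._buffer: deque[str] = deque(maxlen=max_lines)

    def append(self, line: str) -> None:
        self._buffer.append(line)

    def flush(self) -> list[str]:
        # drain the buffer, oldest surviving entries first
        lines = list(self._buffer)
        self._buffer.clear()
        return lines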

Alternative solutions

No response

Acknowledgment

hiselitelordship avatar May 31 '24 19:05 hiselitelordship

Thanks for opening your first issue here! We'll come back to you as soon as we can. In the meantime, check out the #python channel on our Powertools for AWS Lambda Discord: Invite link

boring-cyborg[bot] avatar May 31 '24 19:05 boring-cyborg[bot]

(haven't forgotten, adjusting to jet lag from ANZ as I'm back today) - this is a great idea and we were waiting for customer demand to consider it.

I'll share thoughts and questions tomorrow

heitorlessa avatar Jun 03 '24 15:06 heitorlessa

unless an exception is encountered

This is the tricky part. As of now, I can think of these scenarios:

  • Handled exceptions. Code is within a try/except block and customers could do one of the following to flush the buffer:
    • [explicit] call logger.hypothetical_flush_debug_buffer() as part of the except block (see the sketch after this list)
    • [implicit] call logger.exception("something went wrong") that by default captures and formats a stacktrace, but we'd transparently call logger.hypothetical_flush_debug_buffer(). Here be dragons as some customers could love or hate this transparent fashion (e.g., how do I disable this?).
  • Unhandled exceptions. Function crashed due to an unexpected and unhandled exception. Lambda doesn't support an unhandled exception mechanism (#1798), but we could work around it in two ways:
    • If customer sets Logger(log_uncaught_exceptions=True) and uses @logger.inject_lambda_context, we could catch all exceptions, log and flush the debug buffer, and re-raise.
    • We could register a no-op internal extension (separate thread), giving us 500ms for any cleanup upon timeout and such, including background jobs. That however would come at a cost (latency to be measured).
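A rough sketch of how the explicit option could look in user code, reusing the hypothetical method name above (process_order is a stand-in for business logic that may raise):

from aws_lambda_powertools import Logger

logger = Logger(service="payment", level="DEBUG")

def lambda_handler(event, context):
    logger.debug("validating order")  # buffered, not yet written to CWL
    try:
        process_order(event)  # hypothetical business logic that may raise
    except Exception:
        # [explicit] flush everything buffered so far, then log the error
        logger.hypothetical_flush_debug_buffer()
        logger.exception("something went wrong")
        raise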

Did I miss anything?

I'd love to hear your thoughts on both.

heitorlessa avatar Jun 04 '24 15:06 heitorlessa

I imagine the new behaviour would be Logger(debug_on_exception=True). Explicit and backward compatible. Existing code would therefore not buffer and every debug line is published as expected. No surprises for anyone already using Powertools.
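A minimal sketch of that shape, using the flag name proposed above (hypothetical, not a released API):

from aws_lambda_powertools import Logger

# Existing behaviour: no buffering, every debug line is published as today
logger = Logger(service="payment", level="DEBUG")

# Proposed opt-in: debug lines are buffered and only flushed on exception
buffered_logger = Logger(service="payment", level="DEBUG", debug_on_exception=True)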

If I set this new option, I expect the logger.exception to flush out the buffer like a stack trace – that is why I set the option. If I didn’t set the option above, then I don’t want this buffering feature and there will be no stack trace or buffering.

But your point remains. If I want to selectively stop publishing the “stack trace” I can’t.

hiselitelordship avatar Jun 05 '24 16:06 hiselitelordship

I expect the logger.exception to flush out the buffer like a stack trace

Explicit and backward compatible

Yeah, that's the direction we'd go given our tenets. On Monday (17th), I'll summarize actionable points and tests we've made while we wait for customer +1s.

NOTE: we'll implement this across all languages once we have sufficient demand.

heitorlessa avatar Jun 07 '24 16:06 heitorlessa

First, I love the idea of this feature, it's definitely a "have your observability cake and eat it too" moment.

Logger(debug_on_exception=True)

+1, that seems like the best way to implement it to start. An env var that also sets it would additionally be welcome, as users could then turn on debug tracing via click-ops in response to an unforeseen issue, without a code deploy.

If customer sets Logger(log_uncaught_exceptions=True) and uses @logger.inject_lambda_context, we could catch all exceptions, log and flush the debug buffer, and re-raise.

This seems to me like the least surprising behavior, given a function wrapped with @logger.inject_lambda_context. That seems the most logical hardpoint to catch broad exceptions.

As for in-line vs. a separate thread, whew, that's a challenging one. I definitely see the appeal of the Lambda returning as fast as possible on error, but you really don't want your error handling code to itself get swallowed by the void, so you'd want to make especially sure the thread you kick off has a very high chance of actually emitting the error. Unless extension threads are able to keep the Lambda alive long enough to complete the handler, I'd be concerned about the Lambda being scaled in mid-process. But I would think the error packaging would be fast enough that it'd be unlikely to drop (I'm just not knowledgeable enough about Lambda innards to know what keeps one alive).

Perhaps you could define a handler interface with a default stderr implementation and let users inject their own function, so they could override the error rollup behavior, use threads or not, and customize the debug sink (maybe they want to dump to S3 or Datadog or whatever).
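As a sketch of that idea (all names hypothetical, not a proposed API), the sink could simply be a callable that receives the buffered records on flush:

import sys
from typing import Callable, Iterable

# Hypothetical sink interface: invoked with the buffered records on flush
FlushSink = Callable[[Iterable[str]], None]

def default_stderr_sink(records: Iterable[str]) -> None:
    # default implementation: write buffered records to stderr
    for record in records:
        print(record, file=sys.stderr)

def s3_sink(records: Iterable[str]) -> None:
    # user-supplied alternative: dump the flushed buffer to S3 (illustrative only)
    import boto3
    body = "\n".join(records).encode("utf-8")
    boto3.client("s3").put_object(Bucket="my-debug-logs", Key="flush.log", Body=body)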

xkortex avatar Jun 11 '24 14:06 xkortex

@sthulb when you can, would you mind dropping the memory test you did for verbose log records if kept in a circular buffer for this feature?

We're handling some critical operational bits this week, hence my extended delay in responding. We're still committed to this feature and plan to deprecate log sampling (as this is a much better alternative).

heitorlessa avatar Jun 26 '24 08:06 heitorlessa

Hey everyone! We started a RFC draft in this discussion: https://github.com/aws-powertools/powertools-lambda-typescript/discussions/3410

Even though this discussion belongs to the Powertools TypeScript repository, this will be implemented in all runtimes - if the RFC is accepted.

Anyone who wants to participate in the discussion is very welcome. We intend to continue the discussion until mid-January and then decide how the implementation will look.

Thank you

leandrodamascena avatar Dec 30 '24 13:12 leandrodamascena

Hi everyone, I'm here with some updates on this issue. We were discussing the implementation and API design in this discussion and came to some decisions. I'll copy @dreamorosi's original message and change the examples to Python instead of TypeScript.

Thanks for the thoughtful answer.

After some additional discussion with the team, I think we settled on an API that should accommodate all the requirements that we discussed.

Below are the decision points that we can use to guide the implementation:

When do we buffer

With this in mind, I think my final position would be to follow these rules:

  • We use _X_AMZN_TRACE_ID (as you suggested) as the key to group the logs that belong to an invocation and buffer them
  • If you emit logs during the INIT phase (aka outside the scope of an invocation), these logs are never buffered, because _X_AMZN_TRACE_ID is not yet present there

Regarding the second point, the experience would look like this:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config, level="DEBUG")

logger.debug("init complete")  # Not buffered: emitted during the INIT phase

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("start request")  # Buffered

which translates, in your logs, to this:

Request 1

{ "message": "init complete", "level": "DEBUG", ...other keys }

Request 2

// no logs

What do we buffer

By default, when buffering is enabled, we'll buffer only the DEBUG and TRACE log levels. We settled on this default because we want to nudge customers to make better design decisions and emit logs at the correct log level. In principle, diagnostic data that might be superfluous unless there's an error should be emitted at one of these levels, while less verbose levels like INFO and WARN should be used for data that should also be logged during regular operations.

With this in mind, the logger would work like this when buffering is enabled:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")      # buffered
    logger.info("I am an info")       # NOT buffered
    logger.warning("I am a warning")  # NOT buffered
    logger.error("I am an error")     # NOT buffered

We understand, however, that some customers might want to make different tradeoffs, either because they don't fully control the logs or because they want to reduce costs to a minimum and really only emit certain logs, but still have them in a buffer for when things go wrong.

To accommodate this use case, we're also giving the flexibility to change the level at which the logger starts buffering. For example, if I want to buffer up to WARN, I can do so by setting an option in the buffer configuration:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger, buffering everything up to WARN
buffer_config = BufferConfig(max_bytes=10240, buffer_at_verbosity="WARN")
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")      # buffered
    logger.info("I am an info")       # buffered
    logger.warning("I am a warning")  # buffered
    logger.error("I am an error")     # NOT buffered

How do we empty the buffer

When it comes to flushing the log buffer, after a lot of discussion with @leandrodamascena I think we'd like to settle on three APIs:

  1. manual imperative flushing
  2. flush on logger.error() (enabled by default when buffering is active, can be opted-out)
  3. flush on unhandled error when using decorator (enabled by default when buffering is active, can be opted-out)

manual flushing

The first one is pretty straightforward, customers should be able to flush the buffer at any time by calling a method:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered
    logger.flush_buffer()         # everything buffered so far is written out

Flush on error log

When buffering is enabled, we'll flush the buffer whenever customers call logger.error(). This allows customers to get all the data about an invocation when they need it the most, aka when there's an error.

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered

    try:
        function_that_doesnt_exist()  # undefined on purpose: raises NameError
    except Exception:
        logger.error("unable to perform operation")  # buffer is flushed

Customers can also opt-out from this if they want to have tight control over the buffering mechanism:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Opt out of flushing on logger.error()
buffer_config = BufferConfig(max_bytes=10240, flush_on_error_log=False)  # True by default
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug")  # buffered

    try:
        function_that_doesnt_exist()  # undefined on purpose: raises NameError
    except Exception:
        logger.error("unable to perform operation")  # not buffered, written as a normal log

    logger.flush_buffer()

flush on unhandled error

When buffering is enabled, and customers are using our inject_lambda_context decorator, we can also intercept unhandled errors and flush the buffer before re-raising the customer's error.
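A hedged sketch of that flow (the decorator's exact opt-out knob was not settled in this discussion):

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig

buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)

@logger.inject_lambda_context
def lambda_handler(event, context):
    logger.debug("about to call downstream")  # buffered
    raise RuntimeError("boom")  # unhandled: buffer is flushed, then the error is re-raised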

What happens when the buffer is full

As discussed above/in other threads, we'll use a circular buffer structure. This means that, depending on the max_bytes setting and how many logs are actually buffered, some previously buffered logs might be evicted before the buffer is flushed.

To avoid customer confusion, when this happens we'll emit a warning log when the buffer is being flushed. This allows us to respect the customer buffering level while still informing them about the missing logs.

For example:

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig
from aws_lambda_powertools.utilities.typing import LambdaContext

# Initialize the logger with a buffer
buffer_config = BufferConfig(max_bytes=10240)
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

def lambda_handler(event: dict, context: LambdaContext) -> dict:
    logger.debug("I am a debug 1")  # buffered
    logger.debug("I am a debug 2")  # buffered
    logger.debug("I am a debug 3")  # buffer was full, so "I am a debug 1" was evicted
    logger.flush_buffer()

which results in:

{ "message": "I am a debug 2", "level": "DEBUG", ...other keys }
{ "message": "I am a debug 3", "level": "DEBUG", ...other keys }
{ "message": `One or more log entries were evicted from the buffer because it got full. Consider increasing the "max_bytes" setting if you want to retain more logs`, "level": "WARN", ...other keys }

How do you optimize for buffer storage

By default we'll set the buffer to a relatively low size (TBD), and while customers can configure this, certain use cases might require additional optimization. For these cases we'll allow customers to enable buffer compression, which optimizes for space at the cost of CPU overhead (its actual impact will depend on the Lambda function's allocated resources).

from aws_lambda_powertools import Logger
from aws_lambda_powertools.logging.buffer import BufferConfig

# Enable buffer compression
buffer_config = BufferConfig(max_bytes=10240, compress=True)  # False by default
logger = Logger(service="payment", buffer_config=buffer_config)  # INFO is the default log level

When this setting is enabled, we'll use the zlib module from the standard library to compress the logs as they are added to the buffer, and decompress them if/when the buffer is flushed.
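A minimal sketch of that mechanism using only the standard library (helper names are illustrative):

import zlib

def compress_entry(log_line: str) -> bytes:
    # compress each serialized log record as it enters the buffer
    return zlib.compress(log_line.encode("utf-8"))

def decompress_entry(blob: bytes) -> str:
    # decompress on flush, when the record is finally written out
    return zlib.decompress(blob).decode("utf-8")

entry = compress_entry('{"message": "I am a debug", "level": "DEBUG"}')
assert decompress_entry(entry) == '{"message": "I am a debug", "level": "DEBUG"}'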

leandrodamascena avatar Feb 08 '25 13:02 leandrodamascena

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

github-actions[bot] avatar Mar 07 '25 08:03 github-actions[bot]

Thanks everyone for your inputs here. This feature was released today in v3.8.0 and you can check the documentation: https://docs.powertools.aws.dev/lambda/python/latest/core/logger/#buffering-logs.

leandrodamascena avatar Mar 07 '25 20:03 leandrodamascena

This is now released under version 3.9.0!

github-actions[bot] avatar Mar 25 '25 11:03 github-actions[bot]