
improved logging in workspace

Open elasticdotventures opened this issue 9 months ago • 0 comments

I've reported this to AWS support as well.

As far as I can tell, none of the log messages from the AWS-managed Cortex are useful. Log messages should give a hint about the context of an error and help diagnose issues or unexpected behavior. If they fail to do that, for whatever reason, they have no reason to exist and are just noise.

Logs should provide clarity and not require the administrator to guess. We have literally dozens of routes and hundreds of rules, and troubleshooting them is a huge issue. We use Prometheus pint (a linter) to catch most types of errors.

I feel compelled to remind everybody that any problem on an infrastructure monitoring platform is a P1 priority, because it means alarms can get missed.

{
    "workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
    "message": {
        "log": "MessageAttributes has been removed because of invalid key/value, numberOfRemovedAttributes=1",
        "level": "WARN"
    },
    "component": "alertmanager"
}

Suggestions:

  • Provide the key or value that was invalid, along with the regex it is expected to match.
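
To illustrate the suggestion above, here is a minimal Python sketch of a log record that names the offending attribute. It is not AWS's code: the function name `filter_attributes`, the extra `removed`/`reason`/`expectedPattern` fields, and the key pattern itself are assumptions made for illustration, since the actual validation rule is exactly what the current log line does not reveal.

```python
import re

# ASSUMPTION: illustrative key pattern only; the real rule used by the
# service is not stated anywhere in the existing log message.
ATTRIBUTE_KEY_PATTERN = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*")


def filter_attributes(attributes, workspace_id):
    """Drop invalid attributes and return (kept, log_records).

    Each log record says which key was removed, why, and which
    pattern the key was expected to match.
    """
    kept = {}
    records = []
    for key, value in attributes.items():
        key_ok = ATTRIBUTE_KEY_PATTERN.fullmatch(key) is not None
        if key_ok and value != "":
            kept[key] = value
            continue
        reason = "empty value" if key_ok else "key does not match expected pattern"
        records.append({
            "workspaceId": workspace_id,
            "message": {
                "log": f"MessageAttribute removed: key={key!r}, value={value!r}",
                "reason": reason,
                "expectedPattern": ATTRIBUTE_KEY_PATTERN.pattern,
                "level": "WARN",
            },
            "component": "alertmanager",
        })
    return kept, records


kept, records = filter_attributes(
    {"severity": "page", "bad key!": "x", "team": ""},
    "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
)
```

With a record like this, the administrator can search the alerting rules for the exact label instead of guessing which of hundreds of rules produced it.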

{
    "workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
    "message": {
        "log": "Subject has been modified because it is empty.",
        "level": "WARN"
    },
    "component": "alertmanager"
}

Suggestions:

  • Provide some other context about the message: which group, rule id, or some other meaningful way of narrowing down the possibilities.

{
    "workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
    "message": {
        "log": "Message has been modified because the content was empty.",
        "level": "WARN"
    },
    "component": "alertmanager"
}

Suggestions:

  • I have no idea how or why this happens, but usually (I assume) it's a template failure. You could put the raw version of the message into the output, or include a dump of the template variables, perhaps behind a separate DEBUG mode for logs.

{
    "workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
    "message": {
        "log": "Notify for alerts failed, Invalid parameter: TopicArn",
        "level": "ERROR"
    },
    "component": "alertmanager"
}

Suggestions:

  • First off, I'm not setting this. Second, is it invalid because it's not set, or because of the content of the TopicArn? Nobody at AWS seems to know. Online searches suggest it has something to do with a region issue (which I assume is related to SigV4, but I don't really know or care). The error is obtuse and provides no value.
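
As a sketch of the kind of distinction the error could draw, the Python below separates "not set", "malformed", and "wrong region". The function name `explain_topic_arn` and the `workspace_region` parameter are hypothetical, and whether a region mismatch is really what triggers this error is an unconfirmed guess taken from the searches mentioned above. The only thing assumed about the ARN is the usual SNS shape, `arn:<partition>:sns:<region>:<account-id>:<topic-name>`.

```python
import re

# Usual SNS topic ARN shape; the character class for the topic name
# is a simplification used for this illustration.
TOPIC_ARN_PATTERN = re.compile(
    r"arn:(aws[a-z-]*):sns:([a-z0-9-]+):(\d{12}):([A-Za-z0-9_-]{1,256}(\.fifo)?)"
)


def explain_topic_arn(topic_arn, workspace_region):
    """Return a human-readable reason why a TopicArn might be rejected."""
    if not topic_arn:
        return "TopicArn is not set (empty or missing)"
    match = TOPIC_ARN_PATTERN.fullmatch(topic_arn)
    if match is None:
        return (
            f"TopicArn {topic_arn!r} is malformed; expected "
            "arn:<partition>:sns:<region>:<account-id>:<topic-name>"
        )
    arn_region = match.group(2)
    if arn_region != workspace_region:
        # ASSUMPTION: a region mismatch is only one suspected cause.
        return (
            f"TopicArn {topic_arn!r} is in region {arn_region!r} "
            f"but the request was made for {workspace_region!r}"
        )
    return "TopicArn looks well-formed"
```

Any one of these three messages would tell me where to look; "Invalid parameter: TopicArn" tells me nothing.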

elasticdotventures, Sep 14 '23 00:09