azure-functions-host icon indicating copy to clipboard operation
azure-functions-host copied to clipboard

Redesign how we log worker process errors

Open liliankasem opened this issue 2 years ago • 17 comments

In the current implementation of how we handle worker process errors, we log a message as an error based on if the message contains the words "error", "fail" or "severe", this is fragile and does not cover all cases on when an issue has occurred i.e. exceptions etc.

# Method: ParseErrorMessageAndLog
 if ((msg.IndexOf("error", StringComparison.OrdinalIgnoreCase) > -1) ||
     (msg.IndexOf("fail", StringComparison.OrdinalIgnoreCase) > -1) ||
     (msg.IndexOf("severe", StringComparison.OrdinalIgnoreCase) > -1))
{
    LogError(msg);
}

Link to code

It's worth noting that this method (ParseErrorMessageAndLog) is only used by the error callback:

private void OnErrorDataReceived(object sender, DataReceivedEventArgs e)
{
    if (e.Data != null)
    {
        ParseErrorMessageAndLog(e.Data);
    }
}

Maybe it is safe to assume that all messages coming from the worker through the error event listener Process.ErrorDataReceived += (sender, e) => OnErrorDataReceived(sender, e); should be logged as errors, or at least warnings?

Following tasks were identified as next steps to resolve this workitem

  • [ ] Function host: Review the option of exposing UseStdErrorStreamForErrorsOnly setting to users under a more descriptive app setting value
  • [ ] Extension Authors: Look into the possibility of intercepting the stdError stream, identify log level through language specific constructs and pass that info via a log prefix.
    • [ ] Investigation task on dotnet worker (#1116)
  • [ ] Identify what logs we are missing by introducing worker faults at different points of execution

liliankasem avatar Mar 30 '22 00:03 liliankasem

@pragnagopa do you have context on why we're not treating all console logging written to std error as error messages? I recall some workers logging everything that way. Was that really the case?

fabiocav avatar Apr 20 '22 20:04 fabiocav

I recall some workers logging everything that way.

That's right - Java writes to stderr stream for all msgs. Would be good to expose this option in worker.config as an option for the worker to pass if host can consider stderr stream as error level logs

pragnagopa avatar Apr 20 '22 20:04 pragnagopa

I just realized - we did add the capability - https://github.com/Azure/azure-functions-host/blob/dev/src/WebJobs.Script/Workers/Rpc/RpcWorkerDescription.cs#L82 - so worker should be able to specify if stderr stream should be considered as error logs only - @liliankasem / @fabiocav - please give that a try

pragnagopa avatar Apr 20 '22 20:04 pragnagopa

Ah, that's great! We'll validate and close this issue once we do. We'll work with the language worker owners to update this where appropriate.

fabiocav avatar Apr 20 '22 21:04 fabiocav

It doesn't look like any of the language workers are utilizing this in their worker.config files. The default value for UseStdErrorStreamForErrorsOnly is false, meaning logs from stderr are written with info level by default unless the message contains the keywords "error" , "severe" or "fail", then the logs are written with error level.

@pragnagopa from my understanding, Java is the only language that writes to the stderr stream for all msgs and so only Java should have UseStdErrorStreamForErrorsOnly == false and it should be true for all other languages?

cc: @ejizba, @Francisco-Gamino, @kaibocai, @kshyju, @vrdmr - FYI, this might be impacting how errors from workers are logged and depending on logging behaviour, you maybe need to update your worker.config file

Example

Current behaviour -> an error in the process logs as info

info

Behaviour with UseStdErrorStreamForErrorsOnly set to true

error

liliankasem avatar Jun 13 '22 18:06 liliankasem

@pragnagopa from my understanding, Java is the only language that writes to the stderr stream for all msgs and so only Java should have UseStdErrorStreamForErrorsOnly == false and it should be true for all other languages?

Hi @liliankasem , java is not writing all msgs to stderr stream actually, we are check the level the of the msg, if it's smaller than or equal to INFO then we write to stdout stream, else we write them to stderr stream. Please find the logic here

kaibocai avatar Jun 13 '22 19:06 kaibocai

That's great! If noone is writing all msgs to stderr stream then we can just switch the default to "true" and noone needs to make any changes. I'll wait to hear from other language owners to confirm if other's might be using stderr for all messages

liliankasem avatar Jun 13 '22 19:06 liliankasem

We rarely use the stdout/stderr streams in the first place - we prioritize sending logs over the rpc channel. That being said, users can write directly to stdout/stderr themselves, so I would be hesitant to change this behavior (worker logs should be prefixed with "LanguageWorkerConsoleLog" and user logs are not).

Fwiw, I think the existing default is the "correct" behavior, although perhaps not the most intuitive behavior. Stderr should be assumed to be any diagnostic data, not just error data. I would advocate for improving the error detection (add exception to the list?) and somehow pressuring both users & workers to send proper rpc messages instead where they have to explicitly choose the log level

ejizba avatar Jun 13 '22 21:06 ejizba

I agree that we should encourage the right behaviour around logging these issue + improved error detection, but I also think that the right approach here is to log stderr as error logs, it is after all in the name; diagnostic events for me fall under stdout. I'd like to encourage folks to use these streams appropriately so that we can make sure failures are not logged as info like in the example I gave above.

I think you make a great point about users being able to log to stderr/out. This does complicate things a bit and might make this a breaking change. We could potentially make this a worker capability, rather than a worker.config property as it is now, and allow users to opt-in/out of this since we don't know the impact it would have.

@fabiocav thoughts? Do we want to punt this out to a major version release or address it behind a feature flag?

liliankasem avatar Jun 14 '22 00:06 liliankasem

A few references pointing to stderr being used for diagnostic data, although I agree the name is terrible:

  • https://stackoverflow.com/questions/3385201/confused-about-stdin-stdout-and-stderr
  • https://pubs.opengroup.org/onlinepubs/9699919799/functions/stderr.html

This came up a lot when I worked on VS Code tooling that referenced a cli (usually core tools). Here is theoretical output for the command func --version:

WARNING: You are on an old version of core tools. Please upgrade now %%Cool Functions lightning bolt Logo%% INFO: We are collecting telemetry. Use a setting to turn it off 1.0.0

It's really important that "1.0.0" is the only output that goes to "stdout" for anyone parsing the result of func --version (aka our VS Code extension). That means all the other stuff has to go to "stderr" even though not all of it is an error. This situation is the main reason that many cli's use "stderr" for diagnostic data because they want to provide parse-able output and diagnostic data and you need two different streams to do that. I understand the host doesn't fit neatly into this "parseable output for a cli" scenario, but just explaining why I think using stderr for diagnostic data is a lot more common that you'd think based on the name.

ejizba avatar Jun 14 '22 01:06 ejizba

Ahh that is very interesting and makes sense, thank you for sharing!

liliankasem avatar Jun 14 '22 01:06 liliankasem

Have spoken to several language owners on this topic and it appears that this will be beneficial in catching issues that have been missed in our kusto logs. Will need to have a further discussion on usage of RpcLog vs stderr/stdout. We also discussed changing this to be a capability instead of a worker.config setting.

@fabiocav if we feel this is a breaking change, we can punt this down the road, otherwise we can push to next sprint to start making changes.

liliankasem avatar Jun 22 '22 18:06 liliankasem

Sounds great! Moving to sprint 126 to start executing on the proposed design

fabiocav avatar Jun 22 '22 20:06 fabiocav

This is assigned to sprint 126 as a stretch goal item.

fabiocav avatar Jul 06 '22 20:07 fabiocav

@soninaren has a design proposal and is scheduling time for the review. Will share more information here once done and create the associated tasks/issues to track the implementation.

fabiocav avatar Aug 03 '22 20:08 fabiocav

Adding the action items identified:

  • Function host: Review the option of exposing UseStdErrorStreamForErrorsOnly setting to users under a more descriptive app setting value
  • Extension Authors: Look into the possibility of intercepting the stdError stream, identify log level through language specific constructs and pass that info via a log prefix.

soninaren avatar Oct 05 '22 20:10 soninaren

Making this an epic and adding tracking items and tasks for next steps

soninaren avatar Oct 12 '22 19:10 soninaren