azure-functions-host
azure-functions-host copied to clipboard
Redesign how we log worker process errors
In the current implementation of how we handle worker process errors, we log a message as an error
based on if the message contains the words "error", "fail" or "severe", this is fragile and does not cover all cases on when an issue has occurred i.e. exceptions etc.
# Method: ParseErrorMessageAndLog
if ((msg.IndexOf("error", StringComparison.OrdinalIgnoreCase) > -1) ||
(msg.IndexOf("fail", StringComparison.OrdinalIgnoreCase) > -1) ||
(msg.IndexOf("severe", StringComparison.OrdinalIgnoreCase) > -1))
{
LogError(msg);
}
It's worth noting that this method (ParseErrorMessageAndLog
) is only used by the error callback:
private void OnErrorDataReceived(object sender, DataReceivedEventArgs e)
{
if (e.Data != null)
{
ParseErrorMessageAndLog(e.Data);
}
}
Maybe it is safe to assume that all messages coming from the worker through the error event listener Process.ErrorDataReceived += (sender, e) => OnErrorDataReceived(sender, e);
should be logged as errors, or at least warnings?
Following tasks were identified as next steps to resolve this workitem
- [ ] Function host: Review the option of exposing UseStdErrorStreamForErrorsOnly setting to users under a more descriptive app setting value
- [ ] Extension Authors: Look into the possibility of intercepting the stdError stream, identify log level through language specific constructs and pass that info via a log prefix.
- [ ] Investigation task on dotnet worker (#1116)
- [ ] Identify what logs we are missing by introducing worker faults at different points of execution
@pragnagopa do you have context on why we're not treating all console logging written to std error as error messages? I recall some workers logging everything that way. Was that really the case?
I recall some workers logging everything that way.
That's right - Java
writes to stderr stream for all msgs. Would be good to expose this option in worker.config as an option for the worker to pass if host can consider stderr stream as error level logs
I just realized - we did add the capability - https://github.com/Azure/azure-functions-host/blob/dev/src/WebJobs.Script/Workers/Rpc/RpcWorkerDescription.cs#L82 - so worker should be able to specify if stderr stream should be considered as error logs only - @liliankasem / @fabiocav - please give that a try
Ah, that's great! We'll validate and close this issue once we do. We'll work with the language worker owners to update this where appropriate.
It doesn't look like any of the language workers are utilizing this in their worker.config files. The default value for UseStdErrorStreamForErrorsOnly
is false, meaning logs from stderr are written with info level by default unless the message contains the keywords "error" , "severe" or "fail", then the logs are written with error level.
@pragnagopa from my understanding, Java is the only language that writes to the stderr stream for all msgs and so only Java should have UseStdErrorStreamForErrorsOnly == false
and it should be true
for all other languages?
cc: @ejizba, @Francisco-Gamino, @kaibocai, @kshyju, @vrdmr - FYI, this might be impacting how errors from workers are logged and depending on logging behaviour, you maybe need to update your worker.config file
Example
Current behaviour -> an error in the process logs as info
Behaviour with UseStdErrorStreamForErrorsOnly set to true
@pragnagopa from my understanding, Java is the only language that writes to the stderr stream for all msgs and so only Java should have
UseStdErrorStreamForErrorsOnly == false
and it should betrue
for all other languages?
Hi @liliankasem , java is not writing all msgs to stderr stream actually, we are check the level the of the msg, if it's smaller than or equal to INFO then we write to stdout stream, else we write them to stderr stream. Please find the logic here
That's great! If noone is writing all msgs to stderr stream then we can just switch the default to "true" and noone needs to make any changes. I'll wait to hear from other language owners to confirm if other's might be using stderr for all messages
We rarely use the stdout/stderr streams in the first place - we prioritize sending logs over the rpc channel. That being said, users can write directly to stdout/stderr themselves, so I would be hesitant to change this behavior (worker logs should be prefixed with "LanguageWorkerConsoleLog" and user logs are not).
Fwiw, I think the existing default is the "correct" behavior, although perhaps not the most intuitive behavior. Stderr should be assumed to be any diagnostic data, not just error data. I would advocate for improving the error detection (add exception to the list?) and somehow pressuring both users & workers to send proper rpc messages instead where they have to explicitly choose the log level
I agree that we should encourage the right behaviour around logging these issue + improved error detection, but I also think that the right approach here is to log stderr as error logs, it is after all in the name; diagnostic events for me fall under stdout. I'd like to encourage folks to use these streams appropriately so that we can make sure failures are not logged as info like in the example I gave above.
I think you make a great point about users being able to log to stderr/out. This does complicate things a bit and might make this a breaking change. We could potentially make this a worker capability, rather than a worker.config property as it is now, and allow users to opt-in/out of this since we don't know the impact it would have.
@fabiocav thoughts? Do we want to punt this out to a major version release or address it behind a feature flag?
A few references pointing to stderr
being used for diagnostic data, although I agree the name is terrible:
- https://stackoverflow.com/questions/3385201/confused-about-stdin-stdout-and-stderr
- https://pubs.opengroup.org/onlinepubs/9699919799/functions/stderr.html
This came up a lot when I worked on VS Code tooling that referenced a cli (usually core tools). Here is theoretical output for the command func --version
:
WARNING: You are on an old version of core tools. Please upgrade now %%Cool Functions lightning bolt Logo%% INFO: We are collecting telemetry. Use a setting to turn it off 1.0.0
It's really important that "1.0.0" is the only output that goes to "stdout" for anyone parsing the result of func --version
(aka our VS Code extension). That means all the other stuff has to go to "stderr" even though not all of it is an error. This situation is the main reason that many cli's use "stderr" for diagnostic data because they want to provide parse-able output and diagnostic data and you need two different streams to do that. I understand the host doesn't fit neatly into this "parseable output for a cli" scenario, but just explaining why I think using stderr for diagnostic data is a lot more common that you'd think based on the name.
Ahh that is very interesting and makes sense, thank you for sharing!
Have spoken to several language owners on this topic and it appears that this will be beneficial in catching issues that have been missed in our kusto logs. Will need to have a further discussion on usage of RpcLog vs stderr/stdout. We also discussed changing this to be a capability instead of a worker.config setting.
@fabiocav if we feel this is a breaking change, we can punt this down the road, otherwise we can push to next sprint to start making changes.
Sounds great! Moving to sprint 126 to start executing on the proposed design
This is assigned to sprint 126 as a stretch goal item.
@soninaren has a design proposal and is scheduling time for the review. Will share more information here once done and create the associated tasks/issues to track the implementation.
Adding the action items identified:
- Function host: Review the option of exposing UseStdErrorStreamForErrorsOnly setting to users under a more descriptive app setting value
- Extension Authors: Look into the possibility of intercepting the stdError stream, identify log level through language specific constructs and pass that info via a log prefix.
Making this an epic and adding tracking items and tasks for next steps