UI: Redact sensitive info in local log files
Description
Adds regular expression replacements to obs-app's do_log function to redact sensitive or personally identifiable information from strings stored in local log files.
Expressions are stored in log-redaction.hpp, and redaction is performed in log-redaction.cpp. The current redactions performed include cross-platform usernames, rtmps:// URLs past the domain, srt:// URL parameters, and obs-websocket connection addresses.
Currently, all redactions are performed on all logged strings.
Sensitive info will still be present in stdout for any rare case where it's necessary.
Motivation and Context
Log files are often used when debugging issues with OBS. For the benefit of user privacy, OBS should redact sensitive or personally identifiable information from logs whenever possible. Redaction is performed when uploading logs to the OBS website, for example. This PR extends redaction to local log files, which can often be shared directly when debugging certain issues.
Discussion
Performing seven regex replacement calls on all logged strings is somewhat heavy handed, and certainly has not been optimized. In my testing, OBS's startup time and performance was not impacted meaningfully, but more testing should be performed, especially for setups that are performing heavy amounts of logging. It seems like ideally, logging would not take place on the main thread if we are doing these sorts of regular expression replacements.
Initially, I looked at doing redaction at the level of logging calls, but there seemed to be a huge number of places within OBS where logging is performed with file paths or other sensitive parameters. A centralized solution seemed much cleaner and easier to maintain. Another advantage of performing redaction here is that any logging done by plugins should be caught.
The regular expressions used have been adapted from PHP regexes used on the OBS Project website. They have been tested, but regular expressions are not my forté, and more eyes on them would definitely be appreciated.
Performing redaction all at once, for example on quit, was also considered. However, logs are often shared from crashes or other cases where OBS may still be open. It seems preferable to simply never write the sensitive information in the first place.
How Has This Been Tested?
Tested locally on macOS 14.1.2 on Apple Silicon. Each regex was tested with source names that fit the parameters being redacted (sample SRT or RTMPS urls, Windows-style paths including usernames, etc.) to make sure that parameters were redacted correctly.
Types of changes
- New feature (non-breaking change which adds functionality)
Checklist:
- [x] My code has been run through clang-format.
- [x] I have read the contributing document.
- [x] My code is not on the master branch.
- [x] The code has been tested.
- [x] All commit messages are properly formatted and commits squashed where appropriate.
- [x] I have included updates to all appropriate documentation.
I would like to see some performance tests to make sure this isn't going to cause issues. Since this is run before the too_many_repeated_entries check, a spamming ffmpeg device for example would cause a lot of string constructions, regex tests and strncpy output padding.
It looked like I could safely move the redaction block past the repeated entries check and the repeated entries check would still work, so I've done so; hence the "spamming device" case is at least alleviated. Logs can still be spammed without strictly repeated entries, so I'll see about preparing some performance tests on Mac. Would definitely welcome testing on other platforms as well.
I did some stress testing and here are the results.
I created a browser source that emitted non-repeating console.error() messages between 100 and 700 characters in length, which each contained one or two sequences near the middle that would require redaction.
Called every 200ms, the weight of the logging function increased from .1% of CPU time to .5% of CPU time.
Called every 20ms, the weight of the logging function increased from 0.5% of CPU time to 4.7% of CPU time.
So, in its current state, without any optimization, and for what I would call a near-worst case workload, it seems like we're looking at roughly one order of magnitude more CPU use. For normal logging volumes, this is probably fine, but for cases like StreamElements log spam, it's a nontrivial amount, particularly given that logging occurs on the main thread.
I'm against redacting file paths in the local log file. It makes it impossible to check whether the file path is correct or not.
So, my suggestion is OBSBasic::UploadLog.
If the project prefers to change the local log file, I don't think do_log is the proper layer to redact. Instead of do_log, the implementations should be before calling blog, where it is obvious which string is file paths or URLs.
Instead of do_log, the implementations should be before calling blog, where it is obvious which string is file paths or URLs.
The issue here is that there are many places where this is would need to be implemented, leading to code duplication. Additionally it doesn’t cover third-party plugins.
Instead of changing all the log, we may consider another option that adds the redaction at UploadLog in the UI and also adds a UI to select file to upload the server instead of not only current and previous ones.
Instead of changing all the log, we may consider another option that adds the redaction at
UploadLogin the UI and also adds a UI to select file to upload the server instead of not only current and previous ones.
That still does not cover the "long tail" of situations where support channels need to see local log files, such as startup crashes (relatively common of late), hangs, or UI unresponsiveness. I would definitely agree that better log uploading would be beneficial, though.
I'd mainly also like to nail down the exact value of storing the sensitive information in log files for users. Certainly from a developer debugging standpoint, it is necessary to have non-redacted paths and URLs. But stdout and debug out are non-redacted. For user debugging, what is the exact value of a third-party debugger (or even a first party debugger, the user themselves) knowing the username segment of a file path? or RTMPS url parameters? It seems to me like the scope of debuggable issues with all the strings that are currently redacted is very small, and that, if there are such issues, it would be a better model to flag them at runtime and log them separately, e.g. log the statement "Unsupported character encoding in username" (as a hypothetical non-real example). I could also see a runtime option, --no-redact-log or similar, being useful for cases where non-redacted user output would be useful. Such a runtime option would be another reason to handle this at the do_log level.
situations where support channels need to see local log files, such as startup crashes (relatively common of late), hangs, or UI unresponsiveness.
I might suggest having a standalone executable that only redacts the log and uploads, which does not depend DLLs of Qt and other dependencies. This idea would require a change in support flow as well.
I could also see a runtime option, --no-redact-log or similar
I feel the runtime option makes sense but I don't know others might have different opinions.
nitpick: The regex constants can be placed in UI/log-redaction.cpp with a static qualifier instead of UI/log-redaction.hpp because these constants are referenced only in the source file.
Overall, we agree with this change and being something to work towards. We discussed in the UI/UX meeting, and we think there are a few things that need further exploration (not saying they will end up changing, just exploring options):
- Consider maybe moving logging to a separate thread so the potential increase in computation doesn't have ripple effects in high-usage scenarios
- Consider a different method of detecting/parsing the logs that's more efficient than regex parsing, or see if there's a more efficient regex that gets us to the same end result
- We would like for this to be redacted as-written-to-log, and not in some external tool. The entire premise that lead to this PR was users uploading logs manually that were not going through the "expected" workflows. We already have some of this redaction as part of the in-app upload, but not all support happens in our channels, and not all users use our uploader.
- Consider possibly disabling the redactions as part of --verbose logging (or, regrettable to suggest, a new launch flag)
OK, I'll work on some improvements and report back on impact or viability. I think first I'll look at how a separate thread for logging impacts performance, as well as a "first pass" on optimization (looking harder at if we need regex, ifdefs, other obvious things).
I also think a new launch option of --no-redact-log makes sense; I'm somewhat wary of putting it under --verbose. To me 'verbosity of debug info' and 'privacy of debug info' shouldn't be entwined, ideally.