[Breaking change]: In .NET 8, StreamReader emits the Unicode replacement character; .NET 7 did not
Description
When a StreamReader created with the default constructor (UTF-8) reaches the end of a stream that stops in the middle of a multi-byte UTF-8 character (one particular kind of invalid UTF-8 byte sequence), the handling differs between .NET 7 and .NET 8. I wasn't able to find docs mentioning this change.
Repro code:
using System.Runtime.InteropServices;
using System.Text;
using System.Text.Json;
// Two spaces, U+00B7 (2 bytes in UTF-8), two spaces: 6 bytes in total.
var str = "  \u00B7  ";
var bytes = Encoding.UTF8.GetBytes(str);
Console.WriteLine("Framework: " + RuntimeInformation.FrameworkDescription);
// Read every prefix of the byte array; the 3-byte prefix ends in the middle of U+00B7.
for (var i = 1; i <= bytes.Length; i++)
{
    var range = bytes[0..i];
    var readByStreamReader = new StreamReader(new MemoryStream(range)).ReadToEnd();
    Console.WriteLine(JsonSerializer.Serialize(readByStreamReader));
}
Output in .NET 7 (no replacement character emitted):
Framework: .NET 7.0.14
" "
"  "
"  "
"  \u00B7"
"  \u00B7 "
"  \u00B7  "
Output in .NET 8 (replacement character emitted):
Framework: .NET 8.0.0
" "
"  "
"  \uFFFD"
"  \u00B7"
"  \u00B7 "
"  \u00B7  "
Version
.NET 8 GA
Previous behavior
I noticed this on .NET 8 GA. I did not test .NET 8 previews.
New behavior
StreamReader now emits a \uFFFD character (the Unicode replacement character) for the incomplete UTF-8 sequence at the end of the stream. Previously, nothing was emitted for it.
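One way to picture the difference is a minimal sketch using System.Text.Decoder directly. This is not StreamReader's actual implementation, and it assumes (the thread does not confirm this) that the .NET 8 output corresponds to the decoder being flushed at end of stream. The variable names are illustrative only.

```csharp
using System;
using System.Text;

// 0x20 is a space; 0xC2 is the lead byte of U+00B7 with its trail byte (0xB7) missing.
byte[] truncated = { 0x20, 0xC2 };

Decoder decoder = Encoding.UTF8.GetDecoder();
char[] buffer = new char[8];

// flush: false -> the decoder keeps the dangling lead byte as internal state
// and only returns the characters that are complete so far.
int held = decoder.GetChars(truncated, 0, truncated.Length, buffer, 0, flush: false);
string withoutFlush = new string(buffer, 0, held);

// flush: true with no further input -> the leftover byte goes through the
// decoder fallback, which for Encoding.UTF8 is replacement with U+FFFD.
int flushed = decoder.GetChars(Array.Empty<byte>(), 0, 0, buffer, 0, flush: true);
string afterFlush = new string(buffer, 0, flushed);

Console.WriteLine("Chars before flush: " + withoutFlush.Length);
Console.WriteLine("Flush produced U+FFFD: " + (afterFlush == "\uFFFD"));
```

If the silent replacement is a problem for a caller, a stricter option is to construct the reader with new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true), which raises a DecoderFallbackException on invalid input instead of substituting a character. Whether that is appropriate depends on how the application wants to treat malformed data.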
Type of breaking change
- [ ] Binary incompatible: Existing binaries may encounter a breaking change in behavior, such as failure to load or execute, and if so, require recompilation.
- [ ] Source incompatible: When recompiled using the new SDK or component or to target the new runtime, existing source code may require source changes to compile successfully.
- [X] Behavioral change: Existing binaries may behave differently at run time.
Reason for change
Product team can provide details I think.
Recommended action
Document the change.
Feature area
Globalization
Affected APIs
System.IO.StreamReader
@stephentoub Can you take a look at this issue and see if it should be documented as a breaking change?
When I ran this code on my Linux and macOS builds, it looked like users may observe this behavior change differently depending on their platform or globalization implementation, e.g. ICU vs. NLS on Windows.
using System.Globalization;
using System.Runtime.InteropServices;
Console.WriteLine("IcuMode: " + IcuMode());
Console.WriteLine("Framework: " + RuntimeInformation.FrameworkDescription);
Console.WriteLine("Culture EndsWith: " + "Code\uFFFD".EndsWith("Code", StringComparison.CurrentCulture));
static bool IcuMode()
{
    SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
    byte[] bytes = sortVersion.SortId.ToByteArray();
    int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];
    return version != 0 && version == sortVersion.FullVersion;
}
Windows + ICU enabled:
IcuMode: True
Framework: .NET 8.0.0
Culture EndsWith: False
Linux + macOS + Windows ICU enabled:
IcuMode: True
Framework: .NET 8.0.0
Culture EndsWith: False
Windows + NLS (ICU disabled):
IcuMode: False
Framework: .NET 7.0.14
Culture EndsWith: True
To be clear, culture-based handling of the replacement character seems consistent between .NET 7 and .NET 8 (i.e. ICU doesn't ignore the replacement character the way NLS does), but it means that the above behavior change is perhaps more impactful on ICU runtimes than on NLS ones.
I found this with an xUnit Assert.EndsWith assertion. I will just add an Ordinal comparison to the assertion to make it consistent cross-platform.
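As a small illustration of that workaround, the sketch below compares the culture-sensitive and ordinal results for the same string. The variable names are made up for the example, and the culture-sensitive result is printed rather than asserted because it depends on the globalization backend.

```csharp
using System;

string actual = "Code\uFFFD";

// Culture-sensitive: the result depends on the globalization backend.
// In the runs above it was False under ICU and True under NLS.
bool cultureResult = actual.EndsWith("Code", StringComparison.CurrentCulture);

// Ordinal: compares UTF-16 code units, so U+FFFD is never ignored.
bool ordinalResult = actual.EndsWith("Code", StringComparison.Ordinal);

Console.WriteLine("CurrentCulture EndsWith: " + cultureResult);
Console.WriteLine("Ordinal EndsWith: " + ordinalResult); // False on every platform
```

In an xUnit test, the same idea would be passing StringComparison.Ordinal to the Assert.EndsWith overload that accepts a comparison type, so the assertion gives the same answer on ICU and NLS.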
@GrabYourPitchforks, does this look familiar?
Likely caused by https://github.com/dotnet/runtime/pull/69888.