Xunit doesn't correctly serialize unicode strings in theory data

Open drognanar opened this issue 5 years ago • 3 comments

This is related to https://developercommunity.visualstudio.com/content/problem/696704/visual-studio-2019-1622-test-explorer-and-live-uni.html

Theories that include Unicode characters in the theory data fail when run from VS. This happens because the serialization helper doesn't round-trip Unicode strings correctly.

The example below demonstrates the behaviour. The ClassWithUnicodeCharacters contains the test that fails the serialize/deserialize round trip.

    class ClassWithUnicodeCharacters
    {
        public static IEnumerable<object[]> StringTestData = new[] { new object[] { "\uD800" } };

        [Theory]
        [MemberData("StringTestData")]
        public void Test(string x) { }
    }

I verified the behaviour by adding a new unit test to SerializationTests.cs and building xunit. Here's the unit test that I think should be passing. If the StringTestData contains regular characters such as "str" the test passes.

    [Fact]
    public static void TheoryWithUnicode()
    {
        var sourceProvider = new NullSourceInformationProvider();
        var assemblyInfo = Reflector.Wrap(Assembly.GetExecutingAssembly());
        var discoverer = new XunitTestFrameworkDiscoverer(assemblyInfo, sourceProvider, SpyMessageSink.Create());
        var sink = new TestDiscoverySink();

        discoverer.Find(typeof(ClassWithUnicodeCharacters).FullName, false, sink, TestFrameworkOptions.ForDiscovery());
        sink.Finished.WaitOne();

        var test = sink.TestCases[0];
        var roundTripped = SerializationHelper.Deserialize<ITestCase>(SerializationHelper.Serialize(test));

        Assert.Equal("\uD800", roundTripped.TestMethodArguments[0]);
    }

drognanar, Oct 03 '19 22:10

Are there any workarounds while we wait for a solution?

aluRamb0, Nov 13 '19 10:11

The workaround today would be to disable theory pre-enumeration ([MemberData(..., DisableDiscoveryEnumeration = true)]) for any problematic data.

bradwilson, Nov 13 '19 14:11

I am facing this issue now with InlineData. Oddly, it causes my test to fail on a remote machine but succeed locally. I'm using the é character in a parameter string. This still hasn't been fixed?

vargonian, Dec 19 '21 03:12

This seems like a framework/unicode issue because this test fails:

    [Fact]
    public static void ExampleDoesNotRoundtrip()
    {
        var i = "\ud800";
        var d = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(i));
        Assert.Equal(i, d);
    }

If we run the original test with a data point that does round-trip (e.g. `"\ua800"`), everything works out fine.
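The same behaviour can be reproduced without xunit. The sketch below assumes the serialization step is equivalent to encoding with the default `Encoding.UTF8`, which is what the test above exercises. The class and method names (`Utf8RoundTripDemo`, `RoundTrip`, `Survives`, `EncodeStrict`) are illustrative and not part of xunit.

```csharp
using System;
using System.Text;

// Illustrative helper (not part of xunit): shows what the default UTF-8
// encoding does to a string on a serialize/deserialize-style round trip.
public static class Utf8RoundTripDemo
{
    // Encodes to UTF-8 and decodes again with the default, lenient encoding.
    // Invalid input such as a lone surrogate is replaced with U+FFFD.
    public static string RoundTrip(string value) =>
        Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(value));

    // True when the string comes back unchanged (ordinal comparison).
    public static bool Survives(string value) =>
        string.Equals(value, RoundTrip(value), StringComparison.Ordinal);

    // Strict variant: throws EncoderFallbackException on invalid input
    // instead of silently substituting U+FFFD.
    public static byte[] EncodeStrict(string value) =>
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                         throwOnInvalidBytes: true).GetBytes(value);
}
```

With this helper, `Survives("\ud800")` is false because the lone surrogate comes back as `"\uFFFD"`, while `Survives("\ua800")` and `Survives("é")` are true. `EncodeStrict("\ud800")` throws rather than mangling the data, which can make the problem easier to spot.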

@bradwilson I suggest closing this issue.

koenigst, Jan 24 '23 12:01

There is a related discussion here: https://github.com/xunit/xunit/discussions/2626

Strings that are not legal Unicode will get "mangled" during the serialization process because we convert them to UTF-8. A single D800 is, by itself, not legal Unicode: it is an unpaired high surrogate.
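To find theory data that would be affected, one option is to check strings for unpaired surrogates before using them as theory arguments. This is a minimal sketch; `SurrogateCheck` and `IsWellFormed` are hypothetical names, not xunit APIs.

```csharp
using System;

// Hypothetical helper (not part of xunit): detects strings that contain
// unpaired surrogates and therefore cannot be represented in UTF-8.
public static class SurrogateCheck
{
    // Returns true when every surrogate code unit is part of a valid
    // high + low pair, i.e. the string is well-formed UTF-16.
    public static bool IsWellFormed(string value)
    {
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            if (char.IsHighSurrogate(c))
            {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 >= value.Length || !char.IsLowSurrogate(value[i + 1]))
                    return false;
                i++; // skip the low half of the pair
            }
            else if (char.IsLowSurrogate(c))
            {
                // A low surrogate with no preceding high surrogate is unpaired.
                return false;
            }
        }
        return true;
    }
}
```

Here `IsWellFormed("\uD800")` is false, while `IsWellFormed("str")` and a complete pair such as `IsWellFormed("\uD83D\uDE00")` are true. Data that fails the check is a candidate for one of the workarounds in this thread.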

A second workaround is to use character arrays for non-Unicode data, and then convert them back into non-Unicode strings yourself in the test:

    public static TheoryData<char[]> CharArrayTestData = new() { "\uD800".ToCharArray() };

    [Theory]
    [MemberData(nameof(CharArrayTestData))]
    public void MyTest(char[] data)
    {
        var dataAsString = new string(data);
        // ...
    }

As noted in the discussion, I'm leaning towards "by design" for the existing behavior because of the serialization costs associated with us using character arrays full time (the size of the serialized data roughly doubles, assuming most of your string's characters would fit inside a single 8-bit value in UTF-8, which is true for the majority of Latin-based languages).
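The size difference can be illustrated with a rough comparison. This sketch assumes a character array would be stored as raw 16-bit code units, which is an assumption about the storage format and not a description of xunit's actual serializer; any additional overhead it adds is ignored. `SerializedSizeDemo` and its methods are illustrative names.

```csharp
using System.Text;

// Illustrative only: compares payload sizes under two storage assumptions.
public static class SerializedSizeDemo
{
    // Bytes needed when the string is stored as UTF-8.
    public static int Utf8Bytes(string value) => Encoding.UTF8.GetByteCount(value);

    // Bytes needed when every char is stored as a raw 16-bit code unit
    // (assumed representation for a char array).
    public static int Utf16Bytes(string value) => value.Length * sizeof(char);
}
```

For `"hello"`, `Utf8Bytes` gives 5 and `Utf16Bytes` gives 10, which matches the "roughly doubles" estimate for text that is mostly single-byte in UTF-8.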

bradwilson, Jan 25 '23 01:01

Closing as "by design".

bradwilson, May 04 '23 01:05