Champion "utf8 string literals"

Open gafter opened this issue 7 years ago • 58 comments

  • [x] Proposal added
  • [ ] Discussed in LDM
  • [ ] Decision in LDM
  • [ ] Finalized (done, rejected, inactive)
  • [ ] Spec'ed

Proposal: https://github.com/dotnet/csharplang/blob/main/proposals/csharp-11.0/utf8-string-literals.md
Old draft proposal: https://github.com/dotnet/csharplang/issues/2911

Design Review

https://github.com/dotnet/csharplang/blob/main/meetings/2021/LDM-2021-10-27.md#utf-8-string-literals
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-01-26.md#open-questions-in-utf-8-string-literals
https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-06-29.md#utf-8-literal-concatenation-operator

gafter avatar Feb 26 '17 18:02 gafter

UTF-8 encodings come with a lot of issues for non-English languages and developers. So this feature might only be a good thing for English-speaking developers and a bad idea for everybody else.

It might be useful for interop with non-unicode applications, but then I prefer to have an explicit encoding conversion using System.Text.Encoding.

MovGP0 avatar Feb 27 '17 10:02 MovGP0

@MovGP0 I think this is related to https://github.com/dotnet/corefxlab/blob/master/docs/specs/parsing.md . UTF8 strings are very common and if you don't have to convert to UTF16 (== .NET strings) and back again you save memory and CPU.

0xd4d avatar Feb 27 '17 23:02 0xd4d

When UTF-8 string literals are added it would be nice to have UTF-8 version of StringBuilder as well.

wanton7 avatar Jun 02 '17 08:06 wanton7

@wanton7 This may be it: https://github.com/dotnet/corefxlab/tree/master/src/System.Text.Formatting/System/Text/Formatting/Formatters

ufcpp avatar Jun 02 '17 15:06 ufcpp

I am quite curious about how it is going to handle the index operator.

utf8string utf8 = "©α中文𨙧: some regular Chinese and special characters";
utf8char utf8c = utf8[5];

Does it mean the class needs to enumerate and decode bytes into UTF-8 characters internally until it finds the sixth character? Or are you disallowing the index operator on the UTF8String?

Here are the similar questions:

  • Will there be any function like Substring, Trim, EndsWith?
  • Will there be any optimization when we have reversed looping, like this? for (var i= utf8.Length - 1; i>=0; i--) { /* do something */ }

If there is no support or optimization for this, I would say we probably just need syntax sugar for converting a string to a byte array with UTF-8 encoding.
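To make the indexing question concrete, here is a minimal sketch of what "find the sixth character" costs over raw UTF-8 bytes. It is not part of any proposed API: `RuneAt` is a hypothetical helper, and it assumes "character" means a Unicode scalar value, decoded with the existing `System.Text.Rune.DecodeFromUtf8`.

```csharp
using System;
using System.Buffers;
using System.Text;

string text = "©α中文𨙧: abc";
byte[] utf8 = Encoding.UTF8.GetBytes(text);

// UTF-8 view: the scalar at index 5 is only reachable by decoding the five before it.
Rune sixth = RuneAt(utf8, 5);
Console.WriteLine(sixth);

// UTF-16 view: text[5] is O(1), but it is the second half of the surrogate pair for 𨙧.
Console.WriteLine(char.IsLowSurrogate(text[5]));

// Hypothetical helper (an assumption for illustration): linear scan from the start.
static Rune RuneAt(ReadOnlySpan<byte> bytes, int index)
{
    int offset = 0;
    for (int i = 0; ; i++)
    {
        if (offset >= bytes.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        OperationStatus status =
            Rune.DecodeFromUtf8(bytes.Slice(offset), out Rune rune, out int consumed);
        if (status != OperationStatus.Done)
            throw new FormatException("Invalid UTF-8 sequence.");

        if (i == index)
            return rune;
        offset += consumed;
    }
}
```

The scan is linear in the index, so a reverse loop written as `utf8[i]` over such a helper would be quadratic unless it walked backwards over continuation bytes instead.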

sumtec avatar Nov 27 '18 07:11 sumtec

Those questions are better suited for corefx.

yaakov-h avatar Nov 27 '18 07:11 yaakov-h

@sumtec and @MovGP0 .NET Micro Framework always had UTF-8 string implementation only, transparent to the developer. It does support trimming and substrings and indexing, although the reverse looping is not optimized (source). It saved memory. You could, however, have the same arguments around "normal" strings with surrogate pairs.

miloush avatar May 06 '19 12:05 miloush

See https://github.com/dotnet/csharplang/issues/2911 for a minimal specification for this feature.

gafter avatar Oct 25 '19 22:10 gafter

Will there be a type that represents potentially invalid UTF-8 strings, like Linux file paths?

orthoxerox avatar Oct 26 '19 08:10 orthoxerox

I'm not a fan of this approach as it treats utf8 strings as something "other" that then needs to be brought in through a side-channel.

It seems this fundamentally could not be picked up by a library author. i.e. if i have a library and i'm already using System.String (highly highly likely), i can't switch to utf8 strings because it will break all my consumers.

And, if i don't use utf8 strings, similarly my consumers will be less likely to as well since they would not want the costs marshalling to/from all libs.

--

I talked to @jcouv about this and the approach that feels like it would be most likely to succeed would be to provide a way to switch the .net runtime to/from utf8 mode (on a process boundary most likely). The benefits here are:

  1. users can switch over everything entirely to utf8 when it is acceptable for their domain.
  2. most apps would immediately get a near 50% reduction in memory for all their strings (iirc measurements showed that 90%+ of all strings are simple ascii).
  3. apps/libraries get switched over all at once based on the needs of the final consumers.
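As a rough illustration of the memory claim (a sketch only; it counts encoded payload bytes and ignores object headers and any runtime details), the existing `Encoding.GetByteCount` API shows the two directions the size can go:

```csharp
using System;
using System.Text;

string ascii = "Hello world"; // 11 ASCII characters
string cjk = "中文";           // 2 characters from the Basic Multilingual Plane

int asciiUtf8 = Encoding.UTF8.GetByteCount(ascii);     // 1 byte per ASCII character
int asciiUtf16 = Encoding.Unicode.GetByteCount(ascii); // 2 bytes per UTF-16 code unit
int cjkUtf8 = Encoding.UTF8.GetByteCount(cjk);         // 3 bytes per character here
int cjkUtf16 = Encoding.Unicode.GetByteCount(cjk);     // 2 bytes per character here

Console.WriteLine($"ASCII: {asciiUtf8} bytes as UTF-8, {asciiUtf16} bytes as UTF-16");
Console.WriteLine($"CJK:   {cjkUtf8} bytes as UTF-8, {cjkUtf16} bytes as UTF-16");
```

ASCII-only text halves in size, while text in this CJK range grows by half, so the overall saving depends on how much of a process's string data really is ASCII.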

There is a downside in this that often gets brought up. Namely that utf8 strings do have different perf behavior for some ops over strings (namely indexing). However, this doesn't actually seem like a critical problem to me. First, remember that what i'm proposing involves a switch (either opt-in or opt-out) to use utf8 across the board. As such, if someone is in a domain where they index heavily and get a perf hit, they can not use utf8 until they address that problem. Second, i think the problem seems somewhat overblown in terms of how bad it is. We can likely break string indexing up into two domains:

  1. people streaming through a string with monotonically increasing indexes. This can be addressed by: 1.1 pushing those people (with analyzers) to use iterators instead. 1.2 having the runtime be slightly smarter with string indexing. like many utf8 systems out there it could store additional information in the runtime about the last index operation that happened on the last few strings. If the user passes in str[i] and then str[i + 1] the information about the location collected in the first op can be used to make the second fast.
  2. people randomly accessing string indices. This seems like this would be a very small subset of users. And, if that space was truly important, they: 2.1 could opt-out of utf8 strings 2.2 could use some new type that guaranteed constant random access for a string. maybe a new Utf16String, or just a char[] or ImmutableArray<char>
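Point 1.2 can be sketched in user code. This is only an illustration of the caching idea, not how any runtime implements it: the cursor variables and the `RuneAt` local function are hypothetical, and a real runtime would keep such state per string.

```csharp
using System;
using System.Buffers;
using System.Text;

byte[] data = Encoding.UTF8.GetBytes("aα中文b");

// Remembered position of the last lookup: scalar index and matching byte offset.
int cachedRuneIndex = 0;
int cachedByteOffset = 0;

// Streaming access with increasing indexes resumes from the cursor each time.
for (int i = 0; i < 5; i++)
    Console.WriteLine(RuneAt(i));

// Hypothetical helper (an assumption for illustration).
Rune RuneAt(int index)
{
    // Going backwards is not helped by this cache, so restart from the beginning.
    if (index < cachedRuneIndex)
    {
        cachedRuneIndex = 0;
        cachedByteOffset = 0;
    }

    while (true)
    {
        if (cachedByteOffset >= data.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        OperationStatus status = Rune.DecodeFromUtf8(
            data.AsSpan(cachedByteOffset), out Rune rune, out int consumed);
        if (status != OperationStatus.Done)
            throw new FormatException("Invalid UTF-8 sequence.");

        if (cachedRuneIndex == index)
            return rune;

        cachedRuneIndex++;
        cachedByteOffset += consumed;
    }
}
```

With the cursor, a forward loop over `n` scalars does O(n) total decoding work instead of O(n²); random or backward access still pays for a rescan, which is the case point 2 is about.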

Basically, it feels like there is a path that can get us to a future where almost everyone (final consumers and libraries alike) are on utf8 and the entire ecosystem gets the massive memory savings. It comes at the complexity of having opt-in/out and potentially needing some analyzers/classes for the people using strings in uncommon ways today. However, it seems much better to me than introducing a new utf8 string type that is highly unlikely to be picked up.

CyrusNajmabadi avatar Oct 26 '19 18:10 CyrusNajmabadi

As an example of how we have a problem, take a look at Roslyn itself, including the entire Roslyn API we ship.

  1. it is massively System.String based everywhere.
  2. It uses a huge amount of memory internally with strings. IIRC measurements have shown it's >50% of our memory usage in compiler and IDE.

How could Roslyn itself possibly get the benefits of utf8 strings?

  1. We could try switching to it internally, but our marshalling points between the internal and public layers would kill us. For example, every time the IDE accessed a string-property exposed by the compiler, we would take a marshalling hit. And we access those string-properties continuously.
  2. We could try to expose both types of strings somehow? allowing consumers to move to utf8 when possible, while still having the System.String property. But how would this look? .Name and .Name8? How would memory not explode in such a world?
  3. We could switch wholesale over to utf8 strings for our entire surface area. But that would break 100% of the ecosystem out there.

Effectively, afaict, a project like Roslyn could never move to utf8. And we're one of the projects that would benefit the most here. We likely would save gigabytes of memory on real projects on user boxes.

So, as mentioned in the start, this overall approach seems highly limited and constraining. It will only help projects that are isolated and can completely switch over without having to worry about dependencies. The overall ecosystem will find it nearly impossible to switch.

Conversely, the approach I outlined gives a path forward that allows big savings immediately across the board, with appropriate mechanisms for people to deal with rare problems if they arise. Then, if problems do occur in some places, they can be fixed up without holding the rest of the ecosystem back.

CyrusNajmabadi avatar Oct 26 '19 18:10 CyrusNajmabadi

@CyrusNajmabadi Is there an ongoing discussion on your "side-channel" proposal without introducing a new UTF8String type?

I share your concerns about the fragmentation problem a new UTF8String type would bring. However, I could only find the discussion around the design of the new utf8 types https://github.com/dotnet/corefxlab/issues/2350 and the older compact string proposal: https://github.com/dotnet/coreclr/issues/7083

davidroth avatar Oct 30 '19 12:10 davidroth

@CyrusNajmabadi Is there an ongoing discussion on your "side-channel" proposal without introducing a new UTF8String type?

No clue. @jcouv @gafter is there any hope of this being not a side-channel type? note: personally, i think this is an appropriate hill to die on. It is that important.

CyrusNajmabadi avatar Oct 30 '19 19:10 CyrusNajmabadi

@CyrusNajmabadi That is a question for corefxlab, coreclr, and corefx, possibly focused at https://github.com/dotnet/corefxlab/issues/2350. This proposal isn't going anywhere without that team making a decision about what they want to do to support UTF-8. If the answer is a new type, this proposal applies.

@orthoxerox Re "Will there be a type that represents potentially invalid UTF-8 strings, like Linux file paths?". Are you asking about ReadOnlySpan<byte>?
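For context on why a raw byte span fits that case, here is a small sketch using only existing `System.Text` APIs. The byte sequences are made up for illustration and `IsValidUtf8` is a hypothetical helper: a name containing a byte that is not valid UTF-8 cannot round-trip through `string`, but it survives unchanged as bytes.

```csharp
using System;
using System.Text;

byte[] validName = { 0x66, 0x6F, 0x6F };   // "foo"
byte[] invalidName = { 0x66, 0x6F, 0xFF }; // 0xFF never appears in well-formed UTF-8

// Strict decoder: throws instead of substituting U+FFFD for invalid input.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

Console.WriteLine(IsValidUtf8(validName));
Console.WriteLine(IsValidUtf8(invalidName));

// The default decoder is lossy: the invalid byte becomes U+FFFD,
// so encoding the string again does not reproduce the original bytes.
string lossy = Encoding.UTF8.GetString(invalidName);
byte[] roundTripped = Encoding.UTF8.GetBytes(lossy);
Console.WriteLine(roundTripped.Length);

// Hypothetical helper (an assumption for illustration).
bool IsValidUtf8(byte[] bytes)
{
    try
    {
        strict.GetString(bytes);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}
```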

gafter avatar Oct 30 '19 19:10 gafter

@gafter will System.IO classes use ReadOnlySpan<byte>?

Like, IEnumerable<ReadOnlySpan<byte>> System.IO.Directory.EnumerateFiles(ReadOnlySpan<byte> path)?

orthoxerox avatar Oct 30 '19 19:10 orthoxerox

@orthoxerox You would have to ask the folks designing those APIs.

gafter avatar Oct 30 '19 21:10 gafter

I keep going back and forth on this... @CyrusNajmabadi's concerns are absolutely what I have felt to be the biggest downside, and I share the opinion that indexing into the string is not a major concern: I don't see indexing UTF-16 code units as being much different from indexing UTF-8 code units, as both encodings are variable-length, and so one code unit does not always represent one code point (nevermind that one code point does not always represent one character, depending on what the developer has in mind when they talk about "the third character in this string").
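A short sketch of that distinction, using existing .NET APIs (`string.Length`, `String.EnumerateRunes`, `StringInfo`); the sample strings and the `CountRunes` helper are just illustrations:

```csharp
using System;
using System.Globalization;
using System.Text;

string astral = "𨙧";          // one code point outside the Basic Multilingual Plane
string combining = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT

// UTF-16 code units, Unicode code points, and user-perceived characters can all differ.
Console.WriteLine($"{astral.Length} UTF-16 units, {CountRunes(astral)} code point, " +
                  $"{Encoding.UTF8.GetByteCount(astral)} UTF-8 units");
Console.WriteLine($"{combining.Length} UTF-16 units, {CountRunes(combining)} code points, " +
                  $"{new StringInfo(combining).LengthInTextElements} text element");

// Hypothetical helper (an assumption for illustration): counts Unicode scalar values.
int CountRunes(string s)
{
    int count = 0;
    foreach (Rune rune in s.EnumerateRunes())
        count++;
    return count;
}
```

In both encodings an index addresses a code unit, so neither gives "the third character" for free.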

I mean, if you were to ask me, "hey @airbreather, if you were designing C# / .NET from scratch, what encoding would you use to store character data in string?", then I would say "UTF-8" without a hint of hesitation (I feel more strongly about this point than I do about array covariance being a mistake). But there's so much momentum behind UTF-16 strings that I can't unequivocally support this proposal: introducing UTF-8 companion types to today's UTF-16 string / char has a very real risk of harming performance, as the majority of the users of the UTF-8 stuff would wind up marshaling anyway to interop with third-party code that uses UTF-16 (edit: at least in the short-term until adoption picks up).

I'm also not terribly optimistic that this will really bear fruit without also investing significantly in CoreFX to add comprehensive first-class support, like what was done for Span<T> / ReadOnlySpan<T> / Memory<T> / ReadOnlyMemory<T>, and I can definitely imagine that major established third-party libraries would not share my enthusiasm for adding parallels in their public API surface.

Ultimately, however, I've settled on a :+1: for this. I personally have a phobia about wasting CPU cycles and virtual memory bytes, so if LDT thinks that, in spite of the concerns raised here, this is something that has a realistic chance of making UTF-8 more of a first-class member of our ecosystem, then I'd be delighted to see this next important step towards breaking the chicken-and-egg feedback loop of:

  • The language and out-of-the-box APIs make it much easier to use UTF-16 than UTF-8, so practically every library and application uses UTF-16 for their strings, and
  • Practically everybody uses UTF-16, so most investments in the language and out-of-the-box APIs tend to go towards making things easier for people who have UTF-16 strings

We could try to expose both types of strings somehow? allowing consumers to move to utf8 when possible, while still having the System.String property. But how would this look? .Name and .Name8? How would memory not explode in such a world?

@CyrusNajmabadi in this example, would it be viable for Roslyn to use .Name8 as the actual storage, but keep the existing .Name properties around with accessors that marshal to/from UTF-16 on demand?

  • Performance-sensitive code paths in the IDE (and elsewhere) would be highly encouraged to switch to .Name8, and there could be Roslyn-specific analyzers that help identify these.
  • You get the performance benefits of the "hard break" proposal, without behavior changes.

Admittedly, the prospect of ~doubling the public API surface alone may be enough to kill this idea...

airbreather avatar Nov 16 '19 00:11 airbreather

Admittedly, the prospect of ~doubling the public API surface alone may be enough to kill this idea...

Yes. It seems like it would just be awful :-/

CyrusNajmabadi avatar Nov 16 '19 02:11 CyrusNajmabadi

Would it be possible to make a UTF8String implementation that extends the type String, so there are no changes to the APIs? The UTF8String would become an implementation detail. Maybe even other implementations of String would become possible, like strings based on a Span<T>.

Or wait until "Type Classes" are introduced and then make a String "Type Class" https://github.com/dotnet/csharplang/issues/110

inforithmics avatar Dec 06 '19 13:12 inforithmics

@inforithmics Almost any change to System.String would be a breaking change. For example, String has a contract that its chars can be accessed by index in constant time. That is not true of utf8 strings, so we do not want to expose that same API for them.

gafter avatar Dec 06 '19 19:12 gafter

You are totally right. In my opinion the situation could only be simplified with Type Classes of C# 10, where a String "Type Class" could be added and the String interfaces would accept it, so it wouldn't matter whether it is a String or a UTF8String.

I read a little about other programming languages that moved from one string representation to another and I stumbled upon Swift, where they changed the representation from UTF-16 to UTF-8 going to version 5 of Swift https://swift.org/blog/utf8-string/ They had the advantage of already having a base string class with different implementations, so it simplified things for them, but it was still an ABI breakage.

The reason I'm suggesting a base String type of some kind is that the current implementation is not very friendly to large strings (https://mattwarren.org/2016/05/31/Strings-and-the-CLR-a-Special-Relationship/), because it needs large contiguous blocks of memory. So if efficient String storage is needed, the only possibility at the moment is to use (jagged) byte arrays with wrapper methods, because large contiguous blocks of memory are a problem in a .NET process with pinned objects, which cannot be rearranged. And this happens a lot in native .NET interop scenarios.

inforithmics avatar Dec 08 '19 16:12 inforithmics

Isn't this a subset of a much larger feature - deterministic functions - and already completed work in .NET 6 by means of unfolding constants?

static byte[] _helloWorldUtf8Bytes = Encoding.UTF8.GetBytes("Hello world");
// is JITted to
static byte[] _helloWorldUtf8Bytes = new byte[] { .... };

TahirAhmadov avatar Nov 11 '21 16:11 TahirAhmadov

No. See the motivation section of the proposal, which mentions that exact pattern. There are startup costs as the JIT has to do the conversion, and you still have to pay memory for the UTF-16 representation you're never going to use.
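For comparison, a minimal sketch of the two patterns side by side, assuming the `u8` suffix as it eventually shipped in C# 11, where the literal's natural type is `ReadOnlySpan<byte>`. As I understand the shipped feature, the literal's bytes are encoded at compile time and stored in the assembly, so no UTF-16 string or runtime transcoding is involved.

```csharp
using System;
using System.Text;

// Runtime transcoding: a UTF-16 string literal is loaded, then encoded when this line runs.
byte[] transcoded = Encoding.UTF8.GetBytes("Hello world");

// UTF-8 literal (C# 11): the compiler encodes the text at compile time.
ReadOnlySpan<byte> literal = "Hello world"u8;

// Both approaches yield the same bytes; only where and when the encoding happens differs.
Console.WriteLine(literal.SequenceEqual(transcoded)); // True
```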

333fred avatar Nov 11 '21 16:11 333fred

No. See the motivation section of the proposal, which mentions that exact pattern. There are startup costs as the JIT has to do the conversion, and you still have to pay memory for the UTF-16 representation you're never going to use.

These 2 reasons are also applicable to justify adding "constant expressions" to C#, not just for UTF-8 strings but everything else:

class UTF8Encoding // or w/e it's called
{
  public static deterministic byte[] GetBytes(string s)  { ... }
}
static readonly byte[] _bytes = Encoding.UTF8.GetBytes("Hello world");
// is compiled by C# to this IL:
static readonly byte[] _bytes = new byte[] { .... };

PS. The proposal talks about static readonly, but not JIT unfolding. But even if it did, my point above still stands.

TahirAhmadov avatar Nov 11 '21 17:11 TahirAhmadov

Follow up from a conversation with @333fred and @tannergooding on the C# Discord (#lowlevel)

Wanted to comment and add that it would be awesome if [CallerMemberName], [CallerArgumentExpression], and nameof expressions got support for target parameters of type ReadOnlySpan<byte> (and byte[] too if we wanted consistency there).

To provide a practical example and some context on this, we could leverage this in the Store to make our managed trace logger providers more efficient. We've been migrating our remaining C++ code to C#, and one of the things I'm currently working on is a managed version of our trace logging provider. This is a manifest-less ETW provider, meaning it needs each event to also get a metadata binary blob encoding all parameters being passed in the actual event data descriptors. C++ uses some macros to achieve this, whereas in C# I've come up with a builder-like approach that lets you build the metadata and event descriptor buffers in a declarative way. It looks something like this (I've garbled up the various names):

using TracingDataBuilder builder = TracingDataBuilder.Create();

builder.AppendEventTagAndName(SOME_TAG.BAR); // [CallerMemberName]
builder.AppendWStringKeyValuePair(someText, someTextLength, "Some literal");
builder.AppendWStringKeyValuePair(someOtherText, someOtherTextName); // [CallerArgumentExpression]
builder.AppendWStringKeyValuePair(someId, someIdLength); // [CallerArgumentExpression]
builder.AppendWStringKeyValuePair(someType, someTypeLength); // [CallerArgumentExpression]
builder.AppendBoolKeyValuePair(&someBoolParameter, nameof(someBoolParameter));
builder.AppendBoolKeyValuePair(&someOtherBoolParameter, nameof(someOtherBoolParameter));
builder.AppendInt32KeyValuePair(&someIntParameter, nameof(someIntParameter));
builder.AppendInt32KeyValuePair(&someOtherIntParameter, "Some other literal");

_ = builder.EventWriteTransfer(_traceLogger, descriptor, null, null);

There are 3 ways each parameter name is passed, as you can see:

  • [CallerMemberName] or [CallerArgumentExpression]
  • nameof
  • String literal

Now, the parameter name in the metadata blob needs to be encoded as a UTF8 string, meaning that currently I need to encode each parameter name into the target buffer. This is still zero-allocation (using the Encoding.GetBytes overload taking a target range), but it's not the fastest option. If we could instead change those parameters to just be ReadOnlySpan<byte>, the builder could instead just blit their contents directly into the target metadata buffer, without having to do any conversion at runtime.

Without support for those 3 scenarios, the alternative would be to either just keep using a string and do the conversion at runtime (slow), or always use a UTF8 string literal, meaning the code would end up being much more verbose and error prone (string literals everywhere). It'd be great if the new UTF8 string was just extended to support the existing scenarios here 😄

EDIT: if all of these features couldn't be added, having support for just nameof would at least be a major win, as it'd avoid having to pass hardcoded string literals everywhere, which is particularly error prone.
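The difference between the two paths can be sketched as follows. This is not the actual builder: the buffer handling is a simplified stand-in, and the second path assumes the `u8` literal syntax as shipped in C# 11.

```csharp
using System;
using System.Text;

Span<byte> metadata = stackalloc byte[64];
int written = 0;

// Path 1: the name arrives as a UTF-16 string and is transcoded into the buffer at runtime.
string nameAsString = "someBoolParameter";
written += Encoding.UTF8.GetBytes(nameAsString, metadata.Slice(written));

// Path 2: the name is already UTF-8, so it is copied (blitted) with no conversion.
ReadOnlySpan<byte> nameAsUtf8 = "someIntParameter"u8;
nameAsUtf8.CopyTo(metadata.Slice(written));
written += nameAsUtf8.Length;

Console.WriteLine(Encoding.UTF8.GetString(metadata.Slice(0, written)));
```

Neither path allocates, but only the second skips the encoder entirely; the request above is for `nameof` and the caller-info attributes to be able to feed that second path.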

Sergio0694 avatar Mar 23 '22 21:03 Sergio0694

Some feedback/questions about the design were raised here: https://github.com/dotnet/csharplang/discussions/5983

tannergooding avatar Apr 04 '22 22:04 tannergooding

Assuming that the natural type of "literal"u8 is byte[] or ROS<byte>, is there going to be some kind of marker within the assembly data that says "this is a UTF-8 literal" vs. "this is an instantiation of some binary data blob"?

I'm specifically thinking of disassembly / debugging / diagnostic scenarios. If a decompiler sees this in the IL stream:

ldc.i4.5
newarr [System.Runtime]System.Byte
dup
ldtoken field valuetype <foo>
call <initialize_array_helper>

Which of these two should that decompile into?

byte[] a = new byte[] { 0x48, 0x65, 0x6C, 0x6C, 0x6F };
byte[] b = "Hello"u8;

A marker somewhere that a diagnostic tool could inspect would prevent guessing and would ensure that the tool displays the correct human-friendly representation.
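The ambiguity is easy to reproduce at the value level. A minimal sketch, assuming the `u8` syntax as shipped in C# 11 (where the literal is a `ReadOnlySpan<byte>`, hence the `ToArray()` call): once both forms are materialized, nothing in the resulting bytes says which source form produced them.

```csharp
using System;

// An explicit binary blob and a UTF-8 literal that happen to contain the same bytes.
byte[] fromBlob = new byte[] { 0x48, 0x65, 0x6C, 0x6C, 0x6F };
byte[] fromLiteral = "Hello"u8.ToArray();

Console.WriteLine(fromBlob.AsSpan().SequenceEqual(fromLiteral)); // True
```

A tool could guess "text" whenever the bytes happen to decode as valid UTF-8, but that is a heuristic, which is the guessing a marker would avoid.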

GrabYourPitchforks avatar Apr 15 '22 04:04 GrabYourPitchforks

Tagging @AlekseyTs for Levi's question about decompilation and debugger representation.

jcouv avatar Apr 15 '22 06:04 jcouv

is there going to be some kind of marker within the assembly data that says "this is a UTF-8 literal" vs. "this is an instantiation of some binary data blob"?

At the moment there are no plans to have any markers like that.

AlekseyTs avatar Apr 15 '22 13:04 AlekseyTs

At the moment there are no plans to have any markers like that.

Thanks for the response. Is this being tracked anywhere for future implementation, with criteria for what would move it above the cut line, or is this more of a "we're not interested in ever doing this" thing?

GrabYourPitchforks avatar Apr 15 '22 15:04 GrabYourPitchforks