command-line-api icon indicating copy to clipboard operation
command-line-api copied to clipboard

Is the mutable string in StringExtensions safe wrt deduplication and hashing?

Open KalleOlaviNiemitalo opened this issue 3 years ago • 6 comments

https://github.com/dotnet/command-line-api/pull/1510 made StringExtensions.Tokenize create a string and mutate it:

https://github.com/dotnet/command-line-api/blob/209b724a3c843253d3071e8348c353b297b0b8b5/src/System.CommandLine/Parsing/StringExtensions.cs#L232-L245

Is there a risk that a future version of the .NET Runtime garbage collector might deduplicate this string with some other String instance that happens to contain the same characters? Then the mutation would affect that other string too.

https://github.com/dotnet/runtime/issues/9022#issuecomment-332679956 suggests not deduplicating to a pinned string, but even so, there would be a small window between the String construction and the fixed statement.

Also, if .NET Runtime starts caching the String.GetHashCode() result in the String instance, then that will be out of date after the mutation, which could cause Dictionary<string, Token>.TryGetValue not to find the string. The runtime might skip this caching for short strings though, to save memory.

KalleOlaviNiemitalo avatar Jul 16 '22 11:07 KalleOlaviNiemitalo

Mutating strings like this is a bad practice and not supported by .NET runtime.

You should use string constructor that takes ReadOnlySpan<char> to avoid the allocation here.

jkotas avatar Jul 16 '22 18:07 jkotas

That constructor doesn't exist on netstandard2.0 though, so one would then need an #if here. Also, it would still allocate one string per iteration of the loop. Likely not significant though; users aren't going to bundle dozens of options in the same string.

A different strategy comes to mind: instead of collecting all known tokens into the Dictionary<string, Token>, check whether it's two characters long with '-' as the first character and place the second character in a separate Dictionary<char, Token>. That way, the unbundling loop would be able to look it up without constructing a string. But perhaps that would increase the initial JIT compilation cost more than the string allocation is worth.

KalleOlaviNiemitalo avatar Jul 16 '22 20:07 KalleOlaviNiemitalo

That constructor doesn't exist on netstandard2.0 though, so one would then need an #if here.

Yes, that's what we do in dotnet/runtime in situations like this.

jkotas avatar Jul 16 '22 20:07 jkotas

@KalleOlaviNiemitalo You can simulate this in a .NET Standard 2.0-compatible way without doing anything unsupported:

using System;
using System.Buffers;
using System.Text;

#pragma warning disable CS8500

public static unsafe class StringExtensions
{
    public delegate void SpanAction<T, in TArg>(Span<T> span, TArg arg);

    public static string Create<TState>(int length, TState state, SpanAction<char, TState> action)
    {
        var args = new CreateEncoding<TState>.State
        {
            state = &state,
            action = &action
        };
        try
        {
            return CreateEncoding<TState>.Instance.GetString((byte*)&args, length);
        }
        finally
        {
            GC.KeepAlive(state);
            GC.KeepAlive(action);
        }
    }

    public static string CreateString<TState>(this SpanAction<char, TState> action, int length, TState state)
    {
        return Create(length, state, action);
    }

    sealed class CreateEncoding<TState> : Encoding
    {
        public static readonly CreateEncoding<TState> Instance = new();

        public struct State
        {
            public TState* state;
            public SpanAction<char, TState>* action;
        }

        private CreateEncoding()
        {

        }

        public override int GetChars(byte* bytes, int byteCount, char* chars, int charCount)
        {
            if(byteCount != charCount)
            {
                throw new ArgumentException("Destination string was not properly allocated.", nameof(charCount));
            }
            var state = (State*)bytes;
            state->action->Invoke(new Span<char>(chars, charCount), *state->state);
            return charCount;
        }

        public override int GetCharCount(byte* bytes, int count)
        {
            return count;
        }

        public override int GetMaxCharCount(int byteCount)
        {
            return byteCount;
        }

        #region Unused conversions
        public override int GetMaxByteCount(int charCount)
        {
            throw new NotImplementedException();
        }

        public override int GetCharCount(byte[] bytes, int index, int count)
        {
            throw new NotImplementedException();
        }

        public override int GetByteCount(char[] chars, int index, int count)
        {
            throw new NotImplementedException();
        }

        public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
        {
            throw new NotImplementedException();
        }

        public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
        {
            throw new NotImplementedException();
        }
        #endregion
    }
}

This makes use of a virtual encoding whose GetChars method is used to call back to the SpanAction to populate the span, and behaves fully within the contracts of Encoding. On .NET Core, this even actually uses almost the same code as the normal String.Create method, so there are no performance downsides (unless the implementation changes Encoding.GetString). It uses pointers since that is what Encoding works with, but it is perfectly safe ‒ all pointers point to the stack, so there is no need to be concerned with pinning, and while it poses as a byte*, the runtime does not care at all what it points to, since GetChars is the only method supposed to read the memory.

IS4Code avatar Mar 21 '24 20:03 IS4Code

@IS4Code, that is a clever hack, but I don't think it should be used in the StringExtensions method that I linked above. There, it would be simpler to stackalloc a char[2] and use the string(char*, int, int) constructor. That would not need any virtual method calls.

KalleOlaviNiemitalo avatar Mar 22 '24 08:03 KalleOlaviNiemitalo