NUlid icon indicating copy to clipboard operation
NUlid copied to clipboard

Guid conversion doesn't take into account Microsoft's mixed-endian encoding

Open kdawg14 opened this issue 4 years ago • 8 comments

Hey, thanks for providing this library; it's been very useful to me.

I happened to notice that the Guids that were converted from certain Ulids (and vice-versa) didn't maintain the proper lexicographical sort order that I would have expected. I believe that this is because of Microsoft's screwy way of encoding the byte arrays (https://stackoverflow.com/a/16722909).

I forked and made a commit that appears to fix this at https://github.com/kdawg14/NUlid/commit/c5ae7c8cbe4a1bbc041e0bbb0ff98faa005bb59b

I'm not certain exactly how you like to handle pull requests, or if this was something that was intentionally done the way that it was, so I figured I'd post this here and see if this looks like a good correction or if you maybe have any additional information on the subject.

Thanks!

kdawg14 avatar Feb 10 '21 04:02 kdawg14

I'm not sure I agree. What MS does when representing GUID's as a string is irrelevant I think. There's no 'defined' way to convert a ULID to a GUID and vice versa - as long as it's consistent. Nowhere is it guaranteed that the GUID and ULID lexicographicallity is the same before/after conversion from one to the other.

If I understand correctly you're basing your findings on the lexicographical order of the string ("hex") representation of a GUID, but who's to say guid 00000000-0001-0000-0000-000000000000 comes before/after 00000000-0002-0000-0000-000000000000? It may, from a string representation point of view, but it may not from a binary point of view. And most systems (like RDBMS'es) will sort on the binary representation (assuming the GUID's are stored correctly and not as varchar/string/whatever).

So I'm not sure we should change this, but I'm open to discussion.

Having said that: I didn't know/realize MS did such hocus-pocus on the first few bytes, so, either way: Today I learned.

RobThree avatar Feb 10 '21 11:02 RobThree

Thanks for the quick response. I'm not certain that I agree either, so thanks for the feedback.

However, I don't think that it's just the hex string representation of the GUID that's the issue, though that does show the problem the same way that the binary form would/does. I didn't do a close inspection of the binary, but I assume that the Guid.CompareTo method used in the tests is operating off that, and it shows results consistent with the lexicographical ordering of the hex string.

My understanding (which mostly consists of reading the StackOverflow and Wikipedia pages that I linked to in the comments, as well as the included tests, so it's likely far from perfect) is that when serializing/deserializing GUIDs, you'd need to correct the endianness in the source or result byte arrays to maintain the UUID as it would normally be represented in serialized form.

Since the ULID spec is a subset of the UUID spec, it seems reasonable to expect that it could be converted between different representations of a UUID while maintaining the same sort order, but I admit that is an assumption that I don't know is definitely true.

Thanks again!

kdawg14 avatar Feb 10 '21 14:02 kdawg14

Since the ULID spec is a subset of the UUID spec

It's not?

RobThree avatar Feb 10 '21 16:02 RobThree

I guess more accurately, I meant that it has 128-bit compatibility with UUID, and so converting to/from ULID from/to another 128-bit UUID should be able to maintain the same sortability, unless there are other considerations that I'm not thinking of.

kdawg14 avatar Feb 10 '21 17:02 kdawg14

https://github.com/rmalayter/ulid-mssql points out some reasons to follow the odd Microsoft byte order when converting to GUIDs for storage in MS SQL Server: the sort order in database queries matches the ULID sort order and possibly more importantly the sort order in indexes matches the ULID sort order. In microbenchmarks the "properly" sorted GUID versions of ULIDs result in better clustered indexes than baseline (v6) GUIDs.

Interoperability with ulid-mssql and C# code would be potentially extremely useful to me.

WorldMaker avatar Apr 22 '21 21:04 WorldMaker

One issue I have with changing this is that it breaks the current implementation. Whatever .ToGUID() returns will change as will creating the ULID's from GUID's. If anything I thing we should consider creating an overload with an enum that specifies which behaviour we want ("old" vs "new") and default to old behaviour.

Second thing I would change is the 3 calls to Array.Reverse(); I'm sure simply returning the bytes in the correct order like this performs much better:

private static byte[] FixGuidByteArrayEndianness(byte[] bytes)
{
    return new[] { bytes[3], bytes[2], bytes[1], bytes[0], bytes[5], bytes[4], ... } // Or whatever the order is...
}

I'll sit on it for a day or two and get back to this.

RobThree avatar Apr 22 '21 21:04 RobThree

I was reminded that Raymond Chen has a great overview of the multiple ways Microsoft sorts GUIDs: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=102450

One interesting bit there is the reminder that System.Data.SqlGuid exists as a separate struct. https://docs.microsoft.com/en-us/dotnet/api/system.data.sqltypes.sqlguid?view=net-5.0

Instead of an enum to define the two behaviors, you could use SqlGuid versus Guid. I'm not sure if there might be pitfalls there, but it might get the idea across that the behavior difference is primarily because of the difference between general use Guids and ones intended for SQL Server.

WorldMaker avatar Apr 26 '21 21:04 WorldMaker

I took a first stab at it as extension methods to explore the idea, especially with respect to EF Core:

    public static class UlidExtensions
    {
        public static SqlGuid ToSqlGuid(this Ulid ulid)
        {
            var bytes = ulid.ToByteArray();
            return new SqlGuid(new byte[] { bytes[6], bytes[7], bytes[8], bytes[9], bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15], bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5] });
        }

        public static Ulid ToUlid(this SqlGuid guid)
        {
            var bytes = guid.ToByteArray();
            return new Ulid(new byte[] { bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15], bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7], bytes[8], bytes[9] });
        }
    }

Apparently EF Core doesn't support SqlGuid as a data type, so in the EF Core Value Converter I have to ulid.ToSqlGuid().Value and new SqlGuid(guid).ToUlid(), so I'm not sure using the SqlGuid type as an intermediary is as strong of an idea as it first seemed now that I have tested it against EF Core. I haven't yet double checked that it round-trips with ulid.sql, but that's next for me to double check that I got the byte reordering correct.

WorldMaker avatar Apr 28 '21 00:04 WorldMaker