Guid conversion doesn't take into account Microsoft's mixed-endian encoding
Hey, thanks for providing this library; it's been very useful to me.
I happened to notice that the Guids that were converted from certain Ulids (and vice-versa) didn't maintain the proper lexicographical sort order that I would have expected. I believe that this is because of Microsoft's screwy way of encoding the byte arrays (https://stackoverflow.com/a/16722909).
I forked and made a commit that appears to fix this at https://github.com/kdawg14/NUlid/commit/c5ae7c8cbe4a1bbc041e0bbb0ff98faa005bb59b
I'm not certain exactly how you like to handle pull requests, or if this was something that was intentionally done the way that it was, so I figured I'd post this here and see if this looks like a good correction or if you maybe have any additional information on the subject.
Thanks!
I'm not sure I agree. What MS does when representing GUID's as a string is irrelevant I think. There's no 'defined' way to convert a ULID to a GUID and vice versa - as long as it's consistent. Nowhere is it guaranteed that the GUID and ULID lexicographicallity is the same before/after conversion from one to the other.
If I understand correctly you're basing your findings on the lexicographical order of the string ("hex") representation of a GUID, but who's to say guid 00000000-0001-0000-0000-000000000000 comes before/after 00000000-0002-0000-0000-000000000000? It may, from a string representation point of view, but it may not from a binary point of view. And most systems (like RDBMS'es) will sort on the binary representation (assuming the GUID's are stored correctly and not as varchar/string/whatever).
So I'm not sure we should change this, but I'm open to discussion.
Having said that: I didn't know/realize MS did such hocus-pocus on the first few bytes, so, either way: Today I learned.
Thanks for the quick response. I'm not certain that I agree either, so thanks for the feedback.
However, I don't think that it's just the hex string representation of the GUID that's the issue, though that does show the problem the same way that the binary form would/does. I didn't do a close inspection of the binary, but I assume that the Guid.CompareTo method used in the tests is operating off that, and it shows results consistent with the lexicographical ordering of the hex string.
My understanding (which mostly consists of reading the StackOverflow and Wikipedia pages that I linked to in the comments, as well as the included tests, so it's likely far from perfect) is that when serializing/deserializing GUIDs, you'd need to correct the endianness in the source or result byte arrays to maintain the UUID as it would normally be represented in serialized form.
Since the ULID spec is a subset of the UUID spec, it seems reasonable to expect that it could be converted between different representations of a UUID while maintaining the same sort order, but I admit that is an assumption that I don't know is definitely true.
Thanks again!
Since the ULID spec is a subset of the UUID spec
It's not?
I guess more accurately, I meant that it has 128-bit compatibility with UUID, and so converting to/from ULID from/to another 128-bit UUID should be able to maintain the same sortability, unless there are other considerations that I'm not thinking of.
https://github.com/rmalayter/ulid-mssql points out some reasons to follow the odd Microsoft byte order when converting to GUIDs for storage in MS SQL Server: the sort order in database queries matches the ULID sort order and possibly more importantly the sort order in indexes matches the ULID sort order. In microbenchmarks the "properly" sorted GUID versions of ULIDs result in better clustered indexes than baseline (v6) GUIDs.
Interoperability with ulid-mssql and C# code would be potentially extremely useful to me.
One issue I have with changing this is that it breaks the current implementation. Whatever .ToGUID() returns will change as will creating the ULID's from GUID's. If anything I thing we should consider creating an overload with an enum that specifies which behaviour we want ("old" vs "new") and default to old behaviour.
Second thing I would change is the 3 calls to Array.Reverse(); I'm sure simply returning the bytes in the correct order like this performs much better:
private static byte[] FixGuidByteArrayEndianness(byte[] bytes)
{
return new[] { bytes[3], bytes[2], bytes[1], bytes[0], bytes[5], bytes[4], ... } // Or whatever the order is...
}
I'll sit on it for a day or two and get back to this.
I was reminded that Raymond Chen has a great overview of the multiple ways Microsoft sorts GUIDs: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=102450
One interesting bit there is the reminder that System.Data.SqlGuid exists as a separate struct. https://docs.microsoft.com/en-us/dotnet/api/system.data.sqltypes.sqlguid?view=net-5.0
Instead of an enum to define the two behaviors, you could use SqlGuid versus Guid. I'm not sure if there might be pitfalls there, but it might get the idea across that the behavior difference is primarily because of the difference between general use Guids and ones intended for SQL Server.
I took a first stab at it as extension methods to explore the idea, especially with respect to EF Core:
public static class UlidExtensions
{
public static SqlGuid ToSqlGuid(this Ulid ulid)
{
var bytes = ulid.ToByteArray();
return new SqlGuid(new byte[] { bytes[6], bytes[7], bytes[8], bytes[9], bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15], bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5] });
}
public static Ulid ToUlid(this SqlGuid guid)
{
var bytes = guid.ToByteArray();
return new Ulid(new byte[] { bytes[10], bytes[11], bytes[12], bytes[13], bytes[14], bytes[15], bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7], bytes[8], bytes[9] });
}
}
Apparently EF Core doesn't support SqlGuid as a data type, so in the EF Core Value Converter I have to ulid.ToSqlGuid().Value and new SqlGuid(guid).ToUlid(), so I'm not sure using the SqlGuid type as an intermediary is as strong of an idea as it first seemed now that I have tested it against EF Core. I haven't yet double checked that it round-trips with ulid.sql, but that's next for me to double check that I got the byte reordering correct.