Provide support for UUID type (a.k.a. GUID)
Filing on behalf of a customer: Protobuf lacks UUID (Guid in .NET) support out of the box. It would be nice to have a Well-Known Type (like Timestamp, which represents dates and times), since UUIDs are pretty common, particularly in distributed systems.
CC @jskeet who was involved in the discussion.
Is this still on the roadmap?
No, this is not on our roadmap.
Any news?
No, this is not on our roadmap.
Why not?
Regardless of whether this is on the roadmap or not, I can see two possible designs:
Option 1
// Message representing a version 4 universally unique identifier. See
// RFC 4122 section 4.4 for additional information.
message UUID {
  // The two uint64s below should be populated with the most and least
  // significant 64 bits of a version 4 UUID
  // (e.g., https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html).
  uint64 most_significant_uuid_bits = 1;
  uint64 least_significant_uuid_bits = 2;
}
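Option 1 maps directly onto `java.util.UUID`, which already exposes the identifier as two 64-bit halves. A minimal sketch (not part of the proposal itself) of how the two fields would be populated and consumed on the JVM:

```java
import java.util.UUID;

public class Option1Demo {
    public static void main(String[] args) {
        UUID original = UUID.randomUUID();

        // The two fields of the Option 1 message would carry these values.
        long mostSignificantBits = original.getMostSignificantBits();
        long leastSignificantBits = original.getLeastSignificantBits();

        // Reconstructing on the receiving side is a single constructor call.
        UUID roundTrip = new UUID(mostSignificantBits, leastSignificantBits);
        assert roundTrip.equals(original);
    }
}
```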
Option 2
message UUID {
  string value = 1;
}
The UUID represented by two uint64 values will have problems with endianness: see the how-do-i-represent-a-uuid-in-a-protobuf-message discussion.
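The endianness concern comes from the fact that "16 raw bytes" means different things on different platforms. A sketch of the mismatch (the sample UUID value is arbitrary, chosen so the byte order is visible):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class EndiannessDemo {
    public static void main(String[] args) {
        UUID uuid = UUID.fromString("00112233-4455-6677-8899-aabbccddeeff");

        // RFC 4122 network order: the canonical string reads off the bytes
        // directly (ByteBuffer is big-endian by default).
        byte[] bigEndian = ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
        // bigEndian starts 00 11 22 33 44 55 66 77 ...

        // .NET's Guid.ToByteArray() stores the first three groups
        // little-endian, so the same UUID's blob would start 33 22 11 00
        // 55 44 77 66. Raw 16-byte buffers from the two platforms differ.
    }
}
```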
Doesn't RFC 4122 section 4.1.2 present a solution to the problem identified by @kucint?
bump
I think @gmabey is correct in that RFC 4122 section 4.1.2 presents a solution to allow a UUID to be encoded in binary (as opposed to text) and allow the endianness to be handled by the protobuf encoding layer (as opposed to at the application layer). This approach would have a proto-spec like below.
// A UUID, encoded in accordance with section 4.1.2 of RFC 4122.
message Uuid {
  // The low field of the timestamp (32 bits).
  fixed32 time_low = 1;
  // The middle field of the timestamp (16 bits).
  uint32 time_mid = 2;
  // The high field of the timestamp multiplexed with the version number (16 bits).
  uint32 time_hi_and_version = 3;
  // The high field of the clock sequence multiplexed with the variant (8 bits).
  uint32 clock_seq_hi_and_reserved = 4;
  // The low field of the clock sequence (8 bits).
  uint32 clock_seq_low = 5;
  // The spatially unique node identifier (48 bits).
  uint64 node = 6;
}
This would be encoded from a System.Guid in .NET/C# as follows.
Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);
TimeLow = BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(0, 4));
TimeMid = BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(4, 2));
TimeHiAndVersion = BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(6, 2));
ClockSeqHiAndReserved = bytes[8];
ClockSeqLow = bytes[9];
Node = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8, 8)) & 0x0000FFFFFFFFFFFF;
... and decoded as follows.
checked
{
    Span<byte> bytes = stackalloc byte[16];
    BinaryPrimitives.WriteUInt32LittleEndian(bytes.Slice(0, 4), TimeLow);
    BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(4, 2), (ushort)TimeMid);
    BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(6, 2), (ushort)TimeHiAndVersion);
    BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8), Node);
    bytes[8] = (byte)ClockSeqHiAndReserved;
    bytes[9] = (byte)ClockSeqLow;
    return new Guid(bytes);
}
I'm keen to get people's thoughts and feedback on this approach.
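For cross-language comparison, the six RFC 4122 section 4.1.2 fields can be pulled straight out of `java.util.UUID`'s two 64-bit halves, which already hold them in network order. This is a hypothetical sketch (class and method names are mine, not part of any proposal):

```java
import java.util.UUID;

public class Rfc4122Fields {
    // Extracts the six RFC 4122 section 4.1.2 fields from a java.util.UUID.
    public static long[] toFields(UUID uuid) {
        long msb = uuid.getMostSignificantBits();
        long lsb = uuid.getLeastSignificantBits();
        return new long[] {
            msb >>> 32,            // time_low (32 bits)
            (msb >>> 16) & 0xFFFF, // time_mid (16 bits)
            msb & 0xFFFF,          // time_hi_and_version (16 bits)
            (lsb >>> 56) & 0xFF,   // clock_seq_hi_and_reserved (8 bits)
            (lsb >>> 48) & 0xFF,   // clock_seq_low (8 bits)
            lsb & 0xFFFFFFFFFFFFL  // node (48 bits)
        };
    }

    // Reassembles the two 64-bit halves from the six fields.
    public static UUID fromFields(long[] f) {
        long msb = (f[0] << 32) | (f[1] << 16) | f[2];
        long lsb = (f[3] << 56) | (f[4] << 48) | f[5];
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        assert fromFields(toFields(original)).equals(original);
    }
}
```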
I've just done some benchmarking of string-encoded vs little endian byte array-encoded vs RFC 4122-encoded UUIDs in .NET 5 and the results are below.
| Method | Mean | Error | StdDev |
|---|---|---|---|
| ConvertToStringUuid | 52.498 ns | 0.2467 ns | 0.2308 ns |
| ConvertToLittleEndianBinaryUuid | 64.575 ns | 0.2746 ns | 0.2293 ns |
| ConvertToRfc4122Uuid | 10.849 ns | 0.0662 ns | 0.0620 ns |
| SerialiseStringUuid | 54.187 ns | 0.4128 ns | 0.5511 ns |
| SerialiseLittleEndianByteArrayUuid | 25.091 ns | 0.1539 ns | 0.1364 ns |
| SerialiseRfc4122Uuid | 61.701 ns | 0.4367 ns | 0.4085 ns |
| DeserialiseStringUuid | 224.519 ns | 1.2474 ns | 1.1058 ns |
| DeserialiseLittleEndianByteArrayUuid | 217.149 ns | 0.9792 ns | 0.8177 ns |
| DeserialiseRfc4122Uuid | 150.113 ns | 0.4288 ns | 0.3801 ns |
| ConvertFromStringUuid | 82.985 ns | 0.2628 ns | 0.2459 ns |
| ConvertFromLittleEndianByteArrayUuid | 2.475 ns | 0.0123 ns | 0.0115 ns |
| ConvertFromRfc4122Uuid | 9.682 ns | 0.0210 ns | 0.0186 ns |
The aggregate performance of the three approaches is:
- Convert and serialise StringUuid: 106.685 ns
- Convert and serialise LittleEndianBinaryUuid: 89.666 ns
- Convert and serialise Rfc4122Uuid: 72.55 ns
- Deserialise and convert StringUuid: 307.504 ns
- Deserialise and convert LittleEndianBinaryUuid: 219.624 ns
- Deserialise and convert Rfc4122Uuid: 159.795 ns
So the RFC 4122-based representation is fastest in both serialisation and deserialisation.
However, the StringUuid serialises to 38 bytes, the LittleEndianBinaryUuid to 18 bytes and the Rfc4122Uuid to 27 bytes, according to the CalculateSize() method on each message type. So, while the RFC 4122-based encoding is faster, it is 50% larger on the wire than the little-endian binary encoding.
Note also ByteString.UnsafeWrap (see #7645) will improve the ConvertToLittleEndianBinaryUuid performance when it is available.
Neither strings nor byte arrays are good solutions from a security perspective, because they can be abused in certain kinds of DoS or fuzzing attacks. I like the idea of a specific implementation.
@tdhintz Are you referring to something like
message WellKnownUUID {
  uint32 w1 = 1;
  uint32 w2 = 2;
  uint32 w3 = 3;
  uint32 w4 = 4;
}
There certainly isn't much variability to that structure! Perhaps @billpoole-mi would be kind enough to benchmark that approach?
@gmabey You need the six UUID elements defined as per RFC 4122 for this approach to work, because those six elements are all defined as unsigned integers; by defining the message this way, we avoid any endianness issues.
For example, I'm assuming the w2 element in your WellKnownUUID message would correspond to the time_mid and time_hi_and_version RFC 4122 fields, but it isn't specified whether time_hi_and_version occupies the high 16 bits or the low 16 bits.
It would of course be possible to specify how to read the two 16-bit values from w2 as part of the documentation of the WellKnownUUID message such that the converters to/from this type on each platform do so correctly. But if you're willing to move the responsibility for this into the converters, you might as well go all the way with it and define the message with two ulong fields.
You would then specify that time_low occupies the high 32 bits of the high 64-bit field, time_mid the next 16 bits, and time_hi_and_version the low 16 bits. You'd apply similar logic to the low 64-bit field.
This would likely result in a smaller message size (i.e. with 2 fields instead of 6), but carries the inconvenience of the converters having to deal with picking the 6 fields defined by RFC 4122 from the 2 64-bit ulong fields.
In the end, this is effectively defining the message as a 16-byte binary buffer and leaving it up to the converters to properly read/write the 6 values defined by RFC 4122 from/to the buffer.
@gmabey Yes, avoid use of arrays and strings (which really are just a specialized array).
I've now tested structuring the UUID message as two 64-bit fixed integers. The proto spec is below.
// A UUID, encoded in accordance with section 4.1.2 of RFC 4122.
message Uuid {
  // The high 64 bits of the UUID - MSB -> LSB: time_low (32 bits) | time_mid (16 bits) | time_hi_and_version (16 bits).
  fixed64 high64 = 1;
  // The low 64 bits of the UUID - MSB -> LSB: clock_seq_hi_and_reserved (8 bits) | clock_seq_low (8 bits) | node (48 bits).
  fixed64 low64 = 2;
}
This is encoded from a System.Guid in .NET as follows.
Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);
// MSB -> LSB: time_low (32 bits) | time_mid (16 bits) | time_hi_and_version (16 bits).
High64 = ((ulong)BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(0, 4)) << 32) // time_low
| ((ulong)BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(4, 2)) << 16) // time_mid
| BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(6, 2)); // time_hi_and_version
// MSB -> LSB: clock_seq_hi_and_reserved (8 bits) | clock_seq_low (8 bits) | node (48 bits).
Low64 = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8, 8));
It is converted back to a System.Guid as follows.
Span<byte> bytes = stackalloc byte[16];
BinaryPrimitives.WriteUInt32LittleEndian(bytes.Slice(0, 4), (uint)(High64 >> 32));
BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(4, 2), (ushort)((High64 >> 16) & 0xFFFF));
BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(6, 2), (ushort)(High64 & 0xFFFF));
BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8, 8), Low64);
return new Guid(bytes);
The Uuid message size is 18 bytes (as opposed to 27 bytes when defining the Uuid message with the 6 individual fields defined by RFC 4122).
The conversion/serialisation/deserialisation benchmarks are below.
| Method | Mean | Error | StdDev |
|---|---|---|---|
| ConvertToUuid | 8.825 ns | 0.0665 ns | 0.0622 ns |
| SerialiseUuid | 20.942 ns | 0.0891 ns | 0.0790 ns |
| DeserialiseUuid | 96.178 ns | 0.3735 ns | 0.3494 ns |
| ConvertFromUuid | 9.520 ns | 0.0578 ns | 0.0541 ns |
Convert & serialise is 29.767 ns and deserialise & convert is 105.698 ns.
So this approach is much faster and more efficient on the wire than defining the Uuid message with the 6 fields defined by RFC 4122.
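On the JVM this high64/low64 layout coincides exactly with `java.util.UUID`'s two halves, so no bit shuffling is needed, and the documented field positions can be checked directly against the UUID's version and variant accessors. A sketch (my own illustration, not from the thread):

```java
import java.util.UUID;

public class TwoFixed64Demo {
    public static void main(String[] args) {
        UUID uuid = UUID.randomUUID();
        long high64 = uuid.getMostSignificantBits();
        long low64 = uuid.getLeastSignificantBits();

        // time_hi_and_version sits in the low 16 bits of high64; its top
        // 4 bits carry the version number (4 for random UUIDs).
        long timeHiAndVersion = high64 & 0xFFFF;
        assert (timeHiAndVersion >>> 12) == uuid.version();

        // clock_seq_hi_and_reserved is the top byte of low64; its top two
        // bits carry the RFC 4122 variant (binary 10 for standard UUIDs).
        long clockSeqHiAndReserved = (low64 >>> 56) & 0xFF;
        assert (clockSeqHiAndReserved >>> 6) == 0b10;
    }
}
```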
For UUIDv4, the MSB can be either positive or negative, while the LSB is always negative. So shouldn't it be sfixed64?
fixed64 and sfixed64 are the same on the wire. The only difference is how their bits are interpreted by the sending/receiving endpoints.
In this case, the bits are interpreted by breaking the 64 bits into the components defined by section 4.1.2 of RFC 4122.
That is, the 64 bits are never interpreted as a positive 64-bit integer nor as a negative 64-bit integer. Therefore, it is fine to encode them either way.
However, since the sign is semantically irrelevant, I think it’s better to encode as fixed64. It also makes the code that writes the UUID to the message simpler in .NET.
Are there any plans for implementing this? I guess we have a lot of good examples and speed tests available, so it could easily be integrated.
We have no plans at this time to integrate this.
Is there a way to continue to petition for this feature? Or any way we can help to make it happen? Thanks, team...
P.S. Any help with how others are dealing with this would also be welcome. I'm programming in C#, currently just using a string with the variable name prepended with 'Guid' to help me know which are strings and which are GUIDs/UUIDs, but I'm running into problems often. Links, URLs, anything helpful, please :)
Most folks I have seen simply use a string for this.
I don't really see a path forward for this. The cost of adding this as a specific well known type is quite high as compared to having a third_party simply package their preferred proto with some helper functions.
I certainly just use a string (without curly braces) for this myself. The benefits that seem to be implied, were such an effort to be undertaken, are: speed, security (I guess), and interoperability. The argument that a third_party function could implement this could be used against most of the data types currently supported by WellKnownTypes, since you could serialize a date-time to a string and have a third_party function deserialize it.
Please do reply to this message if you dispute any of these points:
- UUID is well known.
- UUID is well defined.
- UUID is very common. (in my world haha)
- Protobuf messages would be "more well defined" (sigh, I wish there was a better way to say that) if a first-class (hrmm, I guess WellKnownTypes are second-class citizens) data type existed that standardized serialization/deserialization and conversion to/from platform specific classes (like in python).
- Point (4.) would make including UUIDs (as a member of a message) less error prone.
I would dispute points (3) and (4).
Point 3: Some quick code searches indicate that absl::Time is about 100x more common than our UUID class within Google's C++ codebase.
Point 4: Getting something cross language tends to nail down a bunch of painful corners in ways that are not helpful. The WellKnownType for time actually causes frequent impedance mismatches with language bindings that have slightly different concepts of time.
@fowles I don't dispute your rebuttal to Point 3, but I didn't define "very" -- wahoo! :-D
Regarding your Point 4 rebuttal -- do you see any "painful corners" associated with UUIDs? Or, are you just complaining about corner cases of Google.Protobuf.WellKnownTypes.Timestamp? (if so, wrong thread ;-)
I don't know UUIDs particularly well. It is possible they are simple enough that they won't hit such impedance mismatches. Regardless, point (3) alone is enough for me to continue to feel confident that this doesn't rise to the bar where we want to add it to the core of protobuf.
> Point 3: Some quick code searches indicate that absl::Time is about 100x more common than our UUID class within Google's C++ codebase.
Is protobuf meant for Google's use only? 🤔
Quick check on the node.js world: protobufjs has 7,194,982 weekly downloads, UUID has 59,615,024 weekly downloads (and it's only one implementation). UUID is a standard, rfc4122, and its increasing adoption has been doing wonders to increase interoperability and reliability in various areas.
Realistically speaking, for a team starting a new project, the fact that protobuf has no UUID support is more likely to result in the team not using protobuf, than not using UUID.
As for point 4, there would be no impedance mismatches, since it is a standard. Yes, I've been to that place, having code that uses milliseconds since epoch talking to code that uses seconds since epoch, but UUID is UUID; that's the whole point of its existence.
protobuf is intended for public use, but Google maintains full ownership of it and its evolution. The flip side is that Google also provides the vast majority of the maintenance cost of it.
As you note, UUIDs are a standard and libraries exist in most languages to parse them to and from strings. I would advise any group that wants to encode them in protobufs to use a bytes field. If you are starting a new project and are unwilling to accept that trade off, that is a totally reasonable choice for you to make.
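The bytes-field approach advised here is simple to implement. A rough JVM-side sketch (class and helper names are illustrative, and both endpoints must agree on network byte order, per the endianness caveat raised earlier in the thread):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidAsBytes {
    // Packs a UUID into the 16-byte network-order blob that would be
    // stored in a protobuf `bytes` field.
    public static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Unpacks the 16-byte blob back into a UUID.
    public static UUID fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new UUID(buffer.getLong(), buffer.getLong());
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        assert fromBytes(toBytes(original)).equals(original);
    }
}
```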
> Realistically speaking, for a team starting a new project, the fact that protobuf has no UUID support is more likely to result in the team not using protobuf, than not using UUID.
This isn't mutually exclusive. UUIDs can be freely represented as bytes, or encoded as hex values and written to strings (where a majority of observed use cases show up).
Adding an explicit UUID type, I guess, provides... validation as the primary feature request? I dunno what else you really need an explicit type here for. JSON doesn't have a UUID type (hell, at least protobuf has bytes, which JSON does not) and nobody stops using JSON for lack of "support".
Furthermore, consider the JSON/protobuf interop requirements... for JSON, they'll just end up as a string again, so what have we really done here?
> Furthermore, consider the JSON/protobuf interop requirements... for JSON, they'll just end up as a string again, so what have we really done here?
We're not using JSON in our project, but you're right that JSON provides no UUID support; I would add, though, that JSON doesn't provide support for data types like dates/times either.
From the sound of @fowles' comment, it seems like a cost thing, as Google provides the vast majority of the maintenance costs? Could it just need a financial push in that direction?
Anyway, looking forward to seeing how this might (or might not) resolve in the future. I'll continue to use a string or bytes field with the variable name prepended with Guid for now. Thanks to everyone for the input; it's been insightful.