protobuf icon indicating copy to clipboard operation
protobuf copied to clipboard

Provide support for UUID type (a.k.a. GUID)

Open jtattermusch opened this issue 9 years ago • 61 comments

Filing on behalf of a customer: Protobuf lacks Uuid (Guid in .NET) support out of the box. It would have been nice to have a Well-Known Type (like we do with Timestamp to represent Date and Times) since Uuids are pretty common, particularly in distributed systems.

jtattermusch avatar Oct 06 '16 07:10 jtattermusch

CC @jskeet who was involved in the discussion.

jtattermusch avatar Oct 06 '16 07:10 jtattermusch

Is this still on the roadmap?

franchyd avatar Aug 02 '18 19:08 franchyd

No, this is not on our roadmap.

xfxyjwf avatar Aug 02 '18 20:08 xfxyjwf

Any news?

listepo avatar Aug 26 '19 23:08 listepo

No, this is not on our roadmap.

Why not?

mihaimyh avatar Nov 01 '19 09:11 mihaimyh

Regardless of whether this is on the roadmap or not, I can see two possible designs:

Option 1

// Message representing a version 4 universally unique identifier. See
// rfc/4122#section-4.4 for additional information.
message UUID {
  // The two int64s below, should be populated with the most and least
  // significant 64 bits of a version 4 UUID.
  // (e.g., https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html).
  uint64 most_significant_uuid_bits = 1
  uint64 least_significant_uuid_bits = 2
}

Option 2

message UUID {
  string value = 1
}

jtattermusch avatar Nov 01 '19 12:11 jtattermusch

the UUID represented by two uint64 values will have problem with endianness: see how-do-i-represent-a-uuid-in-a-protobuf-message discussion.

kucint avatar Jan 03 '20 12:01 kucint

doesn't RFC4122 section 4.1.2 present a solution to the problem identified by @kucint ?

gmabey avatar Feb 19 '20 20:02 gmabey

bump

onesteveo avatar Sep 03 '20 00:09 onesteveo

I think @gmabey is correct in that RFC 4122 section 4.1.2 presents a solution to allow a UUID to be encoded in binary (as opposed to text) and allow the endianness to be handled by the protobuf encoding layer (as opposed to at the application layer). This approach would have a proto-spec like below.

// A UUID, encoded in accordance with section 4.1.2 of RFC 4122.
message Uuid {
	// The low field of the timestamp (32 bits).
	fixed32 time_low = 1;

	// The middle field of the timestamp (16 bits).
	uint32 time_mid = 2;

	// The high field of the timestamp multiplexed with the version number (16 bits).
	uint32 time_hi_and_version = 3;

	// The high field of the clock sequence multiplexed with the variant (8 bits).
	uint32 clock_seq_hi_and_reserved = 4;

	// The low field of the clock sequence (8 bits).
	uint32 clock_seq_low = 5;

	// The spatially unique node identifier (48 bits).
	uint64 node = 6;
}

This would be encoded from a System.Guid in .NET/C# as follows.

Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);
TimeLow = BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(0, 4));
TimeMid = BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(4, 2));
TimeHiAndVersion = BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(6, 2));
ClockSeqHiAndReserved = bytes[8];
ClockSeqLow = bytes[9];
Node = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8, 8)) & 0x0000FFFFFFFFFFFF;

... and decoded as follows.

checked
{
	Span<byte> bytes = stackalloc byte[16];
	BinaryPrimitives.WriteUInt32LittleEndian(bytes.Slice(0, 4), TimeLow);
	BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(4, 2), (ushort)TimeMid);
	BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(6, 2), (ushort)TimeHiAndVersion);
	BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8), Node);
	bytes[8] = (byte)ClockSeqHiAndReserved;
	bytes[9] = (byte)ClockSeqLow;
	return new Guid(bytes);
}

I'm keen to get people's thoughts and feedback on this approach.

bill-poole avatar Jan 07 '21 11:01 bill-poole

I've just done some benchmarking of string-encoded vs little endian byte array-encoded vs RFC 4122-encoded UUIDs in .NET 5 and the results are below.

Method Mean Error StdDev
ConvertToStringUuid 52.498 ns 0.2467 ns 0.2308 ns
ConvertToLittleEndianBinaryUuid 64.575 ns 0.2746 ns 0.2293 ns
ConvertToRfc4122Uuid 10.849 ns 0.0662 ns 0.0620 ns
SerialiseStringUuid 54.187 ns 0.4128 ns 0.5511 ns
SerialiseLittleEndianByteArrayUuid 25.091 ns 0.1539 ns 0.1364 ns
SerialiseRfc4122Uuid 61.701 ns 0.4367 ns 0.4085 ns
DeserialiseStringUuid 224.519 ns 1.2474 ns 1.1058 ns
DeserialiseLittleEndianByteArrayUuid 217.149 ns 0.9792 ns 0.8177 ns
DeserialiseRfc4122Uuid 150.113 ns 0.4288 ns 0.3801 ns
ConvertFromStringUuid 82.985 ns 0.2628 ns 0.2459 ns
ConvertFromLittleEndianByteArrayUuid 2.475 ns 0.0123 ns 0.0115 ns
ConvertFromRfc4122Uuid 9.682 ns 0.0210 ns 0.0186 ns

The aggregate performance of the three approaches is:

  • Convert and serialise StringUuid: 106.685 ns
  • Convert and serialise LittleEndianBinaryUuid: 89.666 ns
  • Convert and serialise Rfc4122Uuid: 72.55 ns
  • Deserialise and convert StringUuid: 307.504 ns
  • Deserialise and convert LittleEndianBinaryUuid: 219.624 ns
  • Deserialise and convert Rfc4122Uuid: 159.795 ns

So the RFC 4122-based representation is fastest in both serialisation and deserialisation.

However, the StringUuid serialises to 38 bytes, the LittleEndianBinaryUuid to 18 bytes and the Rfc4122Uuid 27 bytes - according to the CalculateSize() method on each message type. So, while the RFC 4122-based encoding is faster, it is 50% larger than the little endian binary encoding on the wire.

Note also ByteString.UnsafeWrap (see #7645) will improve the ConvertToLittleEndianBinaryUuid performance when it is available.

bill-poole avatar Jan 11 '21 16:01 bill-poole

Neither string or byte array are good solutions from a security perspective because they can be abused in certain kinds of DOS or fuzzing attacks. I like the idea of a specific implementation.

tdhintz avatar Jan 12 '21 22:01 tdhintz

@tdhintz Are you referring to something like

message WellKnownUUID {
    uint32 w1 = 1;
    uint32 w2 = 2;
    uint32 w3 = 3;
    uint32 w4 = 4;
}

There certainly isn't much variability to that structure! Perhaps @billpoole-mi would be kind enough to benchmark that approach?

gmabey avatar Jan 13 '21 15:01 gmabey

@gmabey you need to have the 6 UUID elements defined as per RFC 4122 for this approach to work because those 6 elements are all defined as unsigned integers and therefore by defining the message this way, we avoid any endianness issues.

For example, I'm assuming the w2 element in your WellKnownUUID message would correspond to the time_mid and time_hi_and_version RFC 4122 fields, but it isn't specified whether the high 16 bits are the time_hi_and_version or the low 16 bits.

It would of course be possible to specify how to read the two 16-bit values from w2 as part of the documentation of the WellKnownUUID message such that the converters to/from this type on each platform do so correctly. But if you're willing to move the responsibility for this into the converters, you might as well go all the way with it and define the message with two ulong fields.

You would then specify the first ulong value is time_low in the high 32 bits of the high 64-bit field, then time_mid in the high 16 bits of the low 32 bits of the high 64-bit field and time_hi_and_version in the low 16 bits of the high 64-bit field. You'd apply similar logic to the low 64-bit field.

This would likely result in a smaller message size (i.e. with 2 fields instead of 6), but carries the inconvenience of the converters having to deal with picking the 6 fields defined by RFC 4122 from the 2 64-bit ulong fields.

In the end, this is effectively defining the message as a 16-byte binary buffer and leaving it up to the converters to properly read/write the 6 values defined by RFC 4122 from/to the buffer.

bill-poole avatar Jan 14 '21 05:01 bill-poole

@gmabey Yes, avoid use of arrays and strings (which really are just a specialized array).

tdhintz avatar Jan 14 '21 08:01 tdhintz

I've now tested structuring the UUID message as two 64-bit fixed integers. The proto spec is below.

// A UUID, encoded in accordance with section 4.1.2 of RFC 4122.
message Uuid {
	// The high 64 bits of the UUID - MSB -> LSB: time_low (32 bits) | time_mid (16 bits) | time_hi_and_version (16 bits).
	fixed64 high64 = 1;

	// The low 64 bits of the UUID - MSB -> LSB: clock_seq_hi_and_reserved (8 bits) | clock_seq_low (8 bits) | node (48 bits).
	fixed64 low64 = 2;
}

This is encoded from a System.Guid in .NET as follows.

Span<byte> bytes = stackalloc byte[16];
guid.TryWriteBytes(bytes);

// MSB -> LSB: time_low (32 bits) | time_mid (16 bits) | time_hi_and_version (16 bits).
High64 = ((ulong)BinaryPrimitives.ReadUInt32LittleEndian(bytes.Slice(0, 4)) << 32) // time_low
	| ((ulong)BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(4, 2)) << 16) // time_mid
	| BinaryPrimitives.ReadUInt16LittleEndian(bytes.Slice(6, 2)); // time_hi_and_version

// MSB -> LSB: clock_seq_hi_and_reserved (8 bits) | clock_seq_low (8 bits) | node (48 bits).
Low64 = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(8, 8));

It is converted back to a System.Guid as follows.

Span<byte> bytes = stackalloc byte[16];
BinaryPrimitives.WriteUInt32LittleEndian(bytes.Slice(0, 4), (uint)(High64 >> 32));
BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(4, 2), (ushort)((High64 >> 16) & 0xFFFF));
BinaryPrimitives.WriteUInt16LittleEndian(bytes.Slice(6, 2), (ushort)(High64 & 0xFFFF));
BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8, 8), Low64);
return new Guid(bytes);

The Uuid message size is 18 bytes (as opposed to 27 bytes when defining the Uuid message with the 6 individual fields defined by RFC 4122).

The conversion/serialisation/deserialisation benchmarks are below.

Method Mean Error StdDev
ConvertToUuid 8.825 ns 0.0665 ns 0.0622 ns
SerialiseUuid 20.942 ns 0.0891 ns 0.0790 ns
DeserialiseUuid 96.178 ns 0.3735 ns 0.3494 ns
ConvertFromUuid 9.520 ns 0.0578 ns 0.0541 ns

Convert & serialise is 29.767 ns and deserialise & convert is 105.698 ns.

So this approach is much faster and more efficient on the wire than defining the Uuid message with the 6 fields defined by RFC 4122.

bill-poole avatar Jan 15 '21 04:01 bill-poole

For UUIDv4, MSB is both positive and negative, while the LSB is always negative. So, shouldn't it be sfixed64?

singhbaljit avatar Apr 30 '21 19:04 singhbaljit

fixed64 and sfixed64 are the same on the wire. The only difference is how their bits are interpreted by the sending/receiving endpoints.

In this case, the bits are interpreted by breaking the 64 bits into the components defined by section 4.1.2 of RFC 4122.

That is, the 64 bits are never interpreted as a positive 64-but integer nor a negative 64-bit integer. Therefore, it is fine to encode it either way.

However, since the sign is semantically irrelevant, I think it’s better to encode as fixed64. It also makes the code that writes the UUID to the message simpler in .NET.

bill-poole avatar May 02 '21 04:05 bill-poole

Are there any plans for implementing this? I guess we have a lot of good examples and speed tests available so it could be easy integrated.

AtosNicoS avatar Jul 29 '21 11:07 AtosNicoS

We have no plans at this time to integrate this.

perezd avatar Jul 29 '21 16:07 perezd

Is there a way to continue to petition for this feature? Or anyways we can help to make it happen? Thanks team...

p.s. Any help in how others are dealing with this would be welcomed also. Programming in C#, currently just using a string and the variable name is prepended with 'Guid' to help me know which are Strings and which are GUID/UUID but I'm running into problems often. Links, URLs anything helpful, please :)

kibblewhite avatar Apr 18 '22 01:04 kibblewhite

Most folks I have seen simply use a string for this.

I don't really see a path forward for this. The cost of adding this as a specific well known type is quite high as compared to having a third_party simply package their preferred proto with some helper functions.

fowles avatar Apr 18 '22 17:04 fowles

I certainly just use a string (without curly braces) for this myself. The benefits that seem to be implied were such an effort to be undertaken are: speed, security (I guess), and interoperability. The argument that a third_party function could implement this could be used against most of the data types currently supported by WellKnownTypes -- since you could serialize a date time to string and have a third_party function deserialize it.

Please do reply to this message if you dispute any of these points:

  1. UUID is well known.
  2. UUID is well defined.
  3. UUID is very common. (in my world haha)
  4. Protobuf messages would be "more well defined" (sigh, I wish there was a better way to say that) if a first-class (hrmm, I guess WellKnownTypes are second-class citizens) data type existed that standardized serialization/deserialization and conversion to/from platform specific classes (like in python).
  5. Point (4.) would make including UUIDs (as a member of a message) less error prone.

gmabey avatar Apr 18 '22 18:04 gmabey

I would dispute points (3) and (4).

Point 3: Some quick searches for code search indicates that absl::Time is about 100x more common than our UUID class within google's C++ codebase.

Point 4: Getting something cross language tends to nail down a bunch of painful corners in ways that are not helpful. The WellKnownType for time actually causes frequent impedance mismatches with language bindings that have slightly different concepts of time.

fowles avatar Apr 18 '22 18:04 fowles

@fowles I don't dispute your rebuttal to Point 3, but I didn't define "very" -- wahoo! :-D

Regarding your Point 4 rebuttal -- do you see any "painful corners" associated with UUIDs? Or, are you just complaining about corner cases of Google.Protobuf.WellKnownTypes.Timestamp? (if so, wrong thread ;-)

gmabey avatar Apr 18 '22 19:04 gmabey

I don't know UUIDs particularly well. It is possible they are simpler enough that they won't hit such impedance mismatches. Regardless, point (3) alone is enough for me to continue to feel confident that this doesn't rise to the bar where we want to add it to the core of protobuffer.

fowles avatar Apr 18 '22 19:04 fowles

Point 3: Some quick searches for code search indicates that absl::Time is about 100x more common than our UUID class within google's C++ codebase.

Is protobuf meant for Google's use only? 🤔

Quick check on the node.js world: protobufjs has 7,194,982 weekly downloads, UUID has 59,615,024 weekly downloads (and it's only one implementation). UUID is a standard, rfc4122, and its increasing adoption has been doing wonders to increase interoperability and reliability in various areas.

Realistically speaking, for a team starting a new project, the fact that protobuf has no UUID support is more likely to result in the team not using protobuf, than not using UUID.

As for point 4, there would be no impedance mismatches, since it is a standard. Yes, I've been to that place, having code that uses miliseconds since epoch talking to code that uses seconds since epoch, but UUID is UUID, that's the whole point of its existence.

lalomartins avatar Apr 18 '22 19:04 lalomartins

protobuf is intended for public use, but Google maintains full ownership of it and its evolution. The flip side is that Google also provides the vast majority of the maintenance cost of it.

As you note, UUIDs are a standard and libraries exist in most languages to parse them to and from strings. I would advise any group that wants to encode them in protobufs to use a bytes field. If you are starting a new project and are unwilling to accept that trade off, that is a totally reasonable choice for you to make.

fowles avatar Apr 18 '22 19:04 fowles

Realistically speaking, for a team starting a new project, the fact that protobuf has no UUID support is more likely to result in the team not using protobuf, than not using UUID.

This isn't mutually exclusive. UUIDs are able to be freely represented as bytes or encoded as hex values and written to strings (where they a majority of observed use cases show up).

adding an explicit UUID type I guess provides...validation as the primary feature request? I dunno what else you really need an explicit type here for. JSON doesn't have a UUID type (hell, at least protobuf has bytes which JSON does not) and nobody stops using JSON for lack of "support".

Futhermore, consider the JSON/protobuf interop requirements....for JSON, they'll just end up as a string again, so what have we really done here?

perezd avatar Apr 18 '22 20:04 perezd

Futhermore, consider the JSON/protobuf interop requirements....for JSON, they'll just end up as a string again, so what have we really done here?

Not using JSON in our project, but also you are right in saying that JSON provides no UUID support, but I would like to add that JSON doesn't add any support for data types like date/times/etc...?

From the sounds of things from fowles comment, it seems like a cost thing as Google provides the vast majority of the maintenance costs? Could it just need a financial push in that direction?

Anyways, looking forwards to seeing how this might (or not) resolve in the future. I'll continue to use a string or bytes field with the variable name prepended with Guid for now. Thanks to everyone for the input, it's been insightful.

kibblewhite avatar Apr 20 '22 12:04 kibblewhite