velox icon indicating copy to clipboard operation
velox copied to clipboard

Add support for UUID comparison functions

Open BryanCutler opened this issue 1 year ago • 22 comments

This adds binary comparison functions <, >, <=, >= to the UUID custom data type. Equality functions are already present. Added unit tests for testing a query with comparisons between UUID values.

Also fixes the binary representation of UuidType to be a LE int128_t and adjusted Presto serde to put UUID values in the format that Presto Java expects, as 2 long values with msb first.

From https://github.com/facebookincubator/velox/issues/10584, https://github.com/prestodb/presto/issues/23311

BryanCutler avatar Aug 20 '24 23:08 BryanCutler

Deploy Preview for meta-velox canceled.

Name Link
Latest commit d1c96ae11624779351f6d2da93060d05f9efb434
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67801b69fef75a000826dc50

netlify[bot] avatar Aug 20 '24 23:08 netlify[bot]

Did you try running the queries in prestodb/presto#23301 with this logic ?

Yes, and the queries for <,>,<=,>= now pass. But it is a little strange, they only pass for me in debug mode, when I run in release mode I get this type of error:

not equal
Actual rows (1 of 1 extra rows shown, 6 rows in total):
    [f768f36d-4f09-4da7-a298-3564d8f3c986, 1fdff429-6f65-4370-bb2a-7ab660dbec92]
Expected rows (1 of 1 missing rows shown, 6 rows in total):
    [1fdff429-6f65-4370-bb2a-7ab660dbec92, f768f36d-4f09-4da7-a298-3564d8f3c986]

where it looks like one UUID is randomly generated, and I'm not sure where it's coming from. I'll look into that a bit more.

BryanCutler avatar Aug 22 '24 04:08 BryanCutler

@aditi-pandit I didn't realize the temp table from the test in https://github.com/prestodb/presto/pull/23301 included a randomly generated UUID with (cast(uuid() AS VARCHAR)). That was the reason for failing in release mode, without this random value all the comparison tests pass.

BryanCutler avatar Aug 22 '24 17:08 BryanCutler

@Yuhta @kagamiori @mbasmanova : Please can you help with the review.

aditi-pandit avatar Aug 29 '24 09:08 aditi-pandit

Gentle ping @Yuhta @kagamiori @mbasmanova to please review

BryanCutler avatar Sep 09 '24 04:09 BryanCutler

Thanks for reviewing @mbasmanova

I wonder where is the semantics of comparing UUIDs come from? Do you have a pointer? Would be nice to update PR description and documentation

I basically followed the comparison done in in Presto Java https://github.com/prestodb/presto/blob/97d88cfdebb553b5c7fb0e217b0211573df3d190/presto-common/src/main/java/com/facebook/presto/common/type/UuidType.java#L97

Since int128_t doesn't have support for atomic operations, we first check the higher bytes, then lower. Returning an int from the comparison allows it to be streamlined to do minimal checks. I will update the PR and documentation.

BryanCutler avatar Sep 10 '24 19:09 BryanCutler

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

mbasmanova avatar Sep 10 '24 19:09 mbasmanova

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

@mbasmanova I think it makes sense, just as you would compare values for std::byte, etc. Most UUID implementations support such operations, see https://www.boost.org/doc/libs/1_85_0/libs/uuid/doc/uuid.html#Operators

BryanCutler avatar Sep 10 '24 19:09 BryanCutler

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

@mbasmanova I think it makes sense, just as you would compare values for std::byte, etc. Most UUID implementations support such operations, see https://www.boost.org/doc/libs/1_85_0/libs/uuid/doc/uuid.html#Operators

@mbasmanova : I found UUID comparisons in the Java spec also https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html

My understanding is that a source generating UUIDs uses a combination of timestamp with the DNS address of the generating device. So there is some notion of ordering of UUIDs as a result.

aditi-pandit avatar Sep 12 '24 11:09 aditi-pandit

@mbasmanova just wanted to confirm if you are ok with adding this functionality? I still need to update the tests per your request and will get to that soon, thanks!

BryanCutler avatar Sep 18 '24 20:09 BryanCutler

@BryanCutler Yes, please, move ahead with addressing the comments. Thank you for clarifying the use case.

mbasmanova avatar Sep 18 '24 20:09 mbasmanova

@mbasmanova @aditi-pandit after looking into https://github.com/prestodb/presto/issues/23311 there are a couple issues with UUIDs in Velox:

  1. UUID data is being stored as big endian in the int128_t value, which comes from using boost:uuid during casting to/from strings. This causes comparisons like 11000000-0000-0022-0000-000000000000 < 22000000-0000-0011-0000-000000000000 to fail.
  2. Serialization of the int128_t value to an int128 Array Block results in [LSW, MSW], however the Presto Java UUIDType expects long values from the int128 Array Block to be [MSW, LSW].

For (1), since Velox stores UUID values with an int128_t, unlike boost::uuid which is a byte array, I think the value should be in little endian. This means instead of directly using memcpy bytes from the boost::uuid during casting, the bytes would need to swapped to get int128 in LE order. wdyt?

To fix (2), the word order would need to be switched during serialization. Would you prefer I fix that in this PR or open another?

BryanCutler avatar Sep 23 '24 18:09 BryanCutler

I feel the first fix should be to address is the issue of the incorrect results in https://github.com/prestodb/presto/issues/23311. Post that we can do this comparisons PR.

From your comments seems that requires 2 changes: i) Fixing the seriailization order between the native and java sides (2) above ii) Changing the uuid int128_t to be little endian in the copy from boost::uuid.

@mbasmanova : Using boost::uuid (instead of int128_t) to represent UUID in Velox could be cleaner given these issues. We could make UUID TypeKind::VARBINARY and use the boost::uuid bytes to represent it. wdyt ?

@BryanCutler : If we did that, would we need the serialization changes from Native to Java side as well ? My intuition is yes, but I could be wrong.

@Yuhta : Jimmy, would be great to hear your thoughts as well.

aditi-pandit avatar Sep 25 '24 17:09 aditi-pandit

Thanks @aditi-pandit , if we did change the storage from int128_t to a byte array in boost::uuid, then we would still need to adjust the serialization to Java, since Presto expects an Int128ArrayBlock.

BryanCutler avatar Sep 25 '24 18:09 BryanCutler

@aditi-pandit I don't think VARBINARY is the right physical type for UUID, it is meant for variable length data, while UUID is fixed length. I would say just keep using int128 and fix the endianness so the comparison between UUID is the same as comparison between int128.

Yuhta avatar Oct 01 '24 21:10 Yuhta

Thanks @Yuhta for your response. I do see a pattern of using VARBINARY for fixed lengths (in IPPREFIX type) but that has length 17 bytes which can't be captured with any integral type. For UUID we do have the option of using int128_t.

@BryanCutler : Let's go with Jimmy's suggestion. Will make our implementation simpler as well.

aditi-pandit avatar Oct 01 '24 21:10 aditi-pandit

Thanks @Yuhta and @aditi-pandit , I have fixed the serialization issue here and added a unit test, I can put that in a separate PR if preferred. I will also test these changes again with https://github.com/prestodb/presto/issues/23311

BryanCutler avatar Oct 07 '24 18:10 BryanCutler

After testing the comparisons with Presto Java in https://github.com/prestodb/presto/issues/23311 there are some different results because Presto Java uses Slice.compareTo() which compares values by swapping bytes of each long value, see https://github.com/airlift/slice/blob/0.38/src/main/java/io/airlift/slice/Slice.java#L1336. So the values are ordered according to the LSB of each long value, which seems a little strange and I'm not sure if that should be changed.

** moved the serializer fix to https://github.com/facebookincubator/velox/pull/11197, will update this pr soon ** updated - the underlying issue in Presto Java is an additional byte swap, not an overflow from https://bugs.openjdk.org/browse/JDK-7025832

BryanCutler avatar Oct 08 '24 20:10 BryanCutler

Just compare (uint128_t)lhs and (uint128_t)rhs

Thanks @Yuhta , I can look into that but there is a bigger concern that this type of ordering will not match Presto Java, see above https://github.com/facebookincubator/velox/pull/10791#issuecomment-2400723075.

Presto Java UuidOperations use Slice.compareTo() which follows a strange ordering based on swapping bytes. As far as I can tell this ordering was changed in the Slice library 2.0+, which uses Arrays.compareUnsigned and I don't believe it would have the same problem, but not sure upgrading that is planned in the near-term.

I'm not sure if we should try to follow Presto Java with the comparisons or use the spec at RFC-4122 to order lexicographically, e.g. comparing uint128_t in little endian, and differ from Presto Java. Please share recommendations on this @Yuhta @aditi-pandit

BryanCutler avatar Oct 16 '24 18:10 BryanCutler

There is a recent related PR cherry-picked from trindodb, this might address some part of the Presto Java issue https://github.com/prestodb/presto/pull/17939.

BryanCutler avatar Oct 16 '24 18:10 BryanCutler

@BryanCutler We should ask Presto Java community to decide the ordering of UUID and document them officially first.

Yuhta avatar Oct 16 '24 18:10 Yuhta

@BryanCutler We should ask Presto Java community to decide the ordering of UUID and document them officially first.

There is a PR up at https://github.com/prestodb/presto/pull/23847 that corrects Presto Java to conform to the IETF RFC 4122 definition of ordering UUIDs. When that is resolved, this PR can be updated to match.

BryanCutler avatar Oct 21 '24 23:10 BryanCutler

Status: PR https://github.com/prestodb/presto/pull/23847 has been merged, and I have updated the comparisons to use a cast to uint128_t, and this includes the serialization commit from https://github.com/facebookincubator/velox/pull/10791 which stores the UUID value as little endian. Once that PR is merged, it can be dropped from here.

I have also checked these changes again to ensure the Presto Java tests from https://github.com/prestodb/presto/pull/23301

BryanCutler avatar Nov 07 '24 19:11 BryanCutler

Finished rebasing after fixing serialzation issue at https://github.com/facebookincubator/velox/pull/10791

BryanCutler avatar Nov 21 '24 23:11 BryanCutler

@Yuhta : Please can you help further review and merge.

aditi-pandit avatar Dec 03 '24 15:12 aditi-pandit

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Dec 03 '24 16:12 facebook-github-bot

@BryanCutler Can you make sure the failed tests is related or not? https://github.com/facebookincubator/velox/actions/runs/12128994244/job/33818706033?fbclid=IwZXh0bgNhZW0CMTEAAR13mXks2oVNMArutfUoHnSSEWNU9OqKgAzKSe8NEcDoWjHe8BNZUhElEXc_aem_DC2AdtBRx0Wvfd-JJvU_HQ

Yuhta avatar Dec 03 '24 17:12 Yuhta

It seems like one of the problems is the biased expression fuzzer test tries to create a hive table with IPADDRESS or UUID as a type, and they are not supported so the reference query fails.

Locally, this is one of the errors I see

I20241210 12:16:31.433363 354597 PrestoQueryRunner.cpp:804] Execute presto sql: 
CREATE TABLE tmp(c0, c1, row_number) WITH (format = 'DWRF') AS SELECT cast(null as IPADDRESS), cast(null as BOOLEAN), cast(null as BIGINT)
E20241210 12:16:31.447225 354597 Exceptions.h:66] Line: /home/bryan/git/velox/velox/exec/fuzzer/PrestoQueryRunner.cpp:98, Function:throwIfFailed, Expression:  
Presto query failed: 65536 Unsupported type: ipaddress, Source: RUNTIME, ErrorCode: INVALID_STATE

The test might have to be adjusted for these types, I'll look into what we can do about it.

BryanCutler avatar Dec 10 '24 20:12 BryanCutler

Another PR has failed biased expression test, not quite the same error since testing different functions https://github.com/facebookincubator/velox/actions/runs/12279548356/job/34266985156?pr=11612

BryanCutler avatar Dec 11 '24 23:12 BryanCutler

Similar to https://github.com/facebookincubator/velox/pull/11820, I marked UUID unsupported for PrestoQueryRunner because it will try a create a HIVE table with the unsupported type.

There is still a biased expression fuzzer failure, but I can reproduce that on main branch locally too. I can look into that more but seems most likely unrelated.

BryanCutler avatar Dec 16 '24 16:12 BryanCutler