phoenix Phoenix-6742 Add UUID type

As described in the jira https://issues.apache.org/jira/browse/PHOENIX-6742 , these are the changes I see necessary to implement the UUID type. The tests I made have worked correctly, but I have some doubts that should be resolved before doing the merge.

The UUID type is added along with three functions:

-STR_TO_UUID(string) -> UUID -UUID_TO_TO_STR(UUID) -> String -UUID_RAND() -> UUID

The type is called UUID which will be mapped to a 16 bytes (more compact and more efficient than mapping to a string as: "f90b6b50-f1b5-459b-b4dd-d6da4ddb5655")

The problem is that this type, the UUID , can contain 0x00 or 0xFF. Therefore, when you want to create an index with this type, due to the limitation that non-nulls cannot have a byte with 0x00 or 0xFF. non-nulls cannot have a fixed size in the primary key (in the index), you have to transform it to another type.

them, when making index with UUID, it is shitched to another data type: UUID_INDEXABLE (very similar to VARBINARY). But in addition, UUID_INDEXABLE cannot contain neither 0x00 nor 0xff in its bytes.

Therefore, what we do with UUID_INDEXABLE , is to transform the UUID to a string of type "f90b6b50-f1b5-459b-b4dd-d6da4ddb5655" in which there will be neither 0x00 or 0xFF (and the length will always be 36 even if the class is defined as fixedWidth = false).

I had developed an algorithm that compact UUIDs to 18 bytes without 0x00 or 0xff, but it broke the order within HBase (it do not sort UUIDS properly in a "select uuid from dummy order by uuid"). I guess it's a big enough weakness to uncheck it; but if there was some algorithm to do it correctly it's as easy as implementing it in the org.apache.phoenix.util.UUIDUtil class. If you want to see the algorithm I am referring to (with its explanation) I can provide it.

This data type, UUID_INDEXABLE , must be kept hidden from client programs; and if it can be it can be somehow forbidden (I haven't found a way) to use it in a direct DDL, it would be ideal. a direct DDL, it would be ideal.

"UUID ARRAY" is also implemented.

There is no need to implement a "UUID_INDEXABLE ARRAY" because I haven't found any use case in which it could be given; because it would only make sense in the primary key as the last value (being an array); and if that were the need, you have to use UUID ARRAY.

I have some internal doubts: 1)Is this logic correct (is it the one implemented)? assertTrue(PUUID.INSTANCE.isCoercibleTo(PUUIDIndexable.INSTANCE)); assertFalse(PUUIDIndexable.INSTANCE.isCoercibleTo(PUUID.INSTANCE));

or it should be

    assertTrue(PUUID.INSTANCE.isCoercibleTo(PUUIDIndexable.INSTANCE));
    assertTrue(PUUIDIndexable.INSTANCE.isCoercibleTo(PUUID.INSTANCE));

2)In reference to the sqlType I have put: PUUID -> 2100 PUUIDIndexable -> 2101

should be other numbers?

2)the ordinal, when inserting the UUID and the PUUIDIndexable I have had to "shift down" all the Array types. To do so, I created a list of costants in order to to facilitate the realization of this reordering now, and possible future changes of order. But for this I have had to make minimal changes in all the type classes.

For example: ... private PVarchar() { super("VARCHAR", Types.VARCHAR, String.class, null, PDataType.ORDINAL_VARCHAR); } ... (this is the reason why there are modifications in the 48 types classes. But it only consists of changing the numerical value by the costant defined in PDataType)

in org.apache.phoenix.jdbc.PhoenixPreparedStatement and in org.apache.phoenix.jdbc.PhoenixResultSet I have added a set and a get UUID respectively. The client would have the option to do: ... if (ps instanceof PhoenixPreparedStatement) { PhoenixPreparedStatement phops = (PhoenixPreparedStatement) ps; phops.setUUID(1, uuidRecord1PK); //Compiler error time compiler error time } else { ps.setObject(1, uuidRecord1PK); //Runtime error detection } ....
```
 	if (rs instanceof PhoenixResultSet) {
         PhoenixResultSet phors = (PhoenixResultSet) rs;
         UUID temp=phors.getUUID(1); //Compiler runtime error
```

compiler time error } else { UUID temp=(UUID)phors.getObject(1); //Runtime error detection }

....

The possible drawback that I see is that since it is neither in java.sql.ResultSet nor in java.sqlPreparedStatement, it "breaks" the standard. standard.

The test "org.apache.phoenix.expression.function.UUIDFunctionTest", test name "testUUIDFunctions" I've built it based on "org.apache.phoenix.expression.function.InstrFunctionTest" test name. "testInstrFunction". I have seen that they test with both SortOrder.ASC and SortOrder.DESC. In the tests for UUID, if I use SortOrder.ASC it gives no problem no problems; but if I use SortOrder.DESC it gives me strange behaviors strange behaviors that I can't explain. I don't know if SortOrder.DESC makes sense in this case or if I'm missing something that I don't see.
in general I don't know if I have done many or few tests.

Jul 06 '22 09:07 alejandro-anadon

Hi some high level questions first.

So the use case where I've seen UUIDs used most is for fast unique client side generated, server side validated generation. In general, the is somewhat simple use case as you can use a uuid generator in a strin gor similar and attempt to push it. The part where Phoenix isn't great is guaranteeing uniqueness for this rowid on the server, there are hacky solutions I'm aware of to achieve this but due to use of UPSERT and lack of data coming back I think there is some difficulty differentiating a put from a update. Does this solve/describe the type of use cases you are interested in? If so I'm thinking a PK constraint as opposed to an actual data type might solve this better. Thoughts? Also maybe @tkhurana might chime in as I think he has good knowledge around some of the details of similar use cases.

Note, for a fixed length type having 0x00 or 0xFF as part of the rowkey should be fine for indexing etc as far as i'm aware.

Jul 07 '22 23:07 dbwong

hi,

if I may, I will start with the last sentence:"for a fixed length type having 0x00 or 0xFF as part of the rowkey should be fine for indexing".

This is 100% correct. But the problem is when you want to make an index with a fixed length type that is not part of the rowkey.

Since it is not part of the rowkey, it can be null.

And there is the need to create a variable length type, PUUIDIndexable (and that can not have neither 0x00 or 0xFF) although when it comes to the truth, it will always have the same length; but it has to be declared as a variable length.

this translate is performed in the class "org.apache.phoenix.util.org.apache.phoenix.util" in the method "getIndexColumnDataType":

... // Since we cannot have nullable fixed length in a row key // we need to translate to variable length. The verification that we have a valid index // row key was already done, so here we just need to convert from one built-in type to // another. public static PDataType getIndexColumnDataType(boolean isNullable, PDataType dataType) { ....

For example, the INTEGER type is translated to DECIMAL when you want to make an index and it is not PK. look at the following example with the field "NOT_IN_PK". In the table it is of type INTEGER, but in the index it is translated to type DECIMAL:

In the case of UUID, it would be exactly the same:

if it is in PK (NOT NULL) or it is a field outside the PK (NULLABLE), it would be of type UUID.
But if we want to make an index on the field that is not PK, the index is translated to UUID_INDEXABLE:

This would be the only (but very important) case in which the UUID_INDEXABLE type would be used and it would be at internal level. What would have to be done is not to show (or not to recommend its use) to the public.

Going back to the beginning of the comment, in which you say that when you have seen UUIDs used, what is done is to use "a uuid generator in a string or similar and attempt to push it". it is true that it can be done that way.

But that can be done with all data types. You can always convert, for example, an integer to String with 'Integer.toString(i)' and do a push. And when retrieved do an 'Integer.parseInt(rs.getString(1))'.

Why not do that with integers, doubles, etc.? because:

firstly prevents errors of type 'ps.setString(1,Integer.parseInt("not a number"))' making it much safer and with fewer errors than with a 'ps.setInt(1,i)'.
it is much more convenient for the programmer not to have to deal with conversions from one type to another.

The question is: Does UUID have enough entity to have its own type? Since this type is being more and more widely used and many databases are implementing it (for example, https://www.postgresql.org/docs/current/datatype-uuid.html), in my opinion it does.

Regarding the use case you are asking about, I have given as an example the use as PK. But there are more interesting use cases.

Some examples are (text extracted from https://www.mysqltutorial.org/mysql-uuid/ ):

"[...]Using UUID for a primary key  brings the following advantages:

UUID values are unique across tables, databases, and even servers that allow you to merge rows from different databases 
	or distribute databases across servers.
UUID values do not expose the information about your data so they are safer to use in a URL. For example, if a customer 
	with id 10 accesses his account via http://www.example.com/customers/10/ URL, it is easy to guess that there is a 
	customer 11, 12, etc., and this could be a target for an attack.
UUID values can be generated anywhere that avoid a round trip to the database server. It also simplifies logic in the application. 
	For example, to insert data into a parent table and child tables, you have to insert into the parent table first, 
	get generated id and then insert data into the child tables. By using UUID, you can generate the primary key value
	of the parent table up front and insert rows into both parent and child tables at the same time within a transaction."

Of course it has some cons, but I think the benefits outweigh the cons.

The most difficult part to solve is the possibility of a UUID collision at the moment of its generation and that it coincides with one that is in the database. in Phoenix.

It would be perfect if Phoenix would include something that guarantees that uniqueness when doing the upsert. As you say, to differentiate a "put" from an "update". But that is far from my knowledge. What I can say is that this dilemma occurs with any table that is created in phoenix:

... CREATE TABLE DUMMY_WITH_CHAR (ID CHAR(16) NOT NULL, NOT_IN_PK CHAR(16), CONSTRAINT NAME_PK PRIMARY KEY(ID)); ...

Is there any mechanism (except a select prior to the upsert) that when doing an upsert guarantees me that the id does not exist previously? to the best of my knowledge no. (I guess that's why there is "upsert" (insert or update) and not "insert" or "update").

Returning to the subject of collisions, while they are mathematically possible, as well as a possible SHA256 collision used by banks, the chances of collision in UUIDs are close to zero.

Considering the scope of use of Phoenix, the consequences of a collision are assumable. I mean: if I had to create a military, medical or banking system, where the repercussions of a UUID collision would be severe, I would not use Phoenix. But if I use Phoenix for systems with huge masses of data (which is its ultimate goal), a remote UUID collision, I would consider it acceptable. And if in the end I decide to use it and a collision is critical, I would do a select prior to upsert.

Jul 12 '22 09:07 alejandro-anadon

Thanks for some of the responses. I think we are talking about the same thing but let me elaborate to be clear. The UUID type is fixed length. In order to handle NULL SQL values in 2ndary indexing the type has to be rewriten to variable length type in order to write NULL. This necessitated the introduction of the concept of UUID indexable internal type. It is fine for the fixed width type to contain 0x00 in the ascending or 0xFF in the descending but not allowed in variable length type. You had an initial approach via disallowing the use of 0x00 (0xFF in desc) initially by modification of the UUID algorithm that you mentioned and thus the idea of not allowing those in the initial type but decided with the Byte[] (UUID) -> Varchar (UUID indexable) approach. Is this the correct summary?

I'm okay with the risk of collision as long as it is documented. Though work by Tanuj recently around atomic upsert there would be a way to try and enforce uniqueness at upsert time but I agree it is not necessary for this feature.

So I think I understand the SQL changes which is the introduction of the type and the addition of 2 consistent functions and 1 random function. 2 additional to JDBC apis plus some additional type information from metadata APIs.

On the desc vs ascending tests can you describe the weirdness?

Jul 12 '22 21:07 dbwong

Thanks for the time and atention.

yes. Your summary is correct.

Just to point out that I also added a UUID ARRAY type, but not a UUID_INDEXABLE ARRAY because I did not see a use case.

I would like to see a few points:

For the UUIDIndexable I tried to develop an algorithm (more complex than a simple .toString()) to make the UUID more compact. And I succeeded. It stayed at 16 bytes if there was no 0x00 or 0xFF or at 19 bytes if there was any 0x00 or 0xFF. It worked. But I had to reject it because it lost the (byte) order within the hbase when it is part of the primary key making the "select" with an "order by" not work.

With the .tosTring() I get that there is neither 0x00 nor 0xFF and also the order is maintained (but it is less compact and perhaps slower). If you see fit I can go deeper in the explanation of this algorithm and attach it but I think that having rejected it is not appropriate to do it here.

Regarding the descending vs ascending problem, I could solve it (or so I think). In my first commit I didn't (mistakenly) take into consideration that the keys or indexes could have a DESC; then everything worked until I started to use the DESC. And I didn't understand why.

After some research (the only "documentation" I have of internals os Phoenix is the source code; and I had to do a lot of try and failure) I realized that in the Hbase, when using DESC, the bytes are inverted. That's when I corrected the algorithm so that it would work also when using when using DESC. then I commited the changes.

There is something I may be doing wrong:.

in the method org.apache.phoenix.expression.function.StringToUUIDFunction.evaluate(Tuple tuple, ImmutableBytesWritable ptr) I am doing this check: " String value; if (ptr.get()[ptr.getOffset() + 8] == '-') { value =new String(ptr.get(), ptr.getOffset(), UUID_TEXTUAL_LENGTH,Charset.forName("UTF-8")); } else { byte[] temp =SortOrder.invert(ptr.get(), ptr.getOffset(), new byte[36], 0, UUID_TEXTUAL_LENGTH); value = new String(temp, Charset.forName("UTF-8")); } "

Why this " == '-' " check?? because when writing the tests, in the test org.apache.phoenix.expression.function.UUIDFunctionTest.testUUIDFuntions() I have two checks: A) to watch over the ASC UUID uuid = UUID.randomUUID(); String uuidStr = testUUIDToStringFunction(uuid, null, SortOrder.ASC); UUID returnUUid = testStringToUUIDfunction(uuidStr, null, SortOrder.ASC); assertTrue(uuid.equals(returnUUid));

B) to watch over the DESC: uuid =UUID.randomUUID(); uuidStr = testUUIDToStringFunction(uuid, null, SortOrder.DESC); returnUUid = testStringToUUIDfunction(uuidStr, null, SortOrder.DESC); assertTrue(returnUUid.equals(UUID.fromString(uuidStr)));

When it is ASC, there is no problem, but in DESC, if I don't make that check (in StringToUUIDFunction), and I use directly what is in 'ptr' it gives error because the bytes are inverted and I have to invert them again.

If I don't make the StringToUUIDFunction check of " == '-' " and use directly what is in the ptr, all the other tests (disabling the DESC test in testUUIDFuntions) work correctly.

This makes me think that what is happening is that I have badly designed the DESC test in UUIDFunctionTest and I am "adapting" the function StringToUUIDFunction only for that test and that in real life it would never happen that: that StringToUUIDFunction never arrives the 'ptr' with the inverted bytes.

I think that in order for that to happen, there should be someting like: select STR_TO_UUID('uuid-string-with-bytes-inverted'); And that makes no sense.

So, what do you think? Should I remove the StringToUUIDFunction validation of " == '-' " and remove the DESC test? Can it be that the DESC test in UUIDFunctionTest is badly designed?

Is this logic correct ? (this is what now it is implemented):

assertTrue(PUUID.INSTANCE.isCoercibleTo(PUUIDIndexable.INSTANCE)); assertFalse(PUUIDIndexable.INSTANCE.isCoercibleTo(PUUID.INSTANCE));

or it should be: assertTrue(PUUID.INSTANCE.isCoercibleTo(PUUIDIndexable.INSTANCE)); assertTrue(PUUIDIndexable.INSTANCE.isCoercibleTo(PUUID.INSTANCE));

sqlTypes I had put: PUUID -> 2100 PUUIDIndexable -> 2101

should be other numbers? (easy to change because I grouped all numbers in PDataType)

In the Apache Yetus(jenkins) errors report (which appears and disappears like a ghost), I am seeing strange things:

A) error: unit : it is reporting "refCount leaked" many times; And this error seems to be a random thing, because in the report that I get two days ago coincides with the "refCount leaked" but in totally different places. Also the tests that I run locally pass the tests successfully.

B) error: checkstyle:

Type error 1. Instantiation of String. There are 5 instances, but this is a sample: "p.e. ./phoenix-core/src/main/java/org/apache/phoenix/expression/function/StringToUUIDFunction.java:87: value = new String(temp, Charset.forName("UTF-8")); Instantiation of java.lang.String should be avoided. [IllegalInstantiation]" Is it a checkstyle error? (https://github.com/checkstyle/checkstyle/issues/2404 ) If not, how can I instate a String ?

Type error 2: Checkstyle WhitespaceAround Errors in places where I didn't made changes: "p.e. ./phoenix-core/src/main/java/org/apache/phoenix/util/IndexUtil.java:635: if(skipWAL) {:9: 'if' is not followed by whitespace. [WhitespaceAround]"

Ok; there is a WhitespaceAround error... but, It comes from before I didn't touch it. Should I correct it? My way of working is to try to change as little as possible and I think it would be worse if I change it. But if it is convenient to change it, I change it.

Type error 3: Checkstyle Indentation Errors. Lines 95,96 and 97: "p.e. ./phoenix-core/src/main/java/org/apache/phoenix/schema/types/PDataTypeFactory.java: 95,96 and 97 types.add(PUUID.INSTANCE);: 'ctor def' child has incorrect indentation level 4, expected level should be 8. [Indentation]"

The lines before and after are indentation level 4; not 8. So if I indent to 8, it will be something like: ... types.add(PUnsignedTinyint.INSTANCE); types.add(PUnsignedTinyintArray.INSTANCE); types.add(PUUID.INSTANCE); <-- this line will be over indentation. Can't show it in this preview window types.add(PUUIDArray.INSTANCE); <-- this line will be over indentation. Can't show it in this preview window types.add(PUUIDIndexable.INSTANCE); <-- this line will be over indentation. Can't show it in this preview window // is it necessary to build a PUUIDIndexableArray type ? Now I don't find any case types.add(PVarbinary.INSTANCE); types.add(PVarbinaryArray.INSTANCE); ... What should I do? left it as level 4 or change all the file to level 8 ? I prefer to left it as level 4 (minimun changes, but Checkstyle Indentation Error)

C) misconception or misunderstanding:

in PUUID , PUUIDIndexable and PUUIDArray: should I implement the functions 'hashCode()' and/or 'equals()'? I have see that in the other types this is not done; so I didn't implement them. But if I don't do it I have a "spotbugs" that says that I have to do it; (the report of two days ago was removed from the pull request history, so I don't know what is the exact description is) Why does it give this error in these new types and in the others it doesn't give the error? And the main question: do I have to implement the functions 'hashCode()' and/or 'equals()'?

Jul 13 '22 13:07 alejandro-anadon

After the last commit I reached a stopping point.

I would like to continue but I will need someone to tell me how to proced on what I had said in my previous comment. Almost everything comes from the reports that Apache Yetus(jenkins) puts out. To summarize:

PUUID, PUUIDIndexable and PUUIDArray (error: spotbugs -> HE_INHERITS_EQUALS_USE_HASHCODE: Class inherits equals() and uses Object.hashCode()): It is a warning, not error, but should they implement the 'hashCode()' and/or 'equals()' methods even though the other types do not?
how do I solve the following checkstyles errors: [IllegalInstantiation] with the String? should I use another class?

for example: "./phoenix-core/src/main/java/org/apache/phoenix/util/UUIDUtil.java:83: uuidStr = new String(temp, Charset.forName("UTF-8")); :27: Instantiation of java.lang.String should be avoided. [IllegalInstantiation]"

Is it ok or do I have to correct the above question regarding StringToUUIDFunction.evaluate() and UUIDFunctionTest.testUUIDFunctions() ?
sqlTypes ids for PUUID -> 2100 and PUUIDIndexable -> 2101: are they correct ?
is the following logic correct? assertTrue(PUUID.INSTANCE.isCoercibleTo(PUUIDIndexable.INSTANCE)); assertFalse(PUUIDIndexable.INSTANCE.isCoercibleTo(PUUID.INSTANCE));
The errors that 'Apache Yetus(jenkins) error: unit' are reporting ("InListIT.cleanUp:88 refCount leaked", "MapReduceIT.clearCountersForScanGrouper:93 refCount leaked" and others): are they really errors?

The tests I run localy don't give me any errors (except some of "NeedTheirOwnClusterTests" but @stoty told me in another fix that "[...]there are also some flaky tests, so that's OK"). I am testing against hbase 2.4.1. I attach the log.

should I fix checkstyles errors (Indentation and WhitespaceAfter) that were before the changes and that have nothing to do with the changes I apply?

Thank you

tests logs.txt

Jul 14 '22 13:07 alejandro-anadon

I'll try to look in more detail in the next week but have been tied up sofar.

Jul 21 '22 04:07 dbwong

You can ignore point number 6. I guess @stoty has changed/fixed something in 'Apache Yetus(jenkins) error: unit' and, as I imagined, no longer gives unit errors of type "InListIT.cleanUp:88 refCount leaked" (or any other error) in this branch.

But they still appear: -checkstyles errors: 'Instantiation of java.lang.String should be avoided. [IllegalInstantiation]' (how do I instate Strings?)

-errors of type 'child has incorrect indentation level 4, expected level should be 8. [Indentation]'. which I think it would be a mistake to indent them to level 8 because the rest of the code is indented to level 4.

HE_INHERITS_EQUALS_USE_HASHCODE: Class inherits equals() and uses Object.hashCode()): in classes PUUID, PUUIDIndexable and PUUIDArray . It is a warning, not error, but I do not know if should they be implement the 'hashCode()' and/or 'equals()' methods even though the other types do not.

do not leave out the question about StringToUUIDFunction.evaluate() and UUIDFunctionTest.testUUIDFunctions()

Aug 09 '22 13:08 alejandro-anadon

Hi @dbwong ,

could you read my comments?

I do not know if you had time or if you are waiting some kings of changes in code. If it is the case, I was waiting your comments for the changes.

Thank you

Oct 20 '22 17:10 alejandro-anadon

Hi @dbwong ,

I haven't had an answer for a long time and I don't know if it's because you haven't had time to see my comments or if it's because this feature is considered unnecessary or useless (I think it's useful: many databases have it implemented). If this were the case, I would like it to be indicated to stop being pending and close the pull request.

Thank you

Dec 16 '22 12:12 alejandro-anadon

It's already the ned of year holidays, I suggest pinging ppl about the patch mid-january, when ppl are back and have caught up with work.

Dec 21 '22 17:12 stoty

Hi @dbwong and @stoty ,

I would like to know if, or you have not had time to review this new feature and answer my questions in order to finish the pull request, or if, on the contrary, it is not considered interesting (I still think it is).

Thank you.

Mar 01 '23 13:03 alejandro-anadon

phoenix phoenix copied to clipboard

Phoenix-6742 Add UUID type

phoenix
phoenix copied to clipboard