avro icon indicating copy to clipboard operation
avro copied to clipboard

AVRO-4060: Use JDK to Hash Byte Array in UTF8

Open belugabehr opened this issue 1 year ago • 5 comments

What is the purpose of the change

  • This pull request improves file read performance by using the JDK hashcode implementation, fixing AVRO-4060.*

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Documentation

  • Does this pull request introduce a new feature? no

belugabehr avatar Sep 25 '24 01:09 belugabehr

I feel there should be a test to verify that the result of Utf8.hashCode does not depend on whether the array has unused bytes in it. TestUtf8.hashCodeReused seems to largely cover that by hardcoding specific hash codes but it does not compute hash codes for equal sequences of bytes in two ways. I'm thinking about test code like

Utf8 u1 = new Utf8("abcdefghi"); // length=9, bytes.length=9
u1.setByteLength(8); // length=8, bytes.length=9
Utf8 u2 = new Utf8("abcdefgh"); // length=8, bytes.length=8
assertEquals(u1.hashCode(), u2.getHashCode());

or with a hardcoded hash code in the test.

KalleOlaviNiemitalo avatar Sep 25 '24 01:09 KalleOlaviNiemitalo

Hello,

Thank you for the feedback. I agree a unit test is appropriate here.

Ideally, the hashcode implementation should return the same value for the same contents regardless on the size of the array. However, that is not currently the case because the two implementations are different. I believe AVRO-4061 will fix the discrepancy between the two implementations and then this unit test becomes possible.

I also need to look at the setLength method. It's very confusing to me. If the buffer is made smaller, then the underlying string is truncated, if the buffer is made larger, than the underlying string is padded with zero (it's not just that the buffer is expanded). I don't really understand the use case of this method just yet. I feel like it shouldn't bother copying anything into the newly sized array since the end result is somewhat confusing.

belugabehr avatar Sep 25 '24 02:09 belugabehr

I have a unit test that confirms the behavior, but it does require AVRO-4061 to be merged first.

belugabehr avatar Sep 25 '24 02:09 belugabehr

Unit test will fail on this PR until AVRO-401 is addressed.

belugabehr avatar Sep 25 '24 02:09 belugabehr

AVRO-4061 is now fixed.

Added unit test for coverage.

belugabehr avatar Sep 30 '24 01:09 belugabehr

@KalleOlaviNiemitalo and @martin-g - Another pass on this one please?

belugabehr avatar Dec 29 '24 05:12 belugabehr