avro icon indicating copy to clipboard operation
avro copied to clipboard

AVRO-2952: AvroDatatype

Open wernerdaehn opened this issue 3 years ago • 13 comments

Make sure you have checked all steps below.

Jira

Tests

  • org.apache.avro.generic.TestDatatype

Commits

  • [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • All the public functions and the classes in the PR contain Javadoc that explain what it does

wernerdaehn avatar Nov 02 '20 20:11 wernerdaehn

While I am no maintainer, I'll still offer my two cents from a short browse over what seems like a very interesting PR.

This PR features an interesting approach to extending the handling of logical types. I like the general concept, but I personally feel that the code as presented is not as mature (or I don't understand it enough) as I'd like to see code that's merged into my favourite serialization framework ;)

Points that should be addressed in my eyes:

  • the methods public String toString() and public String toString(StringBuilder, Value) imply they do something similar, but actually don't
  • the methods public int hashCode() are not implemented consistently, and where they are, they all return 1 which leads to bad hashmapping in cases where that might ever be used
  • public boolean equals() feels odd where it addresses Arrays and Maps
  • instances of override functions that only call their super version are redundant
  • it is unclear to me why some effectively static final fields are only set as static (leads to short lived wtfs when reading)
  • getRecommendedSchema returns null in some cases (arrays, maps, records), but neither interface or javadoc specify that

I wonder if it would be possible to reduce the PR size a lot by using a parameterized base class for the most common datatypes and only override code that actually changes between the different classes. (functions like getBackingType, getRecommendedSchema, getAvroType and getConvertedType are basically the same in all cases)

Again, as I said above, I like the general concept, and I am sure the presented PR can be "prettied up" quite a bit to be easier readable and understandable.

unchuckable avatar Dec 13 '20 18:12 unchuckable

  • the methods public String toString() and public String toString(StringBuilder, Value) imply they do something similar, but actually don't: Correct. But as both methods do not make sense by themselves, both help only in online debugging. I have no problem change the name, though, your choice.
  • the methods public int hashCode() are not implemented consistently, and where they are, they all return 1 which leads to bad hashmapping in cases where that might ever be used: There are two types of LogicalTypes: 1) Types like AvroInt, AvroLong etc that do not need any parameters. Hence they are Singletons. You cannot create a new one, you use the factory method to return the ONE. As there is just a single object, the hashCode() can return anything. Case 2) is where different instances are created because e.g. an AvroVarchar has a length. All these classes have a proper hashCode, e.g. LogicalTypeWithLength returns the hashCode of the length. So here I believe we are fine.
  • public boolean equals() feels odd where it addresses Arrays and Maps: I assume you meant Enum and Map. When should Enum.equals(Enum) return true? If both are Enums or if both are Enums and have the exact schema? Or if both are Enums and the schemas are forward compatible? I believe the first is the best option, as you will use these comparisons to convert values. And for values the possible values do not play a role. Hence I would suggest to stick with the current logic: all Enum datatypes are Enum datatypes.
  • instances of override functions that only call their super version are redundant: True. I thought it is a good idea for readability. To remind myself that there is nothing special. I have no problem changing that.
  • it is unclear to me why some effectively static final fields are only set as static: That I can explain, I am lacy on defining things as final. Agree, that is something I should change.
  • getRecommendedSchema returns null in some cases (arrays, maps, records), but neither interface or javadoc specify that: Deal, will update the Javadoc.
  • getBackingType, getRecommendedSchema: I took some inspiration from the current LogicalTypes, there the classes are rather self sufficient as well. And actually, it does make sense. When something does change and it should return a different object, a one-line change, you do not want to modify code at three different classes. Hence my feeling is it should rather stay as is. But maybe I misunderstood your point. Do you have an example?

Before I make the changes, any counter arguments? Any requests from the Avro maintainers?

wernerdaehn avatar Dec 13 '20 18:12 wernerdaehn

Before I make the changes, any counter arguments? Any requests from the Avro maintainers?

Didn't mean to dishearten BTW, just give the honest feedback ;)

unchuckable avatar Dec 13 '20 19:12 unchuckable

(And I never said that all my contributions are flawless without a doubt. It is also my first contribution to Avro, hence I have no feel for it yet)

wernerdaehn avatar Dec 14 '20 05:12 wernerdaehn

@unchuckable Hi Martin, I made the agreed changes and more of what you have requested initially. For example the hashCode(), the toString(StringBuffer) etc. Will check in the code as soon as the main build is on PASS again.

wernerdaehn avatar Jan 23 '21 20:01 wernerdaehn

Hi, Werner,

that’s nice to hear. (Also good to hear about the main build not being on PASS, because that could explain issues with my own PR ;)). Mind you that I cannot accept your PR but was merely trying to offer constrictive remarks that might be made by others during the review. Lets hope one of the commited maintainers can find the time to review timely.

On 23. Jan 2021, at 21:05, wernerdaehn [email protected] wrote:

@unchuckable https://github.com/unchuckable Hi Martin, I made the agreed changes and more of what you have requested initially. For example the hashCode(), the toString(StringBuffer) etc. Will check in the code as soon as the main build is on PASS again.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/avro/pull/973#issuecomment-766171549, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEDPCL4NPOPNAPEUKPS3HRDS3MT2BANCNFSM4TH4WI3A.

unchuckable avatar Jan 23 '21 20:01 unchuckable

@RyanSkraba Ready for review!

Just for a side note, I have some more code that I could contribute to Avro. But before doing that, I need to know if this one got accepted and what I can do better in future.

wernerdaehn avatar Feb 02 '21 15:02 wernerdaehn

Hello! I haven't had the time to attack this PR. There's a lot of lines of code, and thanks so much to @unchuckable for picking up the slack on the peer review -- great job, your opinion is welcomed and valued! @wernerdaehn of course, I also appreciate your openness, patience and willingness to contribute your work!

I have some concerns about the interactions between the new AvroDatatype interface being introduced and the existing Conversions, as well as existing non-standard user-defined LogicalTypes -- I really need to give this a thorough read to better understand. With my time constraints, I can commit to doing a read through and discussion in the next week!

I would love to have any other committers weigh in, especially if anyone has any ideas or experience on making big changes less daunting. I don't want to make more work for you!

RyanSkraba avatar Feb 04 '21 14:02 RyanSkraba

Thanks @RyanSkraba, appreciate it. The important part to keep in mind is that using this interface is optional. It neither impacts existing Logical Data Types nor custom data types - I have these myself. And it does not change how you can use existing or custom conversions. It just has easier to use helper classes for the normal cases. Regarding the amount of code, actually most is just a repeat. If you scanned one data type class you know all. The only parts that will be interesting are the changes I made in the LogicalType.java and Schema.java. The first requires to change a few properties to PUBLIC and both include - unfortunately - methods that deal with individual logical types. Anyway, once you read the code you will find that it is all pretty straight forward.

wernerdaehn avatar Feb 04 '21 17:02 wernerdaehn

A quick response: Agree with all you said. We need to make some decisions but whatever you decide, I am fine with.

wernerdaehn avatar Mar 02 '21 18:03 wernerdaehn

I didn't see mention in the JIRA or here, but was this discussed on the mailing list? IIUC this is adding on custom extensions to logical types and additional metadata to the schema? If this is the case it seems like something that other implementations might care about if they will need to support later (i.e. it seems like updating the specification first with optional these optional extensions would be a good thing to do first so there is agreement on how to interpret them). Otherwise i would worry about the java code becoming the specification (which in my experience yields lower quality implementations elsewhere). If this doesn't affect serialized data or interpretation in any way please ignore my comment.

emkornfield avatar Mar 03 '21 05:03 emkornfield

You are right, this is a copy/paste from another discussion. Should be removed from here.

wernerdaehn avatar Mar 03 '21 05:03 wernerdaehn

I have brought this pull request to the current version for easier integration. Especially that one dangling test get replaced by the master version removing my changes. Don't no what else to do and close of giving up.

wernerdaehn avatar Apr 06 '21 13:04 wernerdaehn