Add forceSchema option to output to a specified schema
This addresses https://github.com/databricks/spark-avro/issues/52
So, this patch might be a bit ugly and, given that I've done this internally, it may not be up to snuff, but I'm willing to get it cleaned up in the interest of making this a worthwhile contribution. It's currently running for us on a number of Spark jobs and working properly. Please tear into it and provide feedback. I will try to have the unit tests added over the weekend.
Oh, and hello! :)
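For context, here is a rough sketch of how the proposed option might be used, assuming it is passed through the standard DataFrameWriter option mechanism like the existing spark-avro options. Whether `forceSchema` takes the schema as a JSON string, a file path, or something else is an assumption for illustration, not something settled by this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("forceSchema-example").getOrCreate()

// Read a set of Avro files written with a known writer schema.
val df = spark.read.format("com.databricks.spark.avro").load("/path/to/input")

// Write them back out with that same writer schema, rather than letting
// spark-avro derive a fresh Avro schema from the DataFrame's StructType.
// (Passing the schema as a JSON string is an assumption for illustration.)
val writerSchemaJson = scala.io.Source.fromFile("/path/to/schema.avsc").mkString
df.write
  .format("com.databricks.spark.avro")
  .option("forceSchema", writerSchemaJson)
  .save("/path/to/output")
```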
Codecov Report
Merging #222 into master will increase coverage by 1.7%. The diff coverage is 94.36%.
@@            Coverage Diff             @@
##           master     #222     +/-   ##
==========================================
+ Coverage   90.46%   92.16%     +1.7%
==========================================
  Files           5        5
  Lines         325      383       +58
  Branches       51       73       +22
==========================================
+ Hits          294      353       +59
+ Misses         31       30        -1
What is the status on this?
I know Spark 2.2.0 support has recently been merged, so there is some activity, but I'm curious who I need to reach out to in order to get some eyes on this PR. There's not a contributing doc AFAIK, so I'm at a loss as to how to push this forward.
I have additional enhancements and fixes to this code, but before I go through the exercise of rebasing and incorporating them, I'd like to know whether this might get looked at. Tagging some top contributors... @JoshRosen @marmbrus
Thanks in advance.
@lindblombr could you resolve the conflict?
Ping @cloud-fan @liancheng for review.
Thanks.
If the Spark schema doesn't match the specified Avro schema, what shall we do? And shall we allow compatible schema changes, like int to long?
@cloud-fan If the schemas don't match, my thinking was that writing would fail (in some way). But if we want it to be more elegant, I'd need to implement some compatibility checking and better error handling; otherwise, the errors can surface in any number of places, making problems difficult to diagnose. The intent of this was to handle, specifically, the case of reading in a set of Avro files and writing that same set back out using the same writer schema. In that case, it works well. Outside of that, it would be the developer's responsibility to ensure type consistency.
If we want this change to be more generic, I can add handling for "forward compatible" types, like "int" => "long", "float" => "double", etc.
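To make that concrete, here is a rough sketch of what such a promotion check could look like; the helper name `isForwardCompatible` is hypothetical and not part of this patch:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.types._

// Hypothetical helper: returns true when a Catalyst type can be widened to the
// requested Avro type without losing information ("forward compatible").
def isForwardCompatible(sparkType: DataType, avroType: Schema.Type): Boolean =
  (sparkType, avroType) match {
    case (IntegerType, Schema.Type.INT)    => true
    case (IntegerType, Schema.Type.LONG)   => true   // int   => long
    case (LongType,    Schema.Type.LONG)   => true
    case (FloatType,   Schema.Type.FLOAT)  => true
    case (FloatType,   Schema.Type.DOUBLE) => true   // float => double
    case (DoubleType,  Schema.Type.DOUBLE) => true
    case (StringType,  Schema.Type.STRING) => true
    case _                                 => false  // e.g. long => int is rejected
  }
```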
Barring that, I would be happy to add some more detailed error handling, so we can say things like
- SchemaMismatchException: Dataset to store has field foo which does not exist in forceSchema 'myschema'.
- SchemaFieldTypeMismatchException: Dataset to store has field bar of type 'string' while forceSchema specified field bar of type 'long'
- SchemaMismatchException: Dataset to store has no fields that match forceSchema
- WARN field a of type int of dataset will be converted to Avro type long using forceSchema
- ERROR field a of type long cannot be converted to Avro type int
etc. etc.
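For instance, a minimal sketch of how the first of those errors might be surfaced; the exception name mirrors the wording above, and everything else here is hypothetical:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
import scala.collection.JavaConverters._

// Hypothetical exception carrying a human-readable description of the mismatch.
class SchemaMismatchException(message: String) extends RuntimeException(message)

// Fail fast with a field-level message instead of letting the writer blow up later.
def checkFieldsExist(sparkSchema: StructType, avroSchema: Schema, schemaName: String): Unit = {
  val avroFieldNames = avroSchema.getFields.asScala.map(_.name).toSet
  sparkSchema.fieldNames.foreach { f =>
    if (!avroFieldNames.contains(f)) {
      throw new SchemaMismatchException(
        s"Dataset to store has field $f which does not exist in forceSchema '$schemaName'.")
    }
  }
}
```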
Thanks so much for taking a look!
As a start, I think we can simply require the Spark schema to be the same as the Avro schema, while accepting namespace/field name differences.
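A rough illustration of that looser check, comparing structure positionally while ignoring record names, namespaces, and field names; this is a hypothetical sketch, not code from this PR, and only a few schema types are handled:

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical check: two Avro schemas are "structurally equal" if their types
// line up field-by-field, regardless of record/namespace/field naming.
def structurallyEqual(a: Schema, b: Schema): Boolean = (a.getType, b.getType) match {
  case (Schema.Type.RECORD, Schema.Type.RECORD) =>
    val aFields = a.getFields.asScala
    val bFields = b.getFields.asScala
    aFields.size == bFields.size &&
      aFields.zip(bFields).forall { case (af, bf) =>
        structurallyEqual(af.schema, bf.schema)   // positional match; names may differ
      }
  case (Schema.Type.ARRAY, Schema.Type.ARRAY) =>
    structurallyEqual(a.getElementType, b.getElementType)
  case (Schema.Type.MAP, Schema.Type.MAP) =>
    structurallyEqual(a.getValueType, b.getValueType)
  case (aType, bType) =>
    aType == bType   // primitives: just compare the type tags
}
```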