Add forceSchema option to output to a specified schema
This addresses https://github.com/databricks/spark-avro/issues/52
So, this patch might be a bit ugly and, given that I've done this internally, it may not be up to snuff, but I'm willing to get it cleaned up in the interest of making this a worthwhile contribution. It's currently running for us on a number of Spark jobs and working properly. Please tear into it and provide feedback. I will try to have the unit tests added over the weekend.
Oh, and hello! :)
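For context, here is a rough sketch of how the proposed option might be used, assuming it is passed through the standard DataFrameWriter option mechanism like the existing spark-avro options. Whether `forceSchema` takes the schema as a JSON string, a file path, or something else is an assumption for illustration, not something settled by this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("forceSchema-example").getOrCreate()

// Read a set of Avro files written with a known writer schema.
val df = spark.read.format("com.databricks.spark.avro").load("/path/to/input")

// Write them back out with that same writer schema, rather than letting
// spark-avro derive a fresh Avro schema from the DataFrame's StructType.
// (Passing the schema as a JSON string is an assumption for illustration.)
val writerSchemaJson = scala.io.Source.fromFile("/path/to/schema.avsc").mkString
df.write
  .format("com.databricks.spark.avro")
  .option("forceSchema", writerSchemaJson)
  .save("/path/to/output")
```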
Codecov Report
Merging #222 into master will increase coverage by 1.7%. The diff coverage is 94.36%.
@@            Coverage Diff             @@
##           master     #222     +/-   ##
==========================================
+ Coverage   90.46%   92.16%     +1.7%
==========================================
  Files           5        5
  Lines         325      383       +58
  Branches       51       73       +22
==========================================
+ Hits          294      353       +59
+ Misses         31       30        -1
What is the status on this?
I know Spark 2.2.0 support has recently been merged, so there is some activity, but I'm curious who I need to reach out to in order to get some eyes on this PR. There's not a contributing doc AFAIK, so I'm at a loss as to how to push this forward.
I have additional enhancements and fixes to this code, but before I go through the exercise of rebasing and incorporating them, I'd like to know whether this might get looked at. Tagging some top contributors... @JoshRosen @marmbrus
Thanks in advance.
@lindblombr could you resolve the conflict?
Ping @cloud-fan @liancheng for review.
Thanks.
If the Spark schema doesn't match the specified Avro schema, what shall we do? And shall we allow compatible schema changes, like int to long?
@cloud-fan If the schemas don't match, my thinking was that writing would fail (in some way). But if we want it to be more elegant, I'd need to implement some compatibility checking and better error handling; otherwise, the errors can surface in any number of places, making problems difficult to diagnose. The intent of this was to handle, specifically, the case of reading in a set of Avro files and writing that same set back out using the same writer schema. In that case, it works well. Outside of that, it would be the developer's responsibility to ensure type consistency.
If we want this change to be more generic, I can add handling for "forward compatible" types, like "int" => "long", "float" => "double", etc.
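To make that concrete, here is a rough sketch of what such a promotion check could look like; the helper name `isForwardCompatible` is hypothetical and not part of this patch:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.types._

// Hypothetical helper: returns true when a Catalyst type can be widened to the
// requested Avro type without losing information ("forward compatible").
def isForwardCompatible(sparkType: DataType, avroType: Schema.Type): Boolean =
  (sparkType, avroType) match {
    case (IntegerType, Schema.Type.INT)    => true
    case (IntegerType, Schema.Type.LONG)   => true   // int   => long
    case (LongType,    Schema.Type.LONG)   => true
    case (FloatType,   Schema.Type.FLOAT)  => true
    case (FloatType,   Schema.Type.DOUBLE) => true   // float => double
    case (DoubleType,  Schema.Type.DOUBLE) => true
    case (StringType,  Schema.Type.STRING) => true
    case _                                 => false  // e.g. long => int is rejected
  }
```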
Barring that, I would be happy to add some more detailed error handling, so we can say things like
- SchemaMismatchException: Dataset to store has field foo which does not exist in forceSchema 'myschema'.
- SchemaFieldTypeMismatchException: Dataset to store has field bar of type 'string' while forceSchema specified field bar of type 'long'
- SchemaMismatchException: Dataset to store has no fields that match forceSchema
- WARN field a of type int of dataset will be converted to Avro type long using forceSchema
- ERROR field a of type long cannot be converted to Avro type int
etc. etc.
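For instance, a minimal sketch of how the first of those errors might be surfaced; the exception name mirrors the wording above, and everything else here is hypothetical:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
import scala.collection.JavaConverters._

// Hypothetical exception carrying a human-readable description of the mismatch.
class SchemaMismatchException(message: String) extends RuntimeException(message)

// Fail fast with a field-level message instead of letting the writer blow up later.
def checkFieldsExist(sparkSchema: StructType, avroSchema: Schema, schemaName: String): Unit = {
  val avroFieldNames = avroSchema.getFields.asScala.map(_.name).toSet
  sparkSchema.fieldNames.foreach { f =>
    if (!avroFieldNames.contains(f)) {
      throw new SchemaMismatchException(
        s"Dataset to store has field $f which does not exist in forceSchema '$schemaName'.")
    }
  }
}
```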
Thanks so much for taking a look!
As a start, I think we can simply require the Spark schema to be the same as the Avro schema, while accepting namespace/field name differences.
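A rough illustration of that looser check, comparing structure positionally while ignoring record names, namespaces, and field names; this is a hypothetical sketch, not code from this PR, and only a few schema types are handled:

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical check: two Avro schemas are "structurally equal" if their types
// line up field-by-field, regardless of record/namespace/field naming.
def structurallyEqual(a: Schema, b: Schema): Boolean = (a.getType, b.getType) match {
  case (Schema.Type.RECORD, Schema.Type.RECORD) =>
    val aFields = a.getFields.asScala
    val bFields = b.getFields.asScala
    aFields.size == bFields.size &&
      aFields.zip(bFields).forall { case (af, bf) =>
        structurallyEqual(af.schema, bf.schema)   // positional match; names may differ
      }
  case (Schema.Type.ARRAY, Schema.Type.ARRAY) =>
    structurallyEqual(a.getElementType, b.getElementType)
  case (Schema.Type.MAP, Schema.Type.MAP) =>
    structurallyEqual(a.getValueType, b.getValueType)
  case (aType, bType) =>
    aType == bType   // primitives: just compare the type tags
}
```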