spark-avro

Spark avro does lossy schema conversion

Open · robert3005 opened this issue 9 years ago · 1 comment

I know some conversion is necessary to fit Spark types into the Avro file format. However, the way it's done currently loses information and forces callers to re-encode the type conversions that spark-avro performs themselves.

Spark data sources should be required to define a schema transformation function whenever they transform the schema.

Spark Avro could easily implement it by extracting the schema conversion function from AvroOutputWriter. I would be happy to submit a PR and change upstream Spark to make this a first-class thing, since I imagine more file formats have the same issue (text?).
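A minimal sketch of what such a schema transformation function could look like, written in plain Python rather than against Spark's real API. The type mapping is an assumption made for illustration (byte and short widened to int, decimal written as string, timestamp written as long), and the names `transform_schema` and `lossy_fields` are hypothetical, not part of spark-avro or Spark.

```python
# Assumed Spark -> Avro type mapping, for illustration only.
SPARK_TO_AVRO = {
    "boolean": "boolean",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "string",
    "binary": "bytes",
    "byte": "int",        # widened: the original width is not recorded
    "short": "int",       # widened: the original width is not recorded
    "timestamp": "long",  # the timestamp type is not recorded
    "decimal": "string",  # precision and scale are not recorded
}

# Assumed Avro -> Spark mapping used when the file is read back.
AVRO_TO_SPARK = {
    "boolean": "boolean",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "string",
    "bytes": "binary",
}


def transform_schema(fields):
    """Hypothetical transformation function: the schema a reader sees
    after the given (name, type) fields are written and read back."""
    return [(name, AVRO_TO_SPARK[SPARK_TO_AVRO[t]]) for name, t in fields]


def lossy_fields(fields):
    """Return (name, written_type, read_type) for fields whose type changes."""
    after = transform_schema(fields)
    return [
        (name, before, read_back)
        for (name, before), (_, read_back) in zip(fields, after)
        if before != read_back
    ]


schema = [("id", "long"), ("flag", "byte"), ("price", "decimal"), ("ts", "timestamp")]
print(transform_schema(schema))
print(lossy_fields(schema))
```

With a function like this exposed by the data source, a caller could ask up front which fields will change type instead of hard-coding the conversions on their side.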

robert3005, Jul 19 '16 10:07

I agree that implicit conversion is bad in this context. This is a behavioral change that we should consider making in a 4.x release. For now, though, I think that the best that we can hope to do while maintaining compatibility for existing users is to add a big warning about implicit lossy schema conversion and to add an option to prohibit such implicit conversions (maybe a new setting key called strictMode or something similar). Happy to accept PRs for this.
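A rough sketch of how the warning plus an opt-in strict setting might behave, again in plain Python and not spark-avro code. The `strict_mode` parameter stands in for the proposed strictMode key, and `check_schema` and the `LOSSY_TYPES` table are hypothetical names and assumed contents.

```python
import warnings

# Assumed set of Spark types that are written as a different Avro type.
LOSSY_TYPES = {
    "byte": "int",
    "short": "int",
    "decimal": "string",
    "timestamp": "long",
}


def check_schema(fields, strict_mode=False):
    """Warn about implicit lossy conversions, or reject them in strict mode.

    `fields` is a list of (name, spark_type) pairs.
    """
    lossy = [(name, t, LOSSY_TYPES[t]) for name, t in fields if t in LOSSY_TYPES]
    if not lossy:
        return
    details = ", ".join(f"{name}: {src} -> {dst}" for name, src, dst in lossy)
    message = f"implicit lossy schema conversion: {details}"
    if strict_mode:
        # Strict mode: refuse to write rather than silently change types.
        raise ValueError(message)
    # Default mode: keep the existing behaviour but make the loss visible.
    warnings.warn(message)


try:
    check_schema([("id", "long"), ("flag", "byte")], strict_mode=True)
except ValueError as err:
    print(err)
```

Keeping the default at a warning preserves behaviour for existing users, while strict mode gives callers a way to fail fast before any data is written.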

JoshRosen, Nov 27 '16 21:11