
Add support for ignoring zero byte avro files / corrupted files

Open ssabdb opened this issue 9 years ago • 4 comments

I have an issue where zero-byte Avro files (which shouldn't be there, and are the result of a rogue process unrelated to this useful library that we're struggling to unpick) cause globs to "fail". The same thing happens if another process has just started writing to a partition.

I'd like to add something to the code to ignore them. Would there be support for (or concerns about) my submitting a pull request for this?

ssabdb avatar Sep 05 '16 13:09 ssabdb

The case where another application has started writing to the file seems like it could manifest itself as corrupted / partially-readable files, so it sounds like you might want a more general "ignore bad files facility", or at least one where you have customization of "ignore empty files" vs. "ignore all bad files."

This seems reasonable to me, so please feel free to submit a PR. Make sure to include regression tests (for full line / branch coverage of your patch) and documentation for the new configurations.

JoshRosen avatar Nov 21 '16 23:11 JoshRosen

I am working with Azure Event Hubs (a service similar to Kafka), which can archive events to persistent storage (Event Hubs Archive). EH Archive creates empty files when no data was written in a given time window. In my case, if there is at least one file with data in the resolved globbed path, the read succeeds and the empty files are ignored; but if all the files are empty, it fails with 'Not an Avro data file'. I would very much like the reader to always ignore empty files.

itaysk avatar Jan 30 '17 12:01 itaysk

Has this issue been resolved in 3.2.0 and beyond? I just used 3.2.0 to read Avro files from HDFS and the empty files in the directory were ignored (whereas the more manual newAPIHadoopFile returned the 'Not an Avro data file' error).

josephpconley avatar Oct 30 '17 15:10 josephpconley

In 4.0.0 it handles files whose data is not in Avro format, but it still does not handle corrupted Avro data. I get AvroRuntimeException: Invalid sync, which is caused by poorly formatted Avro files. This is difficult to tackle with exception handling, because reading the files and applying transformations are lazy: the exception only surfaces when you save the DataFrame. Has someone resolved this?

akarsh3007 avatar May 22 '18 16:05 akarsh3007