
Relax filesystem layout constraints

Open jaley opened this issue 10 years ago • 4 comments

Hey,

I'd like to use spark-avro to load and merge some Avro data that an existing system is producing in large quantities, and convert it to Parquet for interactive, ad-hoc analytics in Spark. I've got two issues preventing it all from "just working" with the current implementation:

  • Requires that files are named "*.avro".
  • Requires "flat" directory layout - no support for nesting.

Our data is laid out as <dataset name>/<timestamp>/<random uuid>, which I've not been able to make work with spark-avro as yet. I want to merge all files under the same dataset name into a set of large Parquet files, which I don't think is possible with the API as it stands now.

Actually, I've had a go at tweaking this myself and what I'm seeing is really puzzling me. If anyone is able to explain to me what's happening here and propose an acceptable change to the API, I'd be happy to send a PR. Here's what I've seen:

  • There's an initial search in AvroRelation for any file ending in ".avro", which is used to load the schema. This part actually does perform a recursive search from the location parameter.
  • Loosening that requirement is straightforward enough, but the data is then loaded using hadoopFile(location, ...), which only works with a flat directory of data files (see the sketch after this list). It seems odd that the first search is recursive and this one isn't?
  • Because the location parameter is used both directly as an argument to hadoopFile and as the root of the recursive schema search above, you can't really use the wildcards and such that hadoopFile supports.
  • Even when I hacked it to let me pass in wildcards, the resulting RDD for some reason only contained files ending in ".avro". As far as I can see, that requirement should only apply to the file used to load the schema, but it seems to make it all the way through somehow. I even saw logging from the underlying Hadoop FileInputFormat saying it had matched 700 data files, but then only the one file ending in .avro (which I had added for testing) was actually loaded into the RDD.
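For concreteness, here's a minimal sketch (untested, and not spark-avro's actual code path) of what I mean by reading a nested layout through the old mapred API directly. The mapred.input.dir.recursive switch is stock Hadoop, and the AvroInputFormat is Avro's own, which I believe spark-avro uses underneath:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def loadNestedAvro(sc: SparkContext, root: String): RDD[GenericRecord] = {
  // Ask the old-API FileInputFormat to descend into subdirectories
  // instead of skipping anything that isn't a plain file.
  sc.hadoopConfiguration.set("mapred.input.dir.recursive", "true")
  sc.hadoopFile(
    root,
    classOf[AvroInputFormat[GenericRecord]],
    classOf[AvroWrapper[GenericRecord]],
    classOf[NullWritable]
  ).map { case (wrapper, _) =>
    // NB: Avro reuses record objects within a partition; copy the
    // datum before caching or shuffling if you need to hold on to it.
    wrapper.datum()
  }
}
```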

Any suggestions? Like I was saying, if anyone can offer some direction on how best to make this work, I'll happily implement it myself and send over a pull request.

jaley avatar Mar 23 '15 11:03 jaley

I just found that the remaining confusion I was having was coming from the AvroInputFormat itself, so I thought I'd add a comment here in case anyone else has the same issue and is googling. There's a property that can be set, avro.mapred.ignore.inputs.without.extension, which controls whether it reads only files ending in .avro or all matched files.
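In case it saves anyone a search, setting it is one line on the SparkContext's Hadoop configuration:

```scala
// Disable AvroInputFormat's ".avro" extension filter so all matched
// files are read, not only those named "*.avro".
sc.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")
```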

Still, if anyone is able to offer direction on my other points, I'd be happy to get involved. Apologies for the confusion!

jaley avatar Mar 23 '15 12:03 jaley

I have a similar problem, only I need support for globs. In my case I want to load only files matching a pattern within a directory. Recursion would be useful too, e.g. so I could load s3n://my-bucket/log/web/**/access-*.avro.

It seems the way to do this with Hadoop is with PathFilters. It would be good if support were added for passing one in; something like the sketch below.
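For illustration, here's roughly what I have in mind, using plain Hadoop APIs. The AccessLogFilter class and its pattern are made up, sc is an existing SparkContext, and spark-avro has no hook for any of this today:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

// Hypothetical filter: keep only files named like "access-*.avro".
class AccessLogFilter extends PathFilter {
  override def accept(path: Path): Boolean = {
    val name = path.getName
    name.startsWith("access-") && name.endsWith(".avro")
  }
}

// Plumb it through the old mapred API by hand. The filter only applies
// to the directories Hadoop actually lists, so recursive matching still
// needs the recursion switch mentioned earlier in this thread.
val conf = new JobConf(sc.hadoopConfiguration)
FileInputFormat.setInputPaths(conf, "s3n://my-bucket/log/web")
FileInputFormat.setInputPathFilter(conf, classOf[AccessLogFilter])
val records = sc.hadoopRDD(
  conf,
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable]
)
```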

boosh avatar Mar 26 '15 12:03 boosh

jaley, thanks for the tip about avro.mapred.ignore.inputs.without.extension !

alexnastetsky avatar Jan 20 '16 19:01 alexnastetsky

I have an issue when avro.mapred.ignore.inputs.without.extension is set to false: the schema is inferred, but no records are returned. See https://github.com/databricks/spark-avro/issues/71#issuecomment-234038485

yiwang avatar Jul 20 '16 18:07 yiwang