spark-avro
.avro files not found in the folder.
Hi Guys,
I used Hive 1.0+ for the CSV-to-Avro conversion and it saved the files without the .avro extension.
Following is the exception:
Exception in thread "main" java.lang.RuntimeException: Could not find .avro file with schema at s3://sample_bucket/
I also tried setting avro.mapred.ignore.inputs.without.extension in my Hadoop configuration and in the Spark context's Hadoop configuration, but that does not work either.
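For reference, this is roughly how I set it (just a sketch; the app name and SparkContext variable are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("csv-to-avro"))
// ask the Avro input format not to skip files that lack the .avro extension
sc.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")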
In the end I had to remove the extension check myself:
https://github.com/mkanchwala/spark-avro/blob/update_artifacts/src/main/scala/com/databricks/spark/avro/AvroRelation.scala
Can you please carry this change over to 1.1.0 as well? I think the check for the .avro extension exists there too, and it could potentially be incompatible with other frameworks.
/cc @marmbrus, how should we handle this? Simply omitting the check probably isn't sufficient in case there are Hadoop temp files in the same directory, right?
Actually, this issue looks like a possible duplicate of #40.
For schema discovery in Parquet, I think we look for any file, regardless of extension (but filter out files that start with . or _).
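For illustration only, a filter along those lines might look like this (a sketch, not the actual Parquet or spark-avro code):
import org.apache.hadoop.fs.Path
// accept data files regardless of extension, but skip hidden/temporary entries
// such as _SUCCESS, _temporary and .part-*.crc
def isDataFile(path: Path): Boolean = {
  val name = path.getName
  !name.startsWith("_") && !name.startsWith(".")
}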
@JoshRosen yup, that is the same problem I am facing. For the time being I removed the .avro extension check (from the update_artifacts branch) and rebuilt it for my production environment. But, as per the referenced issue, I am now facing a problem with a multi-level directory hierarchy, plus issue #75.
So can you guys help me out here?
Thanks
What version was this observed in? I contributed a patch that was merged back in April to make spark-avro observe the same property that the underlying Hadoop InputFormat checks for:
https://github.com/databricks/spark-avro/pull/43
master should already work if you set that property; in fact it does for us. We're using a snapshot we built ourselves in our systems until a release is out.
Currently I am using spark-avro_2.10 - 1.0.0 and Apache Spark 1.4.1
Do we have a test case for this scenario? Even if we believe that it has been fixed, we should probably add a test to make sure it doesn't break in the future.
@JoshRosen Sorry, nope, no test that I'm aware of. My patch didn't add one at least.
@mkanchwala Perhaps try building a snapshot from master to see if you get the same problem? The patch I mentioned didn't make the 1.0.0 release.
OK, sure. Let me try that with the current master branch (databricks/spark-avro/master).
It got stuck on the save call, i.e. df.save(outpath, "com.databricks.spark.avro"). I've also used unionAll to create a single RDD.
15/08/26 11:39:32 INFO avro.AvroRelation: using snappy for Avro output
15/08/26 11:39:32 INFO output.DirectFileOutputCommitter: Nothing to setup since the outputs are written directly.
15/08/26 11:39:32 INFO spark.SparkContext: Starting job: save at DataClassifierApp.scala:84
15/08/26 11:39:32 INFO scheduler.DAGScheduler: Got job 2 (save at DataClassifierApp.scala:84) with 2 output partitions (allowLocal=false)
15/08/26 11:39:32 INFO scheduler.DAGScheduler: Final stage: ResultStage 3(save at DataClassifierApp.scala:84)
15/08/26 11:39:32 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/08/26 11:39:32 INFO scheduler.DAGScheduler: Missing parents: List()
15/08/26 11:39:32 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[269] at createDataFrame at DataClassifierApp.scala:60), which has no missing parents
15/08/26 11:39:32 INFO storage.MemoryStore: ensureFreeSpace(76896) called with curMem=7812634, maxMem=901262868
15/08/26 11:39:32 INFO storage.MemoryStore: Block broadcast_61 stored as values in memory (estimated size 75.1 KB, free 852.0 MB)
15/08/26 11:39:32 INFO storage.MemoryStore: ensureFreeSpace(27279) called with curMem=7889530, maxMem=901262868
15/08/26 11:39:32 INFO storage.MemoryStore: Block broadcast_61_piece0 stored as bytes in memory (estimated size 26.6 KB, free 852.0 MB)
15/08/26 11:39:32 INFO storage.BlockManagerInfo: Added broadcast_61_piece0 in memory on localhost:45277 (size: 26.6 KB, free: 858.8 MB)
15/08/26 11:39:32 INFO spark.SparkContext: Created broadcast 61 from broadcast at DAGScheduler.scala:874
15/08/26 11:39:32 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 3 (MapPartitionsRDD[269] at createDataFrame at DataClassifierApp.scala:60)
15/08/26 11:39:32 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
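For reference, the save in question was along these lines (a rough sketch against the Spark 1.4 DataFrame API; outpath, df1 and df2 are placeholders):
// combine the per-input DataFrames into one before writing
val df = df1.unionAll(df2)
// Spark 1.4-era save API; the format string selects the spark-avro data source
df.save(outpath, "com.databricks.spark.avro")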
I think there is some performance-related issue with the latest snapshot, but the patch may have achieved the goal we needed: I was able to read the Avro files without the extension.
15/08/26 11:34:18 INFO s3n.S3NativeFileSystem: Opening 's3://sample-bucket/sample-data/f09cae84-9a9a-48c3-bc7e-102d84bc21b8-000115' for reading
15/08/26 11:34:19 INFO s3n.S3NativeFileSystem: Stream for key 'sample-data/f09cae84-9a9a-48c3-bc7e-102d84bc21b8-000115' seeking to position '134408'
15/08/26 11:34:19 INFO rdd.HadoopRDD: Input split: 's3://sample-bucket/sample-data/f09cae84-9a9a-48c3-bc7e-102d84bc21b8-000096:0+145883
15/08/26 11:34:19 INFO s3n.S3NativeFileSystem: Opening 's3://sample-bucket/sample-data/f09cae84-9a9a-48c3-bc7e-102d84bc21b8-000096' for reading
15/08/26 11:34:19 INFO executor.Executor: Finished task 37.0 in stage 0.0 (TID 37). 1926 bytes result sent to driver
Can we add test cases for this? It is something important to take care of in the future for cross-platform compatibility.
Thanks Guys
We are using spark-avro_2.10 - 1.0.0 and Apache Spark 1.6 and we have found the same issue. We cannot load avro files without the ".avro" extension. What can we do to load avro files without extension?
@laiadescamps Can you try
val sqlContext = new SQLContext(sc)
sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")
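and then read the directory as usual, for example (just a sketch; the path is a placeholder):
val df = sqlContext.read.format("com.databricks.spark.avro").load("s3://your-bucket/your-data/")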
Thanks a lot @mkanchwala! It works! Easy and simple.
Hello @mkanchwala, I am a newbie to Spark and I am facing an issue while reading Avro files that do not have the .avro extension. Kindly help!
Spark version: 1.3, Jar: spark-avro_2.10-1.0.0.jar, Hive: hive-1.1.0-cdh5.4.4
I am using Hive's INSERT INTO to write Snappy-compressed Avro files, which are written without the .avro extension. I set avro.mapred.ignore.inputs.without.extension to false, as suggested, but I am still getting the exception. Please see the steps below:
spark-shell --jars spark-avro_2.10-1.0.0.jar
SQL context available as sqlContext.
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5e9c697a
scala> import sqlContext._
import sqlContext._
scala> import com.databricks.spark.avro._
import com.databricks.spark.avro._
scala> sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")
scala> val ufos = sqlContext.avroFile("/group/lzrio_ods/db/lzrio_ods_uda.db/tbl_rem_incr/000063_0")
java.lang.RuntimeException: Could not find .avro file with schema at /group/lzrio_ods/db/lzrio_ods_uda.db/tbl_rem_incr/000063_0
at scala.sys.package$.error(package.scala:27)
at com.databricks.spark.avro.AvroRelation$$anonfun$4.apply(AvroRelation.scala:109)
at com.databricks.spark.avro.AvroRelation$$anonfun$4.apply(AvroRelation.scala:109)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:109)
at com.databricks.spark.avro.AvroRelation.
Hi @narendracs
In your case I can see you're using Spark 1.3.x. I would suggest upgrading to 1.5.x or 1.6.x and trying again with the latest spark-avro 2.x snapshot (it includes the patch for this). Otherwise, if you want to stick with your current versions, I would suggest building a spark-avro 1.x snapshot with the patch applied and trying again with that jar.
Thank you so much. Upgrading the cluster is not an option, so I'll try to build the snapshot!
Thanks
Similar issue on Spark 1.6.1 with com.databricks:spark-avro_2.10:2.0.1.
With two files that have the same content, one named 'f' without the .avro extension and one named 'f.avro' with the extension (i.e. cp f.avro f):
val df = sqlContext.read.avro("f.avro") // works as expected
df.count // gives a non-zero count
If avro.mapred.ignore.inputs.without.extension is set to false, then val df = sqlContext.read.avro("f") gives no error and df.printSchema works, but df.count is always 0.
This puzzles me, as it definitely inferred the schema but returned no records.
If avro.mapred.ignore.inputs.without.extension is not set, then val df = sqlContext.read.avro("f") fails with "java.lang.RuntimeException: No avro files present at .."
@narendracs I checked your post and it seems to work fine on my setup. I was using Spark 2.0.1.
I set: spark.sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")
and then:
val df = spark.read.format("com.databricks.spark.avro").load("hdfs:///user/nifi/data/**")
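To confirm the extension-less files are actually being read (and not just their schema), a quick check such as the following should return a non-zero count:
df.count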