
.avro files not found in the Folder.

Open mkanchwala opened this issue 9 years ago • 19 comments

Hi Guys,

I used Hive 1.0+ for the CSV to Avro conversion and it saved the files without the .avro extension.

Following is the exception:

Exception in thread "main" java.lang.RuntimeException: Could not find .avro file with schema at s3://sample_bucket/

I also tried setting avro.mapred.ignore.inputs.without.extension in my Hadoop conf and in the Spark context's Hadoop conf, but that is not working either.
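
For reference, here is a minimal sketch of what I mean by setting it on the Spark context's Hadoop conf (the app name is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical minimal setup; the app name is a placeholder.
    val sc = new SparkContext(new SparkConf().setAppName("avro-without-extension"))

    // Ask the Avro input format not to skip files that lack the .avro extension.
    sc.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")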

mkanchwala avatar Aug 12 '15 11:08 mkanchwala

I had to remove the check for it:

https://github.com/mkanchwala/spark-avro/blob/update_artifacts/src/main/scala/com/databricks/spark/avro/AvroRelation.scala

Can you please take care of this change in 1.1.0 as well? I think the .avro extension check is also present there, and it could potentially be incompatible with other frameworks.

mkanchwala avatar Aug 13 '15 06:08 mkanchwala

/cc @marmbrus, how should we handle this? Simply omitting the check probably isn't sufficient in case there are Hadoop temp files in the same directory, right?

JoshRosen avatar Aug 25 '15 18:08 JoshRosen

Actually, this issue looks like a possible duplicate of #40.

JoshRosen avatar Aug 25 '15 18:08 JoshRosen

For schema discovery in parquet I think we look for any file, regardless of extension (but filter out files that start with . or _).
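
A rough sketch of that kind of filtering over a Hadoop FileSystem listing (a hypothetical helper for illustration, not the actual spark-avro or parquet code):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Hypothetical helper: pick up data files regardless of extension,
    // but skip hidden/temporary entries whose names start with "." or "_".
    def listDataFiles(dir: String, conf: Configuration): Seq[Path] = {
      val dirPath = new Path(dir)
      val fs = dirPath.getFileSystem(conf)
      fs.listStatus(dirPath).toSeq
        .filter(_.isFile)
        .map(_.getPath)
        .filterNot { p =>
          val name = p.getName
          name.startsWith(".") || name.startsWith("_")
        }
    }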

marmbrus avatar Aug 25 '15 18:08 marmbrus

@JoshRosen yup! I am facing the same problem. For the time being, in my production environment, I removed the .avro extension check (from the update_artifacts branch) and rebuilt it. But as per the referenced issue, I am now facing a problem with a multi-level filesystem hierarchy, and the #75 issue.

So can you guys help me out here?

Thanks

mkanchwala avatar Aug 26 '15 05:08 mkanchwala

What version was this observed in? I contributed a patch that was merged back in April to make spark-avro observe the same property that the underlying Hadoop InputFormat checks for: https://github.com/databricks/spark-avro/pull/43

master should already work if you set that - in fact it does for us, we're using a snapshot we built ourselves in our systems until a release is out.

jaley avatar Aug 26 '15 06:08 jaley

Currently I am using spark-avro_2.10 - 1.0.0 and Apache Spark 1.4.1

mkanchwala avatar Aug 26 '15 06:08 mkanchwala

Do we have a test case for this scenario? Even if we believe that it has been fixed, we should probably add a test to make sure it doesn't break in the future.
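
Something along these lines could work (a hypothetical sketch only, assuming the Spark 1.x DataFrame read/write API; it is not an actual test from this repo):

    import java.io.File
    import org.apache.spark.sql.SQLContext

    // Hypothetical test sketch: write Avro, strip the .avro extension from the
    // part files, then read the directory back with the ignore-extension flag off.
    def testReadWithoutExtension(sqlContext: SQLContext, tempDir: File): Unit = {
      val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")
      val outDir = new File(tempDir, "avro-out").getAbsolutePath
      df.write.format("com.databricks.spark.avro").save(outDir)

      // Rename the part files so they no longer end in .avro.
      new File(outDir).listFiles()
        .filter(_.getName.endsWith(".avro"))
        .foreach(f => f.renameTo(new File(f.getParent, f.getName.stripSuffix(".avro"))))

      sqlContext.sparkContext.hadoopConfiguration
        .set("avro.mapred.ignore.inputs.without.extension", "false")

      val readBack = sqlContext.read.format("com.databricks.spark.avro").load(outDir)
      assert(readBack.count() == 2)
    }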

JoshRosen avatar Aug 26 '15 06:08 JoshRosen

@JoshRosen Sorry, nope, no test that I'm aware of. My patch didn't add one at least.

@mkanchwala Perhaps try building a snapshot from master to see if you get the same problem? The patch I mentioned didn't make the 1.0.0 release.

jaley avatar Aug 26 '15 07:08 jaley

OK, sure. Let me try that with the current master branch (databricks/spark-avro/master).

mkanchwala avatar Aug 26 '15 07:08 mkanchwala

It got stuck on the save call, i.e. df.save(outpath, "com.databricks.spark.avro"). I've also used unionAll to combine everything into a single DataFrame.
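
For context, the write path roughly looks like this (df1, df2, and outpath are placeholders for the actual DataFrames and output location in my job):

    // Hypothetical DataFrames built earlier in the job; unionAll is the
    // Spark 1.x name for DataFrame union.
    val combined = df1.unionAll(df2)

    // Spark 1.3/1.4-era save API; newer code would use combined.write.format(...).save(outpath).
    combined.save(outpath, "com.databricks.spark.avro")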

    15/08/26 11:39:32 INFO avro.AvroRelation: using snappy for Avro output
    15/08/26 11:39:32 INFO output.DirectFileOutputCommitter: Nothing to setup since the outputs are written directly.
    15/08/26 11:39:32 INFO spark.SparkContext: Starting job: save at DataClassifierApp.scala:84
    15/08/26 11:39:32 INFO scheduler.DAGScheduler: Got job 2 (save at DataClassifierApp.scala:84) with 2 output partitions (allowLocal=false)
    15/08/26 11:39:32 INFO scheduler.DAGScheduler: Final stage: ResultStage 3(save at DataClassifierApp.scala:84)
    15/08/26 11:39:32 INFO scheduler.DAGScheduler: Parents of final stage: List()
    15/08/26 11:39:32 INFO scheduler.DAGScheduler: Missing parents: List()
    15/08/26 11:39:32 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[269] at createDataFrame at DataClassifierApp.scala:60), which has no missing parents
    15/08/26 11:39:32 INFO storage.MemoryStore: ensureFreeSpace(76896) called with curMem=7812634, maxMem=901262868
    15/08/26 11:39:32 INFO storage.MemoryStore: Block broadcast_61 stored as values in memory (estimated size 75.1 KB, free 852.0 MB)
    15/08/26 11:39:32 INFO storage.MemoryStore: ensureFreeSpace(27279) called with curMem=7889530, maxMem=901262868
    15/08/26 11:39:32 INFO storage.MemoryStore: Block broadcast_61_piece0 stored as bytes in memory (estimated size 26.6 KB, free 852.0 MB)
    15/08/26 11:39:32 INFO storage.BlockManagerInfo: Added broadcast_61_piece0 in memory on localhost:45277 (size: 26.6 KB, free: 858.8 MB)
    15/08/26 11:39:32 INFO spark.SparkContext: Created broadcast 61 from broadcast at DAGScheduler.scala:874
    15/08/26 11:39:32 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 3 (MapPartitionsRDD[269] at createDataFrame at DataClassifierApp.scala:60)
    15/08/26 11:39:32 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 2 tasks

I think there is some performance-related issue with the latest snapshot, but this patch may well have achieved what we needed: I was able to read the Avro files without the extension.

    15/08/26 11:34:18 INFO s3n.S3NativeFileSystem: Opening 's3://sample-bucket/sample-data/f09cae84-9a9a-48c3-bc7e-102d84bc21b8-000115' for reading
    15/08/26 11:34:19 INFO s3n.S3NativeFileSystem: Stream for key 'sample-data/f09cae84-9a9a-48c3-bc7e-102d84bc21b8-000115' seeking to position '134408'
    15/08/26 11:34:19 INFO rdd.HadoopRDD: Input split: 's3://sample-bucket/sample-data/f09cae84-9a9a-48c3-bc7e-102d84bc21b8-000096:0+145883
    15/08/26 11:34:19 INFO s3n.S3NativeFileSystem: Opening 's3://sample-bucket/sample-data/f09cae84-9a9a-48c3-bc7e-102d84bc21b8-000096' for reading
    15/08/26 11:34:19 INFO executor.Executor: Finished task 37.0 in stage 0.0 (TID 37). 1926 bytes result sent to driver

Can we add test cases for this? It is something important to take care of going forward, for cross-platform compatibility.

Thanks Guys

mkanchwala avatar Aug 26 '15 12:08 mkanchwala

We are using spark-avro_2.10 - 1.0.0 and Apache Spark 1.6 and we have found the same issue. We cannot load avro files without the ".avro" extension. What can we do to load avro files without extension?

laiadescamps avatar Jan 25 '16 14:01 laiadescamps

@laiadescamps Can you try

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")
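
and then load as usual; something like this should pick up the extension-less files (the path is just a placeholder):

    val df = sqlContext.read.format("com.databricks.spark.avro").load("s3://sample_bucket/sample-data/")
    df.count()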

mkanchwala avatar Jan 26 '16 07:01 mkanchwala

thanks a lot @mkanchwala ! It works! Easy and simple.

laiadescamps avatar Jan 26 '16 08:01 laiadescamps

Hello @mkanchwala, I am a newbie to Spark and am facing an issue while reading Avro files that do not have the .avro extension. Kindly help!

Spark version: 1.3
Jar: spark-avro_2.10-1.0.0.jar
Hive: hive-1.1.0-cdh5.4.4

I am using Hive's INSERT INTO to write a snappy-compressed Avro file, which is written without the .avro extension. I set avro.mapred.ignore.inputs.without.extension to false as suggested, but I am still getting the exception. Please see the steps below.

    spark-shell --jars spark-avro_2.10-1.0.0.jar
    SQL context available as sqlContext.

    scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5e9c697a

    scala> import sqlContext._
    import sqlContext._

    scala> import com.databricks.spark.avro._
    import com.databricks.spark.avro._

    scala> sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")

    scala> val ufos = sqlContext.avroFile("/group/lzrio_ods/db/lzrio_ods_uda.db/tbl_rem_incr/000063_0")
    java.lang.RuntimeException: Could not find .avro file with schema at /group/lzrio_ods/db/lzrio_ods_uda.db/tbl_rem_incr/000063_0
      at scala.sys.package$.error(package.scala:27)
      at com.databricks.spark.avro.AvroRelation$$anonfun$4.apply(AvroRelation.scala:109)
      at com.databricks.spark.avro.AvroRelation$$anonfun$4.apply(AvroRelation.scala:109)
      at scala.Option.getOrElse(Option.scala:120)
      at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:109)
      at com.databricks.spark.avro.AvroRelation.<init>(AvroRelation.scala:53)
      at com.databricks.spark.avro.package$AvroContext.avroFile(package.scala:27)

narendracs avatar Feb 25 '16 05:02 narendracs

HI @narendracs

In your case I can see you're using Spark 1.3.x. I would suggest upgrading to 1.5.x or 1.6.x and trying again with the latest spark-avro 2.x snapshot (it includes the patch for this). Or, if you want to stick with your current versions, I would suggest building a spark-avro 1.x snapshot with the patch applied and trying again with that jar.

mkanchwala avatar Feb 25 '16 09:02 mkanchwala

Thank you so much ... Upgrading the cluster is not an option, I'll try to build a snapshot!

Thanks


narendracs avatar Feb 25 '16 16:02 narendracs

Similar issue on Spark 1.6.1 with com.databricks:spark-avro_2.10:2.0.1.

With the same content in two files, one named 'f' without the .avro extension and the other named 'f.avro' with the extension (i.e. cp f.avro f):

    val df = sqlContext.read.avro("f.avro") // works as expected
    df.count // gives a non-zero count

If avro.mapred.ignore.inputs.without.extension is set to false, then val df = sqlContext.read.avro("f") gives no error and df.printSchema works, but df.count is always 0. This puzzles me: it definitely inferred the schema, but found no records?

If avro.mapred.ignore.inputs.without.extension is not set, then val df = sqlContext.read.avro("f") fails with "java.lang.RuntimeException: No avro files present at .."

yiwang avatar Jul 20 '16 18:07 yiwang

@narendracs I checked your post and it seems to work fine on my setup. I was using Spark 2.0.1.

I set:

    spark.sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "false")

and then:

    val df = spark.read.format("com.databricks.spark.avro").load("hdfs:///user/nifi/data/**")

jomach avatar Feb 18 '17 13:02 jomach