manifest doesn't list all avro files
We use the Databricks spark-redshift driver to insert data into Redshift.
dataFrame.write()
    .format("com.databricks.spark.redshift")
    .option("url", fullJDBCUri)
    .option("tempdir", s3TempDir)
    .option("dbtable", dbTable)
    .option("extracopyoptions", "TRUNCATECOLUMNS")
    .mode(SaveMode.Append)
    .save();
Not all of the data is loaded into Redshift.
To find the cause, we tried to load a very small amount of data, such as 100 records.
This allowed us to discover that manifest.json does not list all the files.
Example:
{"entries": [{"url":"s3://my-bucket/038c60ff-c931-4d57-b35f-ac018f174bdf/part-r-00003-5fc64882-c214-4627-b094-e6a25a792f28.avro", "mandatory":true}]}
Out of the files part-r-00000 through part-r-00005, only part-r-00003 is listed.
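The mismatch between the manifest and the temp directory can be checked mechanically. Below is a minimal sketch (not part of our job; class and file names are hypothetical) that extracts the "url" entries from a spark-redshift manifest and reports which Avro files are absent from it:

```java
import java.util.*;
import java.util.regex.*;

public class ManifestCheck {
    // Extract every "url" value from a spark-redshift manifest JSON string.
    static List<String> manifestUrls(String manifestJson) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile("\"url\"\\s*:\\s*\"([^\"]+)\"").matcher(manifestJson);
        while (m.find()) urls.add(m.group(1));
        return urls;
    }

    // Return the Avro files that the manifest does not list.
    static List<String> unlisted(List<String> avroFiles, String manifestJson) {
        Set<String> listed = new HashSet<>(manifestUrls(manifestJson));
        List<String> missing = new ArrayList<>();
        for (String f : avroFiles) if (!listed.contains(f)) missing.add(f);
        return missing;
    }

    public static void main(String[] args) {
        String manifest =
            "{\"entries\": [{\"url\":\"s3://my-bucket/x/part-r-00003.avro\", \"mandatory\":true}]}";
        List<String> files = Arrays.asList(
            "s3://my-bucket/x/part-r-00000.avro",
            "s3://my-bucket/x/part-r-00003.avro",
            "s3://my-bucket/x/part-r-00005.avro");
        // prints the two files that the manifest omits
        System.out.println(unlisted(files, manifest));
    }
}
```

In our case, running such a comparison against the temp directory shows five of the six part files unlisted.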
This looks like a bug in the driver.
This is how we use the driver:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-redshift_2.10</artifactId>
<version>1.1.0</version>
</dependency>
We use 1.1.0 because we use Spark 1.6. Our Spark jobs run on EC2 with 1-16 executors and 1-2 cores per executor; the data points are validated JSON docs of up to 4 KB; there is plenty of memory and no piling up of Spark batches. I will provide additional info if necessary.
I am attaching a sample s3 directory that was not inserted fully.
Do the missing Avro files contain rows, or are they empty? Redshift's Avro reader has a bug where it crashes on Avro files which contain no rows, so our code purposely excludes those files from the manifest.
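To illustrate the behavior described above: conceptually, the manifest is built only from files with at least one row. This is a hypothetical sketch, not the driver's actual code; it assumes per-file record counts are already known:

```java
import java.util.*;

public class ManifestFilter {
    // Keep only files with at least one row; empty Avro files are
    // deliberately excluded because Redshift's Avro reader crashes on them.
    static List<String> manifestEntries(Map<String, Long> recordCounts) {
        List<String> keep = new ArrayList<>();
        for (Map.Entry<String, Long> e : recordCounts.entrySet())
            if (e.getValue() > 0) keep.add(e.getKey());
        Collections.sort(keep);
        return keep;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("part-r-00000.avro", 0L);
        counts.put("part-r-00003.avro", 4L);
        counts.put("part-r-00005.avro", 0L);
        // only the file with rows makes it into the manifest
        System.out.println(manifestEntries(counts));
    }
}
```

Under this logic, a manifest that lists fewer files than the temp directory contains is expected whenever some part files are empty.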
@danielnuriyev, in the example that you attached to this GitHub issue there are six Avro files, numbered part-r-00000 through part-r-00005, but the manifest only lists one of them (part-r-00003). If the five unlisted files contain no rows, then their exclusion from the manifest is deliberate rather than a bug.
I used spark-shell --packages com.databricks:spark-avro_2.11:3.1.0 to spin up a Spark shell with the avro library, then used
scala> spark.read.format("com.databricks.spark.avro").load("/Users/joshrosen/Downloads/038c60ff-c931-4d57-b35f-ac018f174bdf/*.avro").rdd.glom.map(_.size).collect()
res3: Array[Int] = Array(4, 0, 0, 0, 0, 0)
and verified that only one of the files contains records. This is consistent with my hypothesis in my previous comment.
Thank you for digging in. I am attaching another zip that contains a manifest and 10 Avro files. All of the files are empty except part-r-00003 and part-r-00005. The manifest lists only part-r-00003, and the data contained in part-r-00005 was not inserted into the DB. missing-avro.zip We suspect S3's eventual consistency plays a role.
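If eventual consistency is the culprit, one crude mitigation is to re-list the temp directory until two consecutive listings agree before building the manifest. This is only a sketch of the idea, not something the driver does; the S3 listing is abstracted as a supplier so the loop itself is self-contained:

```java
import java.util.*;
import java.util.function.Supplier;

public class StableListing {
    // Poll a listing until two consecutive reads agree, as a crude guard
    // against an eventually consistent LIST operation. Returns the last
    // listing seen if no agreement is reached within maxTries reads.
    static Set<String> stableList(Supplier<Set<String>> list, int maxTries) {
        Set<String> prev = list.get();
        for (int i = 1; i < maxTries; i++) {
            Set<String> cur = list.get();
            if (cur.equals(prev)) return cur;
            prev = cur;
        }
        return prev; // gave up; a caller might prefer to fail here
    }

    public static void main(String[] args) {
        // Simulated listing that "converges" after the second read.
        Iterator<Set<String>> reads = Arrays.asList(
            new HashSet<>(Arrays.asList("part-r-00003.avro")),
            new HashSet<>(Arrays.asList("part-r-00003.avro", "part-r-00005.avro")),
            new HashSet<>(Arrays.asList("part-r-00003.avro", "part-r-00005.avro"))
        ).iterator();
        // prints 2: both files are visible once the listing stabilizes
        System.out.println(stableList(reads::next, 5).size());
    }
}
```

A real version would wrap the S3 SDK's list call; two agreeing reads still do not guarantee consistency, so this only reduces the window, it does not close it.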
Hi All, Has there been any further investigation into this issue? Has it been confirmed to be an eventual consistency issue?
@danielnuriyev - I suspect the issue you are seeing is what I've described here: https://github.com/spark-redshift-community/spark-redshift/issues/74
I know it's been years since you reported this issue. Do you remember if you ever got to the bottom of it?