manifest doesn't list all avro files
We use the Databricks spark-redshift driver to insert data into Redshift.
dataFrame.write()
    .format("com.databricks.spark.redshift")
    .option("url", fullJDBCUri)
    .option("tempdir", s3TempDir)
    .option("dbtable", dbTable)
    .option("extracopyoptions", "TRUNCATECOLUMNS")
    .mode(SaveMode.Append)
    .save();
Not all of the data is loaded into Redshift.
To find the cause, we tried to load a very small amount of data, such as 100 records.
This allowed us to discover that manifest.json does not list all the files.
Example:
{"entries": [{"url":"s3://my-bucket/038c60ff-c931-4d57-b35f-ac018f174bdf/part-r-00003-5fc64882-c214-4627-b094-e6a25a792f28.avro", "mandatory":true}]}
Out of the files part-r-00000 through part-r-00005, only part-r-00003 is listed.
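The mismatch between the manifest and the temp directory can be checked mechanically. Below is a minimal sketch (not part of our job; class and file names are hypothetical) that extracts the "url" entries from a spark-redshift manifest and reports which Avro files are absent from it:

```java
import java.util.*;
import java.util.regex.*;

public class ManifestCheck {
    // Extract every "url" value from a spark-redshift manifest JSON string.
    static List<String> manifestUrls(String manifestJson) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile("\"url\"\\s*:\\s*\"([^\"]+)\"").matcher(manifestJson);
        while (m.find()) urls.add(m.group(1));
        return urls;
    }

    // Return the Avro files that the manifest does not list.
    static List<String> unlisted(List<String> avroFiles, String manifestJson) {
        Set<String> listed = new HashSet<>(manifestUrls(manifestJson));
        List<String> missing = new ArrayList<>();
        for (String f : avroFiles) if (!listed.contains(f)) missing.add(f);
        return missing;
    }

    public static void main(String[] args) {
        String manifest =
            "{\"entries\": [{\"url\":\"s3://my-bucket/x/part-r-00003.avro\", \"mandatory\":true}]}";
        List<String> files = Arrays.asList(
            "s3://my-bucket/x/part-r-00000.avro",
            "s3://my-bucket/x/part-r-00003.avro",
            "s3://my-bucket/x/part-r-00005.avro");
        // prints the two files that the manifest omits
        System.out.println(unlisted(files, manifest));
    }
}
```

In our case, running such a comparison against the temp directory shows five of the six part files unlisted.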
This looks like a bug in the driver.
This is how we use the driver:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-redshift_2.10</artifactId>
<version>1.1.0</version>
</dependency>
We use 1.1.0 because we use Spark 1.6. Our Spark jobs run on EC2 with 1-16 executors and 1-2 cores per executor; the data points are validated JSON docs of up to 4 KB; there is plenty of memory and no piling up of Spark batches. I will provide additional info if necessary.
I am attaching a sample s3 directory that was not inserted fully.
Do the missing Avro files contain rows, or are they empty? Redshift's Avro reader has a bug where it crashes on Avro files which contain no rows, so our code purposely excludes those files from the manifest.
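To illustrate the behavior described above: conceptually, the manifest is built only from files with at least one row. This is a hypothetical sketch, not the driver's actual code; it assumes per-file record counts are already known:

```java
import java.util.*;

public class ManifestFilter {
    // Keep only files with at least one row; empty Avro files are
    // deliberately excluded because Redshift's Avro reader crashes on them.
    static List<String> manifestEntries(Map<String, Long> recordCounts) {
        List<String> keep = new ArrayList<>();
        for (Map.Entry<String, Long> e : recordCounts.entrySet())
            if (e.getValue() > 0) keep.add(e.getKey());
        Collections.sort(keep);
        return keep;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("part-r-00000.avro", 0L);
        counts.put("part-r-00003.avro", 4L);
        counts.put("part-r-00005.avro", 0L);
        // only the file with rows makes it into the manifest
        System.out.println(manifestEntries(counts));
    }
}
```

Under this logic, a manifest that lists fewer files than the temp directory contains is expected whenever some part files are empty.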
@danielnuriyev, in the example that you attached to this GitHub issue there are six Avro files, numbered part-r-00000 through part-r-00005, but the manifest only lists one of them (part-r-00003). If the five unlisted files contain no rows, then their exclusion from the manifest is deliberate rather than a bug.
I used spark-shell --packages com.databricks:spark-avro_2.11:3.1.0 to spin up a Spark shell with the avro library, then used
scala> spark.read.format("com.databricks.spark.avro").load("/Users/joshrosen/Downloads/038c60ff-c931-4d57-b35f-ac018f174bdf/*.avro").rdd.glom.map(_.size).collect()
res3: Array[Int] = Array(4, 0, 0, 0, 0, 0)
and verified that only one of the files contains records. This is consistent with my hypothesis in my previous comment.
Thank you for digging in. I am attaching another zip that contains a manifest and 10 Avro files. All of the files are empty except part-r-00003 and part-r-00005. The manifest lists only part-r-00003, and the data contained in part-r-00005 was not inserted into the DB. missing-avro.zip We suspect S3's eventual consistency plays a role.
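If eventual consistency is the culprit, one crude mitigation is to re-list the temp directory until two consecutive listings agree before building the manifest. This is only a sketch of the idea, not something the driver does; the S3 listing is abstracted as a supplier so the loop itself is self-contained:

```java
import java.util.*;
import java.util.function.Supplier;

public class StableListing {
    // Poll a listing until two consecutive reads agree, as a crude guard
    // against an eventually consistent LIST operation. Returns the last
    // listing seen if no agreement is reached within maxTries reads.
    static Set<String> stableList(Supplier<Set<String>> list, int maxTries) {
        Set<String> prev = list.get();
        for (int i = 1; i < maxTries; i++) {
            Set<String> cur = list.get();
            if (cur.equals(prev)) return cur;
            prev = cur;
        }
        return prev; // gave up; a caller might prefer to fail here
    }

    public static void main(String[] args) {
        // Simulated listing that "converges" after the second read.
        Iterator<Set<String>> reads = Arrays.asList(
            new HashSet<>(Arrays.asList("part-r-00003.avro")),
            new HashSet<>(Arrays.asList("part-r-00003.avro", "part-r-00005.avro")),
            new HashSet<>(Arrays.asList("part-r-00003.avro", "part-r-00005.avro"))
        ).iterator();
        // prints 2: both files are visible once the listing stabilizes
        System.out.println(stableList(reads::next, 5).size());
    }
}
```

A real version would wrap the S3 SDK's list call; two agreeing reads still do not guarantee consistency, so this only reduces the window, it does not close it.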
Hi All, Has there been any further investigation into this issue? Has it been confirmed to be an eventual consistency issue?
@danielnuriyev - I suspect the issue you are seeing is what I've described here: https://github.com/spark-redshift-community/spark-redshift/issues/74
I know it's been years since you reported this issue. Do you remember if you ever got to the bottom of it?