spark-avro
Big performance issue when moving from 2.0.1 to 4.0.0 when loading column of type ArrayType
Hello,
We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from com.databricks:spark-avro_2.10:2.0.1 to com.databricks:spark-avro_2.11:4.0.0.
We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:
df.printSchema
root
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- _0: string (nullable = true)
| | |-- _1: integer (nullable = true)
| | |-- _2: long (nullable = true)
In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.
At first I suspected our script and user-defined functions were the slow part. But then I updated the script to simply read and write our file:
spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")
And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
To give you more info on the files we are loading: there are ~2,500,000 entries, and the number of struct elements in the array can be quite high:
val df = spark.read.avro("/path/to/file/in")
df.select(size(col("field3")).as("size")).select(avg(col("size")), min(col("size")), max(col("size"))).show
+-----------------+---------+---------+
| avg(size)|min(size)|max(size)|
+-----------------+---------+---------+
|133.0953942943108| 1| 143220|
+-----------------+---------+---------+
Could you look into this? If you need any additional information feel free to ask!
The read and write paths are indeed slower in the current release. In version 2.0.1:
read path: Avro => Row
write path: Row => Avro
while in 4.0:
read path: Avro => Row => InternalRow
write path: InternalRow => Row => Avro
The conversion between Row and InternalRow is slow.
The upside is that computation is faster: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
In the next release, this problem should be fixed as:
read path: Avro => InternalRow
write path: InternalRow => Avro
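The difference between the two pipelines can be sketched in a few lines of Scala. This is purely illustrative and not the actual spark-avro code: AvroRec, SparkRow, and InternalRep are hypothetical stand-ins for Avro's GenericRecord, Spark's Row, and Spark's InternalRow, and the counter only models the number of per-record conversion steps, not their real cost.

```scala
// Illustrative sketch of the read paths; all types here are hypothetical
// stand-ins, not the real Avro/Spark classes.
object ConversionPaths {
  case class AvroRec(fields: Seq[Any])
  case class SparkRow(values: Seq[Any])
  case class InternalRep(values: Seq[Any])

  // Counts every per-record conversion step performed.
  var conversions: Int = 0

  def avroToRow(r: AvroRec): SparkRow = { conversions += 1; SparkRow(r.fields) }
  def rowToInternal(r: SparkRow): InternalRep = { conversions += 1; InternalRep(r.values) }
  def avroToInternal(r: AvroRec): InternalRep = { conversions += 1; InternalRep(r.fields) }

  def main(args: Array[String]): Unit = {
    val records = Seq.fill(1000)(AvroRec(Seq("a", 1, 2L)))

    conversions = 0
    records.map(avroToRow).map(rowToInternal) // 4.0.0 read path: Avro => Row => InternalRow
    println(s"4.0.0-style path: $conversions conversions")   // 2000

    conversions = 0
    records.map(avroToInternal)               // next-release path: Avro => InternalRow
    println(s"direct path: $conversions conversions")        // 1000
  }
}
```

In the real library each Row => InternalRow step also has to re-walk nested structures, which is presumably why records with large arrays of structs like field3 are hit hardest.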
Many thanks for the explanation! Do you have an ETA for the next release?
There is no ETA yet. I will comment on this issue once it is fixed.
May I ask if this issue has already been fixed?
Our test Avro file still shows more than a 10% performance degradation compared to Spark 1.6.