spark-avro
Big performance issue when moving from 2.0.1 to 4.0.0 when loading column of type ArrayType
Hello,
We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from com.databricks:spark-avro_2.10:2.0.1 to com.databricks:spark-avro_2.11:4.0.0.
We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:
df.printSchema
root
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- _0: string (nullable = true)
| | |-- _1: integer (nullable = true)
| | |-- _2: long (nullable = true)
In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.
At first I suspected our script and user-defined functions were the slow part. But then I updated the script to simply read and write our file:
spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")
And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
To give you more info on the files we are loading: there are ~2,500,000 entries, and the number of struct elements in the array can be quite high:
val df = spark.read.avro("/path/to/file/in")
df.select(size(col("field3")).as("size")).select(avg(col("size")), min(col("size")), max(col("size"))).show
+-----------------+---------+---------+
| avg(size)|min(size)|max(size)|
+-----------------+---------+---------+
|133.0953942943108| 1| 143220|
+-----------------+---------+---------+
Could you look into this? If you need any additional information feel free to ask!
The read and write paths are indeed slower in the current release. In version 2.0.1:
read path: Avro => Row
write path: Row => Avro
while in 4.0:
read path: Avro => Row => InternalRow
write path: InternalRow => Row => Avro
The conversion between Row and InternalRow is slow.
The upside is that computation is faster: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
In the next release, this problem should be fixed as:
read path: Avro => InternalRow
write path: InternalRow => Avro
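The difference between the two pipelines can be sketched in a few lines of Scala. This is purely illustrative and not the actual spark-avro code: AvroRec, SparkRow, and InternalRep are hypothetical stand-ins for Avro's GenericRecord, Spark's Row, and Spark's InternalRow, and the counter only models the number of per-record conversion steps, not their real cost.

```scala
// Illustrative sketch of the read paths; all types here are hypothetical
// stand-ins, not the real Avro/Spark classes.
object ConversionPaths {
  case class AvroRec(fields: Seq[Any])
  case class SparkRow(values: Seq[Any])
  case class InternalRep(values: Seq[Any])

  // Counts every per-record conversion step performed.
  var conversions: Int = 0

  def avroToRow(r: AvroRec): SparkRow = { conversions += 1; SparkRow(r.fields) }
  def rowToInternal(r: SparkRow): InternalRep = { conversions += 1; InternalRep(r.values) }
  def avroToInternal(r: AvroRec): InternalRep = { conversions += 1; InternalRep(r.fields) }

  def main(args: Array[String]): Unit = {
    val records = Seq.fill(1000)(AvroRec(Seq("a", 1, 2L)))

    conversions = 0
    records.map(avroToRow).map(rowToInternal) // 4.0.0 read path: Avro => Row => InternalRow
    println(s"4.0.0-style path: $conversions conversions")   // 2000

    conversions = 0
    records.map(avroToInternal)               // next-release path: Avro => InternalRow
    println(s"direct path: $conversions conversions")        // 1000
  }
}
```

In the real library each Row => InternalRow step also has to re-walk nested structures, which is presumably why records with large arrays of structs like field3 are hit hardest.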
Many thanks for the explanation! Do you have an ETA for the next release?
There is no ETA yet. I will comment on this issue once it is fixed.
May I ask if this issue has already been fixed?
Our test Avro file still shows more than a 10% performance degradation compared to Spark 1.6.