parquet4s
ProjectionSchema self-inconsistency with partitioned source
Hi,
I have run into an inconsistency in how the projection schema is handled.
When reading a partitioned parquet dataset via ParquetReader.projectedGeneric(expectedSchema).options(...).read, the output rows contain the partitioning columns even when expectedSchema does not include them.
When reading a non-partitioned dataset, the output rows contain only the columns listed in expectedSchema.
I am not sure which behavior is the "proper" one, but the observed behavior is not self-consistent.
Parquet4s version: 2.18.0. Here is a test case:
import scala.jdk.CollectionConverters._
import scala.util.Using

import org.apache.hadoop.fs.{Path => HPath}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.Type.Repetition
import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, Type, Types}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}

val rows = List(
  Row("a_value", "b_value1", "c_value1"),
  Row("a_value", "b_value1", null)
)
val df = spark.createDataFrame(
  rows.asJava,
  StructType(
    Array(
      StructField("a", StringType),
      StructField("b", StringType),
      StructField("c", StringType)
    )
  )
)
df.write.parquet("s3a://sample-bucket/p0")
df.write.partitionBy("a").parquet("s3a://sample-bucket/p1")

val expectedSchema = new MessageType(
  "root",
  List[Type](
    Types
      .primitive(PrimitiveTypeName.BINARY, Repetition.OPTIONAL)
      .as(LogicalTypeAnnotation.stringType())
      .named("c")
  ).asJava
)

def showP4S(path: String): List[String] = {
  Using(
    ParquetReader
      .projectedGeneric(expectedSchema)
      .options(opt)
      .read(Path(new HPath(path)))
  )(_.iterator.map { row =>
    row.foldLeft("") { case (str, (k, v)) => str + s"$k = $v, " }
  }.toList).get
}

println("p4s not-partitioned")
showP4S("s3a://sample-bucket/p0").foreach(println)
println("p4s partitioned")
showP4S("s3a://sample-bucket/p1").foreach(println)
println("p4s not-partitioned")
showP4S("s3a://sample-bucket/p0").foreach(println)
println("p4s partitioned")
showP4S("s3a://sample-bucket/p1").foreach(println)
The output is:
p4s not-partitioned
c = BinaryValue(Binary{8 constant bytes, [99, 95, 118, 97, 108, 117, 101, 49]}),
c = NullValue,
p4s partitioned
c = BinaryValue(Binary{8 constant bytes, [99, 95, 118, 97, 108, 117, 101, 49]}), a = BinaryValue(Binary{"a_value"}),
c = NullValue, a = BinaryValue(Binary{"a_value"}),
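Until the intended behavior is settled, a possible client-side workaround is to drop every field whose name is not in the projection before using a row. Below is a minimal sketch of that filtering logic only; it uses plain (name, value) pairs to stand in for parquet4s records, and `restrict` is a hypothetical helper, not a parquet4s API:

```scala
object ProjectionFilter {
  // Keep only the (column, value) pairs whose column name appears
  // in the set of columns declared by the projection schema.
  def restrict(row: Seq[(String, Any)], projectedColumns: Set[String]): Seq[(String, Any)] =
    row.filter { case (name, _) => projectedColumns.contains(name) }

  def main(args: Array[String]): Unit = {
    // Column names taken from expectedSchema (only "c" is projected).
    val projected = Set("c")
    // A row as returned from the partitioned source: the partition column "a" leaks in.
    val partitionedRow = Seq("c" -> "c_value1", "a" -> "a_value")
    val filtered = restrict(partitionedRow, projected)
    println(filtered.map { case (k, v) => s"$k = $v" }.mkString(", "))
  }
}
```

Applied to the partitioned output above, this would make both sources yield rows with only the `c` column, at the cost of discarding the partition values entirely.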