parquet4s
ProjectionSchema self-inconsistency with partitioned source
Hi,
I have run into an inconsistency in how the projection schema is handled.
When reading a partitioned parquet dataset via ParquetReader.projectedGeneric(expectedSchema).options(...).read, the output rows contain the partitioning columns even when expectedSchema does not include them.
When reading a non-partitioned dataset, the output rows contain only the columns listed in expectedSchema.
I am not sure which behavior is the "proper" one, but the observed behavior is not self-consistent.
Parquet4s version: 2.18.0. Here is a test case:
import scala.jdk.CollectionConverters._
import scala.util.Using

import org.apache.hadoop.fs.{Path => HPath}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.Type.Repetition
import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, Type, Types}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}

val rows = List(
  Row("a_value", "b_value1", "c_value1"),
  Row("a_value", "b_value1", null)
)
val df = spark.createDataFrame(
  rows.asJava,
  StructType(
    Array(
      StructField("a", StringType),
      StructField("b", StringType),
      StructField("c", StringType)
    )
  )
)
df.write.parquet("s3a://sample-bucket/p0")
df.write.partitionBy("a").parquet("s3a://sample-bucket/p1")

val expectedSchema = new MessageType(
  "root",
  List[Type](
    Types
      .primitive(PrimitiveTypeName.BINARY, Repetition.OPTIONAL)
      .as(LogicalTypeAnnotation.stringType())
      .named("c")
  ).asJava
)

def showP4S(path: String): List[String] = {
  Using(
    ParquetReader
      .projectedGeneric(expectedSchema)
      .options(opt)
      .read(Path(new HPath(path)))
  )(_.iterator.map { row =>
    row.foldLeft("") { case (str, (k, v)) => str + s"$k = $v, " }
  }.toList).get
}

println("p4s not-partitioned")
showP4S("s3a://sample-bucket/p0").foreach(println)
println("p4s partitioned")
showP4S("s3a://sample-bucket/p1").foreach(println)
println("p4s not-partitioned")
showP4S("s3a://sample-bucket/p0").foreach(println)
println("p4s partitioned")
showP4S("s3a://sample-bucket/p1").foreach(println)
The output is:
p4s not-partitioned
c = BinaryValue(Binary{8 constant bytes, [99, 95, 118, 97, 108, 117, 101, 49]}),
c = NullValue,
p4s partitioned
c = BinaryValue(Binary{8 constant bytes, [99, 95, 118, 97, 108, 117, 101, 49]}), a = BinaryValue(Binary{"a_value"}),
c = NullValue, a = BinaryValue(Binary{"a_value"}),
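Until the intended behavior is settled, a possible client-side workaround is to drop every field whose name is not in the projection before using a row. Below is a minimal sketch of that filtering logic only; it uses plain (name, value) pairs to stand in for parquet4s records, and `restrict` is a hypothetical helper, not a parquet4s API:

```scala
object ProjectionFilter {
  // Keep only the (column, value) pairs whose column name appears
  // in the set of columns declared by the projection schema.
  def restrict(row: Seq[(String, Any)], projectedColumns: Set[String]): Seq[(String, Any)] =
    row.filter { case (name, _) => projectedColumns.contains(name) }

  def main(args: Array[String]): Unit = {
    // Column names taken from expectedSchema (only "c" is projected).
    val projected = Set("c")
    // A row as returned from the partitioned source: the partition column "a" leaks in.
    val partitionedRow = Seq("c" -> "c_value1", "a" -> "a_value")
    val filtered = restrict(partitionedRow, projected)
    println(filtered.map { case (k, v) => s"$k = $v" }.mkString(", "))
  }
}
```

Applied to the partitioned output above, this would make both sources yield rows with only the `c` column, at the cost of discarding the partition values entirely.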