iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Support non-optional union types and column projection in complex union for Avro

Open yiqiangin opened this issue 2 years ago • 2 comments

This PR consists of two parts

  • the support for non-optional union types which is cherry picked from the unmerged PR https://github.com/apache/iceberg/pull/4242
  • the support for column projection in complex union which is an extension work of the previous PR

In Iceberg, there are two types of schema: table schema and file schema. Table schema refers to the schema defined in Iceberg table format. File schema refers to the schema of the data stored in underlying data file. If the data file is defined in Avro format, file schema is also referred as Avro file schema. The complex union refers to a union consisting of multiple types. While the union type is natively supported in Avro file schema, there is no union type defined in Iceberg table format. Therefore, the complex union is represented by a struct with multiple fields in Iceberg table schema. Each field in the struct is associated with a type in the union. In normal case, the number of fields in the struct equals to the number of types in the union plus one (for the tag field). In case of the column projection on union type in the query, the fields of the struct in Iceberg table schema are pruned according to the types projected in the query. In contrast, the union in Avro file schema is not pruned in case of column projection, as all the types in the union are needed to read the data from Avro data file successfully. Also the value readers to read the data of all types in the union from Avro data file are created based on the types in the union from Avro file schema and the fields in the struct of Iceberg table schema. The major problem to be solved here is to correlate the type in Avro file schema with the corresponding field of the struct in Iceberg table schema, especially in case that only a part of fields exist in the struct of Iceberg table schema with column projection.

The main idea of the solution is as follows:

  • Build the mapping from the type name of the union in Avro file schema to the id of the corresponding field of the struct in Iceberg table schema.
  • When value readers are created, find the corresponding field in Iceberg table schema for a type in the union of Avro file schema with the id stored in the mapping which key is the name of the type in Avro file schema.

The details of the implementation are as follows:

  • The mapping from the field name in Avro file schema to the field id in Iceberg schema is derived during the conversion from Avro file schema to Iceberg table schema in the function of AvroSchemaUtil.convertToDeriveNameMapping and the class of SchemaToType.
  • The mapping of direct child fields of an Avro file schema field is stored as a property named AvroFieldNameToIcebergId in this Avro file schema field, therefore it can be easily accessed when Avro schema is traversed to generate the correspond readers to read Avro data file.
  • In case of union, the key of the mapping is the name of the branch in the union.
  • In case of complex union, the code of AvroSchemaWithTypeVisitor.visitUnion() first gets the mapping from the property of Avro file schema, then get the field id in Iceberg table schema using the type name in Avro file schema, finally it uses the field id to get the field type in Iceberg table schema:
    • if the corresponding field in Iceberg table schema exists, the field is used to create the reader together with Avro file schema node;
    • if the field for the given field id does not exist in Iceberg table schema (which means this field is not projected in Iceberg schema), a pseudo branch type is created based on the corresponding Avro file schema node to facilitate the creation of the reader.
  • In the class of UnionReader, the rows read from Avro data file are filtered according to the fields existing in Iceberg table schema.

yiqiangin avatar Sep 04 '22 17:09 yiqiangin

It seems references to "Avro schema" in the description are ambiguous. Could you disambiguate them? For example when saying In contrast, the union schema of Avro schema is not pruned in case of column projection, it is not clear which Avro schema you are referring to. This might apply to other schema type references. File schema is an example of an unambiguous reference.

wmoustafa avatar Sep 21 '22 21:09 wmoustafa

It seems references to "Avro schema" in the description are ambiguous. Could you disambiguate them? For example when saying In contrast, the union schema of Avro schema is not pruned in case of column projection, it is not clear which Avro schema you are referring to. This might apply to other schema type references. File schema is an example of an unambiguous reference.

Good point. The description is revised to remove the ambigulity.

yiqiangin avatar Sep 22 '22 17:09 yiqiangin