
Core: Add schema_id to ContentFile/ManifestFile

ConeyLiu opened this issue 2 years ago • 14 comments

This is the first part of #4842. It adds the schema ID to DataFile/DeleteFile/ManifestFile, which can then be used to evaluate filter expressions against the schema that each file was written with.

ConeyLiu • May 30 '22 06:05

Hi @rdblue @kbendick, could you help to review this when you are free? Thanks a lot.

ConeyLiu • May 30 '22 06:05

@ConeyLiu, thanks for breaking this up, but this is still far too large for a single PR. There's no need to update all of the writers in a single PR. Instead, this should focus on core and API classes and just read null values for the new field. You can add writes later.

rdblue • May 30 '22 15:05
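A minimal sketch of the API direction described above, assuming a nullable accessor named schemaId() on the file interface (the name and placement here are hypothetical; the PR defines the real ones):

// Hypothetical interface fragment, not the real ContentFile: the accessor is
// nullable so metadata written before the field existed reads back as null.
public interface ContentFileSketch {
  /** Returns the id of the schema this file was written with, or null if unknown. */
  Integer schemaId();
}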

Hi @rdblue, thanks for the review. I have reverted the changes for the writers. Please take another look, thanks a lot.

ConeyLiu • May 31 '22 10:05

Is this something we can do, or do we need to wait until V3?

@szehon-ho, this is something we can do because it is backward compatible. Older readers will ignore new fields, so we can add it safely. And if we don't have a schema ID for a file, then we just return null and skip the optimization.

rdblue • Jun 07 '22 22:06
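A minimal sketch of how such a backward-compatible field can be declared with Iceberg's type API, assuming the usual optional-field pattern (the field id 142 and the constant name below are made up for illustration; the PR assigns the real ones):

import org.apache.iceberg.types.Types;

class SchemaIdFieldSketch {
  // Optional, so older readers ignore the field and newer readers see null
  // for manifests written before it existed. Field id 142 is hypothetical.
  static final Types.NestedField SCHEMA_ID =
      Types.NestedField.optional(142, "schema_id", Types.IntegerType.get());
}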

A little busy recently. Will address the comments tomorrow.

ConeyLiu • Jun 09 '22 13:06

Thanks, @rdblue @szehon-ho for the review. Comments have been addressed.

ConeyLiu • Jun 10 '22 14:06

I wonder, is it possible to add a test to try to deserialize an older manifest entry without schema_id?

It seems we would need to implement customized V1Writer/V2Writer classes and the corresponding Indexed classes. Do you have any suggestions for this?

ConeyLiu • Jun 11 '22 13:06

Yea there is some limited discussion in this related issue, but I guess no good conclusions: https://github.com/apache/iceberg/issues/2542.

Maybe a test writer that creates metadata files with all optional columns as null? That way we can test all the new columns at once.

By the way, the change mostly looks good, and it's good you made the new field optional to avoid issues like the one mentioned. I just don't know enough about the serialization/deserialization code to be sure there are no other problems with previously serialized metadata, so I was hoping to have a test to verify it. Though we can leave it to @rdblue to approve if he is confident, and we can tackle the backward-compat tests on the side.

szehon-ho • Jun 13 '22 21:06

Thanks @szehon-ho for the review and suggestion.

Maybe a test writer that creates metadata files with all optional columns as null? That way we can test all the new columns at once.

I will add the test later.

ConeyLiu • Jun 15 '22 12:06
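For reference, a minimal sketch of the compatibility property being tested, using plain Avro rather than Iceberg's internal V1Writer/V2Writer classes (the record shape is made up for illustration): a reader schema that adds an optional schema_id with a null default can still deserialize records written without the field.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaIdCompatSketch {
  public static void main(String[] args) throws Exception {
    // Old writer schema: no schema_id field.
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"data_file\",\"fields\":["
            + "{\"name\":\"file_path\",\"type\":\"string\"}]}");
    // New reader schema: optional schema_id defaulting to null.
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"data_file\",\"fields\":["
            + "{\"name\":\"file_path\",\"type\":\"string\"},"
            + "{\"name\":\"schema_id\",\"type\":[\"null\",\"int\"],\"default\":null}]}");

    GenericData.Record record = new GenericData.Record(v1);
    record.put("file_path", "s3://bucket/data/00000.parquet");

    // Serialize with the old schema, as a pre-upgrade writer would.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericData.Record>(v1).write(record, encoder);
    encoder.flush();

    // Deserialize with the new schema, as an upgraded reader would.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericData.Record read =
        new GenericDatumReader<GenericData.Record>(v1, v2).read(null, decoder);

    // The missing field resolves to its default, so old metadata reads back
    // with schema_id == null and the reader can simply skip the optimization.
    System.out.println(read.get("schema_id")); // prints: null
  }
}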

Hi @rdblue @szehon-ho, I am sorry for the late update. The compatibility test has been added. Hopefully you can take another look when you are free.

ConeyLiu • Jun 29 '22 12:06

Hi @szehon-ho thanks for reviewing this again.

Also I noticed, the spec-id and schema are already written in the header of each manifest. As far as I can tell, it seems to be the right one that the manifest was written in, even after rewriteManifests. Wondering at a high level, is it adequate for the optimization you are planning?

What happens after data files are rewritten? It seems the schema of the manifest file is the current table schema, yet the manifest entries in the same manifest file could have been written with different schemas before the rewrite.

ConeyLiu • Aug 09 '22 14:08

@ConeyLiu that's a good question. I think (may be wrong) rewriteDataFiles groups files by partition/partition spec, and may not preserve the old schemas. I.e., all the data files are rewritten with the latest schema of that partition spec.

I think the situation would be the same even in your proposal to add a new schema_id field to data_file, right? After rewriteDataFiles we have to carry over the latest schema-id of each spec in order for your initially proposed optimization to be accurate, because there may be data in the new file that was written by a later schema.

szehon-ho • Aug 09 '22 17:08

I think the situation would be the same even in your proposal to add a new schema_id field to data_file, right? After rewriteDataFiles we have to carry over the latest schema-id of each spec in order for your initially proposed optimization to be accurate, because there may be data in the new file that was written by a later schema.

You are correct. The data files are written with the new spec after rewrite, so we cannot benefit from the schema evaluation because we lose the original schema information.

As far as I can tell, it seems to be the right one that the manifest was written in, even after rewriteManifests.

In RewriteManifests, we use the current table partition spec or the spec specified by spec ID. I think the schema used in the current spec is not the same one as the original schema of the old manifest file, because we rewrite the partition spec when updating the table schema. Please correct me if I am wrong.

ConeyLiu • Aug 10 '22 03:08

Yea you are right, it seems it will set the latest schema of each spec on the rewritten manifests, so the information is lost if you evolve schemas within a spec.

szehon-ho • Aug 10 '22 17:08

@szehon-ho,@rdblue Any update here?

chenjunjiedada • May 04 '23 03:05

Will this mean all evaluator logic will have to change to be schema-specific? Is there a simple example of how this will be consumed?

aokolnychyi • May 05 '23 19:05

Thanks @szehon-ho @aokolnychyi

Will this mean all evaluator logic will have to change to be schema-specific? Is there a simple example of how this will be consumed?

We don't need to change the existing evaluators. We just need to create a new SchemaEvaluator from the filter expression, and use it to evaluate the schema that each file was written with. You can refer to this snippet:

return CloseableIterable.filter(
    open(projection(fileSchema, fileProjection, projectColumns, caseSensitive)),
    entry -> {
      boolean keep = entry != null;
      if (keep && schemaEvaluator != null && entry.file().schemaId() > -1) {
        // evaluate the filter against the schema the file was written with
        keep = schemaEvaluator.eval(schemasById.get(entry.file().schemaId()));
      }

      return keep &&
          evaluator.eval(entry.file().partition()) &&
          metricsEvaluator.eval(entry.file()) &&
          inPartitionSet(entry.file());
    });

ConeyLiu • May 06 '23 00:05
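To make the pattern concrete, here is a deliberately simplified, hypothetical stand-in for the SchemaEvaluator (the real one in the PR evaluates Iceberg Expression trees): a predicate that requires a non-null value from a column cannot match rows in a file written before that column existed, since the column is all-null in that file.

import java.util.Set;

// Simplified sketch, not the PR's actual SchemaEvaluator.
class SchemaEvaluatorSketch {
  // Columns the filter compares against non-null values,
  // e.g. {"email"} for the filter email = 'x@y.com'.
  private final Set<String> referencedColumns;

  SchemaEvaluatorSketch(Set<String> referencedColumns) {
    this.referencedColumns = referencedColumns;
  }

  // Returns true if rows in a file written with the given columns might match.
  boolean eval(Set<String> fileSchemaColumns) {
    // A column missing from the file's write-time schema is all-null there,
    // so a non-null predicate on it cannot match: the file can be skipped.
    return fileSchemaColumns.containsAll(referencedColumns);
  }
}

For example, if email was added in schema 2, eval(Set.of("id", "name")) returns false for files written with schema 1, so those files can be pruned without even reading their column metrics.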

Hi @rdblue @szehon-ho @aokolnychyi do you have any time to look at this again?

ConeyLiu • Jul 19 '23 14:07

Hi @rdblue @szehon-ho @aokolnychyi @RussellSpitzer @nastra, could you help to review this? This should be useful for tables with frequent column additions or deletions.

ConeyLiu • Sep 06 '23 11:09