
Core: Add schema_id to ContentFile/ManifestFile

ConeyLiu opened this issue 2 years ago • 14 comments

This is the first part of #4842. It adds the schema ID to DataFile/DeleteFile/ManifestFile, which can then be used to evaluate filter expressions against the schema that each file was written with.

ConeyLiu • May 30 '22 06:05

Hi @rdblue @kbendick, could you help to review this when you are free? Thanks a lot.

ConeyLiu • May 30 '22 06:05

@ConeyLiu, thanks for breaking this up, but this is still far too large for a single PR. There's no need to update all of the writers in a single PR. Instead, this should focus on core and API classes and just read null values for the new field. You can add writes later.

rdblue • May 30 '22 15:05
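A minimal sketch of the API direction described above, assuming a nullable accessor named schemaId() on the file interface (the name and placement here are hypothetical; the PR defines the real ones):

// Hypothetical interface fragment, not the real ContentFile: the accessor is
// nullable so metadata written before the field existed reads back as null.
public interface ContentFileSketch {
  /** Returns the id of the schema this file was written with, or null if unknown. */
  Integer schemaId();
}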

Hi @rdblue, thanks for the review. I have reverted the changes for the writers. Please take another look, thanks a lot.

ConeyLiu • May 31 '22 10:05

Is this something we can do, or do we need to wait until V3?

@szehon-ho, this is something we can do because it is backward compatible. Older readers will ignore new fields, so we can add it safely. And if we don't have a schema ID for a file, then we just return null and skip the optimization.

rdblue • Jun 07 '22 22:06
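A minimal sketch of how such a backward-compatible field can be declared with Iceberg's type API, assuming the usual optional-field pattern (the field id 142 and the constant name below are made up for illustration; the PR assigns the real ones):

import org.apache.iceberg.types.Types;

class SchemaIdFieldSketch {
  // Optional, so older readers ignore the field and newer readers see null
  // for manifests written before it existed. Field id 142 is hypothetical.
  static final Types.NestedField SCHEMA_ID =
      Types.NestedField.optional(142, "schema_id", Types.IntegerType.get());
}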

A little busy recently. Will address the comments tomorrow.

ConeyLiu • Jun 09 '22 13:06

Thanks, @rdblue @szehon-ho for the review. Comments have been addressed.

ConeyLiu • Jun 10 '22 14:06

I wonder, is it possible to add a test to try to deserialize an older manifest entry without schema_id?

It seems we would need to implement customized V1Writer/V2Writer classes and the corresponding Indexed classes. Do you have any suggestions for this?

ConeyLiu • Jun 11 '22 13:06

Yea there is some limited discussion in this related issue, but I guess no good conclusions: https://github.com/apache/iceberg/issues/2542.

Maybe a test writer that creates metadata files with all optional columns as null? That way we can test all the new columns at once.

By the way, the change mostly looks good, and it's good you made the new field optional to avoid issues like the one mentioned. I just don't know enough about the serialization/deserialization code to be sure there are no other problems with previously serialized metadata, so I was hoping to have a test to verify it. Though we can leave it to @rdblue to approve if he is confident, and we can tackle the backward-compat tests on the side.

szehon-ho • Jun 13 '22 21:06

Thanks @szehon-ho for the review and suggestion.

Maybe a test writer that creates metadata files with all optional columns as null? That way we can test all the new columns at once.

I will add the test later.

ConeyLiu • Jun 15 '22 12:06
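For reference, a minimal sketch of the compatibility property being tested, using plain Avro rather than Iceberg's internal V1Writer/V2Writer classes (the record shape is made up for illustration): a reader schema that adds an optional schema_id with a null default can still deserialize records written without the field.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaIdCompatSketch {
  public static void main(String[] args) throws Exception {
    // Old writer schema: no schema_id field.
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"data_file\",\"fields\":["
            + "{\"name\":\"file_path\",\"type\":\"string\"}]}");
    // New reader schema: optional schema_id defaulting to null.
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"data_file\",\"fields\":["
            + "{\"name\":\"file_path\",\"type\":\"string\"},"
            + "{\"name\":\"schema_id\",\"type\":[\"null\",\"int\"],\"default\":null}]}");

    GenericData.Record record = new GenericData.Record(v1);
    record.put("file_path", "s3://bucket/data/00000.parquet");

    // Serialize with the old schema, as a pre-upgrade writer would.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericData.Record>(v1).write(record, encoder);
    encoder.flush();

    // Deserialize with the new schema, as an upgraded reader would.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericData.Record read =
        new GenericDatumReader<GenericData.Record>(v1, v2).read(null, decoder);

    // The missing field resolves to its default, so old metadata reads back
    // with schema_id == null and the reader can simply skip the optimization.
    System.out.println(read.get("schema_id")); // prints: null
  }
}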

Hi @rdblue @szehon-ho, I am sorry for the late update. The compatibility test has been added. Hopefully you can take another look when you are free.

ConeyLiu • Jun 29 '22 12:06

Hi @szehon-ho thanks for reviewing this again.

Also I noticed, the spec-id and schema are already written in the header of each manifest. As far as I can tell, it seems to be the right one that the manifest was written in, even after rewriteManifests. Wondering at a high level, is it adequate for the optimization you are planning?

What happens after data files are rewritten? It seems the schema of the manifest file is the current table schema, yet the manifest entries in the same manifest file could have been written with different schemas before the rewrite.

ConeyLiu • Aug 09 '22 14:08

@ConeyLiu that's a good question. I think (may be wrong) rewriteDataFiles groups files by partition/partition spec, and may not preserve the old schemas. I.e., all the data files are rewritten with the latest schema of that partition spec.

I think the situation would be the same even in your proposal to add a new schema_id field to data_file, right? After rewriteDataFiles we have to carry over the latest schema-id of each spec in order for your initially proposed optimization to be accurate, because there may be data in the new file that was written by a later schema.

szehon-ho • Aug 09 '22 17:08

I think the situation would be the same even in your proposal to add a new schema_id field to data_file, right? After rewriteDataFiles we have to carry over the latest schema-id of each spec in order for your initially proposed optimization to be accurate, because there may be data in the new file that was written by a later schema.

You are correct. The data files are written with the new spec after rewrite, so we cannot benefit from the schema evaluation because we lose the original schema information.

As far as I can tell, it seems to be the right one that the manifest was written in, even after rewriteManifests.

In RewriteManifests, we use the current table partition spec or the spec specified by spec ID. I think the schema used in the current spec is not the same one as the original schema of the old manifest file, because we rewrite the partition spec when updating the table schema. Please correct me if I am wrong.

ConeyLiu • Aug 10 '22 03:08

Yea you are right, it seems it will set the latest schema of each spec on the rewritten manifests, so the information is lost if you evolve schemas within a spec.

szehon-ho • Aug 10 '22 17:08

@szehon-ho,@rdblue Any update here?

chenjunjiedada • May 04 '23 03:05

Will this mean all evaluator logic will have to change to be schema-specific? Is there a simple example of how this will be consumed?

aokolnychyi • May 05 '23 19:05

Thanks @szehon-ho @aokolnychyi

Will this mean all evaluator logic will have to change to be schema-specific? Is there a simple example of how this will be consumed?

We don't need to change the existing evaluators. We just need to create a new SchemaEvaluator from the filter expression, and use it to evaluate the schema that each file was written with. You can refer to this snippet:

return CloseableIterable.filter(
    open(projection(fileSchema, fileProjection, projectColumns, caseSensitive)),
    entry -> {
      boolean keep = entry != null;
      if (keep && schemaEvaluator != null && entry.file().schemaId() > -1) {
        // evaluate the filter against the schema the file was written with
        keep = schemaEvaluator.eval(schemasById.get(entry.file().schemaId()));
      }

      return keep &&
          evaluator.eval(entry.file().partition()) &&
          metricsEvaluator.eval(entry.file()) &&
          inPartitionSet(entry.file());
    });

ConeyLiu • May 06 '23 00:05
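To make the pattern concrete, here is a deliberately simplified, hypothetical stand-in for the SchemaEvaluator (the real one in the PR evaluates Iceberg Expression trees): a predicate that requires a non-null value from a column cannot match rows in a file written before that column existed, since the column is all-null in that file.

import java.util.Set;

// Simplified sketch, not the PR's actual SchemaEvaluator.
class SchemaEvaluatorSketch {
  // Columns the filter compares against non-null values,
  // e.g. {"email"} for the filter email = 'x@y.com'.
  private final Set<String> referencedColumns;

  SchemaEvaluatorSketch(Set<String> referencedColumns) {
    this.referencedColumns = referencedColumns;
  }

  // Returns true if rows in a file written with the given columns might match.
  boolean eval(Set<String> fileSchemaColumns) {
    // A column missing from the file's write-time schema is all-null there,
    // so a non-null predicate on it cannot match: the file can be skipped.
    return fileSchemaColumns.containsAll(referencedColumns);
  }
}

For example, if email was added in schema 2, eval(Set.of("id", "name")) returns false for files written with schema 1, so those files can be pruned without even reading their column metrics.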

Hi @rdblue @szehon-ho @aokolnychyi do you have any time to look at this again?

ConeyLiu • Jul 19 '23 14:07

Hi @rdblue @szehon-ho @aokolnychyi @RussellSpitzer @nastra, could you help to review this? This should be useful for tables with frequent column additions or deletions.

ConeyLiu • Sep 06 '23 11:09