Core: Add schema_id to ContentFile/ManifestFile
This is the first part of #4842. It adds the schema ID to DataFile/DeleteFile/ManifestFile, which can be used to evaluate the filter expression based on the schema a file was written with.
Hi @rdblue @kbendick, could you help to review this when you are free? Thanks a lot.
@ConeyLiu, thanks for breaking this up, but this is still far too large for a single PR. There's no need to update all of the writers in a single PR. Instead, this should focus on core and API classes and just read null values for the new field. You can add writes later.
Hi @rdblue, thanks for the review. I have reverted the changes for the writers. Please take another look, thanks a lot.
Is this something we can do or need to wait until V3?
@szehon-ho, this is something we can do because it is backward compatible. Older readers will ignore new fields, so we can add it safely. And if we don't have a schema ID for a column, then we just return null and skip the optimization.
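For what it's worth, here is a minimal sketch of how such a column could be declared as an optional field in the file metadata schema, so that old readers and old manifests are unaffected. The field ID, name, and doc string below are illustrative only, not necessarily what this PR uses:

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class SchemaIdFieldSketch {
  // hypothetical field ID, chosen for illustration only
  static final Types.NestedField SCHEMA_ID =
      Types.NestedField.optional(142, "schema_id", Types.IntegerType.get(),
          "ID of the table schema the file was written with");

  public static void main(String[] args) {
    // an optional field resolves to null for metadata written before the field existed
    Schema fileSchema = new Schema(
        Types.NestedField.required(100, "file_path", Types.StringType.get()),
        SCHEMA_ID);
    System.out.println(fileSchema);
  }
}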
A little busy recently. Will address the comments tomorrow.
Thanks, @rdblue @szehon-ho for the review. Comments have been addressed.
I wonder, is it possible to add a test to try to deserialize an older manifest entry without schema_id?
It seems we would need to implement customized V1Writer/V2Writer and those Indexed classes. Is there any suggestion for this?
Yea there is some limited discussion in this related issue, but I guess no good conclusions: https://github.com/apache/iceberg/issues/2542.
Maybe a test writer that creates metadata files with all optional columns as null? That way can test all the new columns at once.
By the way, the change mostly looks good, and it's good that you made the new field optional to avoid issues like the one mentioned. I just don't know enough about the serialization/deserialization code to be sure whether there are any other problems with previously serialized metadata, so I was hoping to have a test to verify it. Though I can leave it to @rdblue to approve if he is confident, and we can tackle the backward-compat tests on the side.
Thanks @szehon-ho for the review and suggestion.
Maybe a test writer that creates metadata files with all optional columns as null? That way can test all the new columns at once.
I will add the test later.
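In the meantime, here is a rough sketch of the property such a test could verify, using plain Avro schema resolution rather than the actual Iceberg manifest writers (the record and field names are made up for illustration): when the reader schema adds an optional, default-null field that the writer schema did not have, old data resolves with that field as null.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class OptionalFieldCompatSketch {
  public static void main(String[] args) throws Exception {
    // writer schema: the "old" layout without the new column
    Schema writer = SchemaBuilder.record("data_file").fields()
        .requiredString("file_path")
        .endRecord();
    // reader schema: the "new" layout with an optional (default null) schema_id
    Schema reader = SchemaBuilder.record("data_file").fields()
        .requiredString("file_path")
        .optionalInt("schema_id")
        .endRecord();

    GenericRecord oldRecord = new GenericData.Record(writer);
    oldRecord.put("file_path", "s3://bucket/data/00000.parquet");

    // serialize with the old schema
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(oldRecord, encoder);
    encoder.flush();

    // deserialize with the new schema resolved against the old writer schema
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord resolved =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, decoder);

    // old data resolves with the new field as null, so readers can skip the optimization
    System.out.println(resolved.get("schema_id")); // null
  }
}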
Hi @rdblue @szehon-ho, I am sorry for the late update. The compatibility test has been added. Hopefully you can take another look when you are free.
Hi @szehon-ho thanks for reviewing this again.
Also I noticed that the spec-id and schema are already written in the header of each manifest. As far as I can tell, it seems to be the right one that the manifest was written with, even after rewriteManifests. Wondering at a high level, is that adequate for the optimization you are planning?
What happens after data files are rewritten? It seems like the schema of the manifest file is the current table schema, and the manifest entries in the same manifest file could have different schemas after a rewrite.
@ConeyLiu that's a good question, I think (may be wrong) rewriteDataFiles groups files by partition/partition spec and may not preserve the old schemas, i.e., all the data files are rewritten with the latest schema of that partition spec.
I think the situation would be the same even in your proposal to add a new schema_id field to data_file, right? After rewriteDataFiles we have to carry over the latest schema ID of each spec in order for your initially proposed optimization to be accurate, because there may be data in the new file that was written with a later schema.
I think the situation would be the same even in your proposal to add a new schema_id field to data_file, right? After rewriteDataFiles we have to carry over the latest schema ID of each spec in order for your initially proposed optimization to be accurate, because there may be data in the new file that was written with a later schema.
You are correct. After the rewrite, the data file is written with the new spec and schema, so we cannot benefit from the schema evaluation because the original schema information is lost.
As far as I can tell, it seems to be the right one that the manifest was written in, even after rewriteManifests.
In RewriteManifests, we use the current table partition spec or the spec specified by spec ID. I think the schema used for the current spec is not the same as the original schema of the old manifest file, because we rewrite the partition spec when updating the table schema. Please correct me if I am wrong.
Yea you are right, it seems it will set the latest schema of each spec on the rewritten manifests, so the information is lost if you evolve schemas within a spec.
@szehon-ho @rdblue, any update here?
Will this mean all evaluator logic will have to change to be schema specific? Is there a simple example how this will be consumed?
Thanks @szehon-ho @aokolnychyi
Will this mean all evaluator logic will have to change to be schema specific? Is there a simple example how this will be consumed?
We don't need to change the existing evaluators. We just need to create a new SchemaEvaluator from the filter expression and schema, and use it to evaluate the file. You can refer to the snippet below:
return CloseableIterable.filter(
    open(projection(fileSchema, fileProjection, projectColumns, caseSensitive)),
    entry -> {
      boolean keep = entry != null;
      if (keep && schemaEvaluator != null && entry.file().schemaId() > -1) {
        // evaluate based on the schemaId
        keep = schemaEvaluator.eval(schemasById.get(entry.file().schemaId()));
      }
      return keep &&
          evaluator.eval(entry.file().partition()) &&
          metricsEvaluator.eval(entry.file()) &&
          inPartitionSet(entry.file());
    });
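The SchemaEvaluator used in the snippet above isn't shown here; a very rough sketch of the idea behind it, under the assumption that the filter requires a non-null value for some column, might look like this (class and field names are illustrative, not the PR's actual implementation):

import org.apache.iceberg.Schema;

class SchemaEvaluatorSketch {
  // hypothetical: the single column the filter requires to be non-null
  private final String requiredColumn;

  SchemaEvaluatorSketch(String requiredColumn) {
    this.requiredColumn = requiredColumn;
  }

  // returns true if a file written with the given schema could contain matching rows;
  // if the column did not exist in that schema, all of its values are null, so the file can be skipped
  boolean eval(Schema fileSchema) {
    return fileSchema != null && fileSchema.findField(requiredColumn) != null;
  }
}

The schemasById map used above could plausibly come from Table.schemas(), which exposes the table's known schemas keyed by schema ID.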
Hi @rdblue @szehon-ho @aokolnychyi, do you have any time to look at this again?
Hi @rdblue @szehon-ho @aokolnychyi @RussellSpitzer @nastra, could you help to review this? This should be useful for tables with frequent column additions or deletions.