iceberg Spec: Inconsistency around files

When building the Manifest mappers for Python, @rdblue noticed that the added_data_files_count should be added_files_count according to the spec.

However, this field is written in Java as added_data_files_count to Avro by the Java implementation:
https://github.com/apache/iceberg/blob/8104769f81ba79338fd3c94d5bd9267f22d31ed7/api/src/main/java/org/apache/iceberg/ManifestFile.java#L44-L49

Luckily this doesn't affect the reading/writing because it is position based.

However, it is confusing. I think we should resolve this. We could either do this by changing the Java impl, which probably works, but we could also change the spec. I know that this isn't something lightweight, but we could take it into consideration.

I think we should also update the references in the code to {added,existing,deleted}_data_files_count to make everything consistent and avoid confusion in the future.

Jul 22 '22 10:07 Fokko

@Fokko, sorry, but I think I was wrong in my initial review. I think instead of renaming these, we actually want to rename the fields that we write so that they match the spec. Because we use the same column for both DataFile and DeleteFile to count the number of ADDED rows, we should use a neutral name so added_files_count is correct.

Looks like the problem here is that the original added_data_files_count is used instead of added_files_count, not that the variable names don't match.

This is what I meant by this comment:

I think we can also update the field names. Those won't break anything because the only names that matter are the ones in the read schema.

I'm not sure why I thought the constant renames looked good though, since that would rename the constants to match the out-of-date manifest field names, which we should update instead.

Aug 07 '22 22:08 rdblue

Thanks for the review @rdblue 🙌

Oct 05 '23 19:10 Fokko

Spec: Inconsistency around files_count