metadata-extractor File identification after file signature

Open payton opened this issue 6 years ago • 3 comments

This issue was mentioned in #288, where some TIFF-based file formats have the same signature (DNG treated as ARW). ExifTool handles these first as generic TIFF files and then later changes the file format once it finds a property that reliably identifies it. The same issue also comes up in file formats that use ZIP.

It may be worthwhile to include the FileType within Metadata so that it can be accessed, and updated, throughout the extraction process. Any thoughts/opinions on this issue and how it would be implemented?

Sep 21 '17 00:09 payton

As implemented, metadata-extractor needs to know the type early to decide which concrete objects to create. If parsed data changes the Directory you originally thought should be used, it's implied you would need to change the Directory object afterwards. That seems difficult in the current implementation; you would have to create the new directory, copy the values, and remove the old directory after it was already in the tree. Whether this is feasible in the current implementation really depends on how deep into the parsing process you have to go to figure out the real type.
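To make the "create, copy, remove" step concrete, here is a rough sketch. It does not use the real metadata-extractor classes: `SketchDirectory` and `DirectorySwapSketch` are hypothetical stand-ins, and the "tree" is flattened to a plain list for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a directory: a name plus tag id -> value pairs.
class SketchDirectory {
    final String name;
    final Map<Integer, Object> tags = new LinkedHashMap<>();

    SketchDirectory(String name) {
        this.name = name;
    }
}

// Hypothetical helper showing the swap described above.
class DirectorySwapSketch {
    /** Replaces 'old' in the list with a new directory of the given name, carrying its tags over. */
    static SketchDirectory replace(List<SketchDirectory> tree, SketchDirectory old, String newName) {
        int index = tree.indexOf(old);
        if (index < 0) {
            throw new IllegalArgumentException("directory is not in the tree");
        }
        SketchDirectory replacement = new SketchDirectory(newName);
        replacement.tags.putAll(old.tags);   // copy the values
        tree.set(index, replacement);        // remove the old one, keep its position
        return replacement;
    }

    public static void main(String[] args) {
        List<SketchDirectory> tree = new ArrayList<>();
        SketchDirectory generic = new SketchDirectory("Generic TIFF");
        generic.tags.put(0x0100, 4000);      // ImageWidth, value made up for the demo
        tree.add(generic);

        SketchDirectory specific = replace(tree, generic, "Vendor RAW");
        System.out.println(tree.get(0).name + " " + specific.tags);
    }
}
```

The swap itself is short. The hard part in a real tree is everything the sketch leaves out: any other object that still holds a reference to the old directory would have to be updated as well.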

Maybe I'm thinking about this wrong, so grain of salt. Perhaps determining the types doesn't necessarily mean a directory change after the fact is needed, but if even one does I'm not sure how you handle it.

Sep 21 '17 01:09 kwhopper

What is certain is that the current ARW detection is wrong, see #217. Many of the RAW formats are in fact custom TIFF implementations, and will as such have the same magic bytes as TIFF. As I see it, those formats should be considered generic TIFFs at first, and then differentiated after further inspection.

Sep 21 '17 01:09 Nadahar

I've been looking more into this recently, specifically for TIFFs. ExifTool's logic for differentiating file types seems somewhat unsettling as there are a lot of strange conditions. For example, there is an if(identifier is 0x2a and offset is greater than or equal to 16) followed by else if(identifier is 0x55 and file type is x or y or z). Note that I am not familiar with Perl, but I tried to interpret it to the best of my abilities. The main ExifTool.pm file is nearly 8000 lines, so it can be a bit confusing.

The documentation that blauwers posted in the issue Nadahar referenced states:

0002h     1 word   TIFF "version number". This version number
                   never changed and the value (42) was choosen
                   for its deep philosophical value. In fact, if
                   the version number ever changes, this means
                   that radical changes to the TIFF format have
                   been made, and a TIFF reader should give up
                   immediately.
                   You can consider this word to be a part of the
                   header ID.

This leads me to think we should only differentiate TIFF-based file types by the version number (and NOT the offset).
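As a rough illustration of reading that word, here is a minimal sketch. `TiffHeaderSketch` is a hypothetical helper, not part of metadata-extractor. The marker values 0x0055 (Panasonic) and 0x4F52/0x5352 (Olympus) are what I believe ExifTiffHandler.java checks for, so treat them as assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only.
class TiffHeaderSketch {
    /** Returns the 16-bit word at offset 2, read in the byte order the header declares. */
    static int readVersionWord(byte[] header) {
        if (header.length < 4) {
            throw new IllegalArgumentException("need at least 4 header bytes");
        }
        ByteOrder order;
        if (header[0] == 'I' && header[1] == 'I') {
            order = ByteOrder.LITTLE_ENDIAN;   // 0x4949
        } else if (header[0] == 'M' && header[1] == 'M') {
            order = ByteOrder.BIG_ENDIAN;      // 0x4D4D
        } else {
            throw new IllegalArgumentException("not a TIFF-style header");
        }
        // Mask to treat the signed short as an unsigned 16-bit value.
        return ByteBuffer.wrap(header).order(order).getShort(2) & 0xFFFF;
    }

    /** Maps the version word to a first guess; marker values are assumptions (see above). */
    static String describe(int versionWord) {
        switch (versionWord) {
            case 0x002A: return "standard TIFF (could still be DNG, ARW, ...)";
            case 0x0055: return "Panasonic RAW candidate";
            case 0x4F52:
            case 0x5352: return "Olympus RAW candidate";
            default:     return "unknown";
        }
    }

    public static void main(String[] args) {
        byte[] littleEndian = {'I', 'I', 0x2A, 0x00};
        byte[] bigEndian = {'M', 'M', 0x00, 0x2A};
        System.out.println(describe(readVersionWord(littleEndian)));
        System.out.println(describe(readVersionWord(bigEndian)));
    }
}
```

Note that 42 only narrows things down to "some TIFF flavour", so a later stage is still needed for formats that keep the standard version number.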

So this is what I see the logic for TIFF identification as being:

1. Detect TIFF file w/ FileTypeDetector based on 0x4949 and 0x4d4d ("II" and "MM" respectively)
2. Upon reading the version number (called 'tiffMarker' in TiffReader.java), begin changing the file type (Olympus and Panasonic have specified tiff markers in ExifTiffHandler.java)
3. Further file type changes can occur upon encountering makernotes

DNG would be a special case where the only way to determine that file type is the existence of a DNG Version tag.
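Here is a minimal sketch of how those stages could fit together, with the file type held in one mutable place that each stage may refine. Everything in it is hypothetical: `SketchFileType` and `FileTypeTracker` are not existing classes, the marker values are the ones I believe ExifTiffHandler.java uses, and mapping a Sony makernote straight to ARW is a deliberately crude assumption. The DNGVersion tag id (0xC612) comes from the DNG specification.

```java
import java.util.Locale;

// Hypothetical types for illustration; not the metadata-extractor API.
enum SketchFileType { TIFF, ORF, RW2, DNG, ARW }

class FileTypeTracker {
    static final int TAG_DNG_VERSION = 0xC612;

    // Stage 1: the "II"/"MM" signature matched, so start as generic TIFF.
    private SketchFileType current = SketchFileType.TIFF;

    SketchFileType current() {
        return current;
    }

    // Stage 2: refine from the version word ("tiffMarker").
    void onTiffMarker(int marker) {
        switch (marker) {
            case 0x4F52:
            case 0x5352:
                current = SketchFileType.ORF;
                break;
            case 0x0055:
                current = SketchFileType.RW2;
                break;
            default:
                break; // 0x002A and anything else: stay generic for now
        }
    }

    // Special case: a DNGVersion tag identifies DNG regardless of vendor.
    void onTag(int tagId) {
        if (tagId == TAG_DNG_VERSION) {
            current = SketchFileType.DNG;
        }
    }

    // Stage 3: a makernote may refine a type that is still generic.
    void onMakernote(String cameraMake) {
        if (current == SketchFileType.TIFF
                && cameraMake != null
                && cameraMake.toUpperCase(Locale.ROOT).startsWith("SONY")) {
            current = SketchFileType.ARW; // crude assumption, for the demo only
        }
    }

    public static void main(String[] args) {
        FileTypeTracker tracker = new FileTypeTracker();
        tracker.onTiffMarker(0x002A);
        tracker.onTag(TAG_DNG_VERSION);
        tracker.onMakernote("SONY");
        System.out.println(tracker.current());
    }
}
```

Because the makernote stage only touches a type that is still generic, a DNG that carries a Sony makernote stays DNG here, which is the misdetection from #288 that this ordering is meant to avoid.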

Does it seem right to take a different path than ExifTool, or should we try to mimic ExifTool's logic? Please don't hesitate to voice any concerns/criticisms with this method.

Nov 14 '17 23:11 payton
