iWorkFileFormat

Identify file type

Open leochou0729 opened this issue 5 years ago • 5 comments

Hello, I wonder if there is an easy way to identify the iWork file type, because the file command can only tell that it's a zip archive. I don't need to extract the file content. Any suggestions?

leochou0729 avatar Sep 27 '19 09:09 leochou0729

Besides the file extensions?

ccharlton avatar Sep 27 '19 20:09 ccharlton

In Objective-C/Swift you could also use NSWorkspace's type(ofFile:) method. However, for me this method proved to be quite unreliable, and like ccharlton I prefer using the extension.

ediathome avatar Sep 30 '19 14:09 ediathome

Checking the file extension is a little weak. I want to prevent sensitive information from leaking outside, so I need to route each type of file to the proper recognition engine. Someone can simply remove the file extension or change it to something else to circumvent examination. NSWorkspace's type(ofFile:) method seems to behave like the file command: it only reports that the file is a zip archive.

leochou0729 avatar Oct 08 '19 01:10 leochou0729

This sounds tricky. I guess you will need to dig into the file, e.g. to check if the contents of the zip archive conform to the file format. Maybe a good starting point:

https://developer.apple.com/library/archive/documentation/FileManagement/Conceptual/FileSystemProgrammingGuide/FileSystemOverview/FileSystemOverview.html

This thread on using the mdls command may also be interesting:

https://superuser.com/questions/323599/is-it-possible-to-query-the-launch-services-database-for-applications-that-will

ediathome avatar Oct 14 '19 17:10 ediathome

The only way to determine the file format is to parse speculatively. A document still opens correctly in its application after all of the plists are blanked and the non-.iwa files are stripped, so those files cannot be relied on for detection.

In /Index/Document.iwa the root DocumentArchive message is always of type 1 (and always message index 1). This message is sufficient.
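
As a first step, the raw bytes of that file can be pulled out of the container. The sketch below is a minimal Python illustration, assuming the document is a single-file zip archive (as the file command reports earlier in this thread). The helper name `read_document_iwa` is made up for this example, and the returned bytes are still framed and compressed, so they are not yet a protobuf message.

```python
import zipfile

# Member path taken from this thread; everything else here is illustrative.
DOCUMENT_MEMBER = "Index/Document.iwa"


def read_document_iwa(source):
    """Return the raw (still framed) bytes of Index/Document.iwa, or None.

    `source` may be a filesystem path or a binary file-like object.
    None means the input is not a zip archive or lacks the member,
    i.e. it does not look like an iWork container at all.
    """
    try:
        with zipfile.ZipFile(source) as container:
            return container.read(DOCUMENT_MEMBER)
    except (zipfile.BadZipFile, KeyError):
        # BadZipFile: not a zip archive; KeyError: member not present.
        return None
```

Because the check ignores the file name entirely, a renamed or extension-less file is handled the same way as one with its original extension.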

In the 11.2 apps, the required fields are:

// Keynote optional fields 4
message DocumentArchive {
  required .TSA.DocumentArchive super = 3;
  required .TSP.Reference show = 2;
}

// Numbers optional fields 1, 3, 7, 9, 10, 11, 12
message DocumentArchive {
  required .TSA.DocumentArchive super = 8;
  required .TSP.Reference stylesheet = 4;
  required .TSP.Reference sidebar_order = 5;
  required .TSP.Reference theme = 6;
}

// Pages optional fields 2 - 7, 11 - 14, 16, 17, 20, 21, 30 - 49
message DocumentArchive {
  required .TSA.DocumentArchive super = 15;
}

So the following suffices:

- find `Index/Document.iwa` in the container, de-frame it, and find message 1 of type 1
- do a shallow parse of the protobuf message:
  - if field 15 is present: file is of type "PAGES"
  - else if field 2 is present: file is of type "KEYNOTE"
  - else: file is of type "NUMBERS"
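
The decision steps above can be sketched in Python as follows. This is a minimal illustration, not code from any iWork tool. It assumes the payload of message 1 has already been de-framed and extracted from `Index/Document.iwa` (that step is not shown), and the function names `read_varint`, `top_level_fields` and `classify_document` are hypothetical. It only walks the standard protobuf wire format at the top level, which is what a shallow parse amounts to.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def top_level_fields(message):
    """Collect top-level field numbers without decoding nested messages."""
    fields = set()
    pos = 0
    while pos < len(message):
        key, pos = read_varint(message, pos)
        field_number, wire_type = key >> 3, key & 7
        fields.add(field_number)
        if wire_type == 0:        # varint value
            _, pos = read_varint(message, pos)
        elif wire_type == 1:      # fixed 64-bit value
            pos += 8
        elif wire_type == 2:      # length-delimited (nested message, string, bytes)
            length, pos = read_varint(message, pos)
            pos += length         # skip the body: this is the "shallow" part
        elif wire_type == 5:      # fixed 32-bit value
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    if pos != len(message):
        raise ValueError("truncated message")
    return fields


def classify_document(message):
    """Apply the field checks from this thread to a raw DocumentArchive message."""
    fields = top_level_fields(message)
    if 15 in fields:
        return "PAGES"
    if 2 in fields:
        return "KEYNOTE"
    return "NUMBERS"
```

The order of the checks matters: field 2 is listed as optional for Pages, so field 15 has to be tested first. Note that the final branch is a fallback and does not confirm anything by itself; a stricter variant could additionally require field 8, which the listing above marks as required for Numbers.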

@obriensp feel free to add this to the README if you are still updating / interested. Most of the iWork file format ecosystem projects still refer to the notes.

SheetJSDev avatar Mar 25 '22 20:03 SheetJSDev