docconv Support .pages files

Open mish15 opened this issue 9 years ago • 11 comments

Can we easily support the .pages extension?

Apr 06 '15 01:04 mish15

Not as straightforward as first thought. It is just a zip file, but inside there are Apple's IWA files rather than XML. IWA files are a protobuf stream compressed with snappy, sort of.
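
To illustrate the "it is just a zip file" point, here is a minimal sketch that lists the `.iwa` entries of an archive using only the Go standard library. The entry names in `sampleArchive` are illustrative assumptions, not taken from a real .pages file, and `iwaEntries` is a hypothetical helper rather than anything in docconv.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"strings"
)

// iwaEntries returns the names of all entries in the zip archive held in
// data whose names end in ".iwa".
func iwaEntries(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var names []string
	for _, f := range zr.File {
		if strings.HasSuffix(f.Name, ".iwa") {
			names = append(names, f.Name)
		}
	}
	return names, nil
}

// sampleArchive builds a tiny in-memory zip that stands in for a .pages
// file. The entry names are assumptions made up for this example.
func sampleArchive() ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, name := range []string{"Index/Document.iwa", "Metadata/Properties.plist"} {
		w, err := zw.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte("placeholder")); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := sampleArchive()
	if err != nil {
		panic(err)
	}
	names, err := iwaEntries(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

Opening the zip is the easy part; the contents of each `.iwa` entry still need the snappy and protobuf handling discussed in the rest of this thread.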

http://stackoverflow.com/questions/27454317/decompressing-snappy-files-missing-stream-identifier-chunk-and-crc-32c-checksum

https://github.com/google/protobuf https://code.google.com/p/snappy-go/

Apr 26 '15 09:04 oprimus

The snappy-go implementation doesn't seem to be compatible with Apple's butchered implementation. I'm getting past the missing stream identifier by prepending it to the reader: snappy.NewReader(io.MultiReader(strings.NewReader("\xff\x06\x00\x00sNaPpY"), file))
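
For reference, the prepending trick can be sketched with the standard library alone. The ten magic bytes are the stream identifier chunk of the snappy framing format (chunk type 0xff, a 3-byte little-endian length of 6, then "sNaPpY"). `withStreamIdentifier` and `prefixed` are hypothetical helper names, and the body bytes are stand-ins rather than real IWA data.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// streamIdentifier is the stream identifier chunk of the snappy framing
// format: type 0xff, length 6 (little-endian, 3 bytes), then "sNaPpY".
const streamIdentifier = "\xff\x06\x00\x00sNaPpY"

// withStreamIdentifier returns a reader that yields the stream identifier
// followed by everything in r.
func withStreamIdentifier(r io.Reader) io.Reader {
	return io.MultiReader(strings.NewReader(streamIdentifier), r)
}

// prefixed is a convenience wrapper that applies withStreamIdentifier to an
// in-memory body and returns the combined bytes.
func prefixed(body []byte) ([]byte, error) {
	return io.ReadAll(withStreamIdentifier(bytes.NewReader(body)))
}

func main() {
	// Stand-in bytes for the body of an .iwa file (not real data).
	out, err := prefixed([]byte{0x01, 0x02, 0x03})
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", out)
}
```

This only supplies the missing header. A framed snappy reader will still expect a CRC-32C at the start of every compressed chunk, which these files reportedly do not have.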

The problem now appears to be that Apple is using the old COPY_4 tag which the snappy golang library doesn't support (as in it detects it and says "unsupported COPY_4 tag"). All other golang snappy libraries appear to be based on this one so don't support it either.

I've implemented the COPY_4 tag by porting it from implementations in other languages, in particular https://github.com/gray/compress-snappy/blob/master/src/csnappy_decompress.c. However, it now reports that the input is corrupt, so there must be something else wrong that I can't track down.
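
As a rough illustration of what supporting the 4-byte-offset copy involves, here is a minimal sketch of a raw snappy block decoder written from the public snappy format description. It is not the port mentioned above and not the snappy-go code; `decodeBlock` is a hypothetical name, and the sketch favours clarity over speed and does no output size limiting.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errCorrupt = errors.New("snappy: corrupt input")

// decodeBlock decodes one raw snappy block: a uvarint uncompressed length
// followed by literal and copy elements.
func decodeBlock(src []byte) ([]byte, error) {
	want, n := binary.Uvarint(src)
	if n <= 0 {
		return nil, errCorrupt
	}
	src = src[n:]
	var dst []byte
	for len(src) > 0 {
		tag := src[0]
		src = src[1:]
		var length, offset int
		switch tag & 0x03 {
		case 0x00: // literal
			length = int(tag>>2) + 1
			if length > 60 {
				// Values 61..64 mean the real length-1 follows in 1..4 bytes.
				extra := length - 60
				if len(src) < extra {
					return nil, errCorrupt
				}
				length = 0
				for i := extra - 1; i >= 0; i-- {
					length = length<<8 | int(src[i])
				}
				length++
				src = src[extra:]
			}
			if length <= 0 || length > len(src) {
				return nil, errCorrupt
			}
			dst = append(dst, src[:length]...)
			src = src[length:]
			continue
		case 0x01: // copy with 1-byte offset
			if len(src) < 1 {
				return nil, errCorrupt
			}
			length = 4 + (int(tag>>2) & 0x07)
			offset = int(tag>>5)<<8 | int(src[0])
			src = src[1:]
		case 0x02: // copy with 2-byte offset
			if len(src) < 2 {
				return nil, errCorrupt
			}
			length = 1 + int(tag>>2)
			offset = int(binary.LittleEndian.Uint16(src))
			src = src[2:]
		case 0x03: // copy with 4-byte offset (the COPY_4 tag)
			if len(src) < 4 {
				return nil, errCorrupt
			}
			length = 1 + int(tag>>2)
			offset = int(binary.LittleEndian.Uint32(src))
			src = src[4:]
		}
		if offset <= 0 || offset > len(dst) {
			return nil, errCorrupt
		}
		// Copy byte by byte because source and destination may overlap.
		for i := 0; i < length; i++ {
			dst = append(dst, dst[len(dst)-offset])
		}
	}
	if uint64(len(dst)) != want {
		return nil, errCorrupt
	}
	return dst, nil
}

func main() {
	// Hand-built block: length 9, literal "abc", then a COPY_4 of 6 bytes
	// from offset 3.
	block := []byte{0x09, 0x08, 'a', 'b', 'c', 0x17, 0x03, 0x00, 0x00, 0x00}
	out, err := decodeBlock(block)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", out)
}
```

The only difference between the 2-byte and 4-byte copy elements is the width of the offset field, so a "corrupt input" error after adding COPY_4 may well come from elsewhere, such as the framing or checksum handling.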

So far I haven't seen anyone successfully read these files, so they pretty much need to be treated as a proprietary file format.

If we want to progress I think the next step is to use the C implementation of snappy to see if that reads it. If it doesn't then I'm not sure where to go next.

Apr 26 '15 14:04 oprimus

Does this help? It seems to describe the same .iwa format: https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md

Looks like snappy is kind of followed, but not really.

"they do not include the required Stream Identifier chunk, and compressed chunks do not include a CRC-32C checksum. The stream is composed of contiguous chunks prefixed by a 4 byte header. The first byte indicates the chunk type, which in practice is always 0 for iWork, indicating a Snappy compressed chunk. The next three bytes are interpreted as a 24-bit little-endian integer indicating the length of the chunk. The 4 byte header is not included in the chunk length."

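The framing quoted above can be sketched as a small chunk splitter: a 1-byte type (expected to be 0), a 24-bit little-endian length, then that many payload bytes. `splitChunks` is a hypothetical helper and the payload bytes in `main` are placeholders, not real compressed data.

```go
package main

import (
	"errors"
	"fmt"
)

var errTruncated = errors.New("iwa: truncated chunk")

// splitChunks walks a stream of chunks, each prefixed by a 4-byte header
// (type byte plus 24-bit little-endian length), and returns the payload of
// every chunk. The header is not counted in the chunk length.
func splitChunks(data []byte) ([][]byte, error) {
	var chunks [][]byte
	for len(data) > 0 {
		if len(data) < 4 {
			return nil, errTruncated
		}
		if data[0] != 0x00 {
			return nil, fmt.Errorf("iwa: unexpected chunk type 0x%02x", data[0])
		}
		n := int(data[1]) | int(data[2])<<8 | int(data[3])<<16
		data = data[4:]
		if n > len(data) {
			return nil, errTruncated
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks, nil
}

func main() {
	// Two placeholder chunks with 2 and 1 payload bytes respectively.
	stream := []byte{
		0x00, 0x02, 0x00, 0x00, 0xaa, 0xbb,
		0x00, 0x01, 0x00, 0x00, 0xcc,
	}
	chunks, err := splitChunks(stream)
	if err != nil {
		panic(err)
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```

Each payload would then be handed to a raw snappy block decoder rather than a framed reader, since there is no stream identifier to consume and no checksum to strip.
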
Apr 26 '15 18:04 mish15

Any trace of where the corrupt error comes from? E.g. is it in the header read or in the chunk processing loop? You're hardcoding the decoded length in the stream identifier, which is the first check for corruption.

From what I can read it's definitely doable. It looks like it's in the snappy "framing format", not pure snappy, so it probably needs to be read and decoded in chunks instead of as a single block, as per https://code.google.com/p/snappy/source/browse/trunk/framing_format.txt

Can you upload the WIP branch?

Apr 26 '15 20:04 mish15

See commit 7ed3c56b0dd00c22c1c74e3339bd9509868a8334. The snappy decoder needs to be altered to disable checksums for this to work (see below). Other than that, it gets to the point where we can get the uncompressed stream and find the archive length of the first object. However, when trying to unmarshal the ArchiveInfo I get an "unexpected EOF".

 vi ~/go/src/code.google.com/p/snappy-go/snappy/decode.go

                 case chunkTypeCompressedData:
                        // Section 4.2. Compressed data (chunk type 0x00).
                        //if chunkLen < checksumSize {
                        //      r.err = ErrCorrupt
                        //      return 0, r.err
                        //}
                        buf := r.buf[:chunkLen]
                        if !r.readFull(buf) {
                                return 0, r.err
                        }
                        //checksum := uint32(buf[0]) | uint32(buf[1])<<8 | uint32(buf[2])<<16 | uint32(buf[3])<<24
                        //buf = buf[checksumSize:]

                        n, err := DecodedLen(buf)
                        if err != nil {
                                r.err = err
                                return 0, r.err
                        }
                        if n > len(r.decoded) {
                                r.err = ErrCorrupt
                                return 0, r.err
                        }
                        if _, err := Decode(r.decoded, buf); err != nil {
                                fmt.Println("decode error", err)
                                r.err = err
                                return 0, r.err
                        }
                        //if crc(r.decoded[:n]) != checksum {
                        //      fmt.Println("checksum")
                        //      r.err = ErrCorrupt
                        //      return 0, r.err
                        //}
                        r.i, r.j = 0, n
                        continue
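
For context on the "archive length of the first object" step, here is a minimal sketch of reading one length-prefixed message from the uncompressed stream. It assumes, as the iWorkFileFormat notes appear to describe, that the length is a protobuf-style varint placed directly before the message bytes. `readLengthPrefixed`, `firstMessage` and `maxMessageSize` are hypothetical, and the actual protobuf unmarshalling is left out.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// maxMessageSize is an arbitrary sanity limit chosen for this sketch.
const maxMessageSize = 1 << 24

// readLengthPrefixed reads a uvarint length followed by that many bytes.
func readLengthPrefixed(r *bufio.Reader) ([]byte, error) {
	n, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	if n > maxMessageSize {
		return nil, errors.New("message length exceeds sanity limit")
	}
	msg := make([]byte, n)
	if _, err := io.ReadFull(r, msg); err != nil {
		// io.ReadFull reports io.ErrUnexpectedEOF when the stream ends
		// partway through the message.
		return nil, err
	}
	return msg, nil
}

// firstMessage reads the first length-prefixed message from data.
func firstMessage(data []byte) ([]byte, error) {
	return readLengthPrefixed(bufio.NewReader(bytes.NewReader(data)))
}

func main() {
	// A complete message: length 3, then three bytes.
	msg, err := firstMessage([]byte{0x03, 'a', 'b', 'c'})
	fmt.Println(string(msg), err)

	// A declared length of 5 with only two bytes available.
	_, err = firstMessage([]byte{0x05, 'a', 'b'})
	fmt.Println(err)
}
```

A length prefix that promises more bytes than the decompressed data contains is one way to end up with an "unexpected EOF", for example if only part of the chunk stream was decoded. That is a guess at a possible cause, not a diagnosis of the error reported here.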

Apr 27 '15 02:04 oprimus

The snappy tests are failing (no doubt due to the changes you mention here not being compatible with the tests). I have marked the failing tests to be skipped for the moment, but we really need to fix this.

Sep 26 '15 22:09 dhowden

I see that you handle three cases: a quickview PDF if one is available, XML, or the protobuf IWA.
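
A minimal sketch of how such a three-way dispatch might look, given the list of entry names in the archive. The entry names and the order of preference are assumptions for illustration and may not match what docconv actually checks; `pickStrategy` is a hypothetical helper.

```go
package main

import "fmt"

// Entry names below are assumptions made for this sketch.
const (
	previewPDF  = "QuickLook/Preview.pdf"
	indexXML    = "index.xml"
	documentIWA = "Index/Document.iwa"
)

// pickStrategy chooses how to extract text based on which entries exist.
// The order of preference (PDF, then XML, then IWA) is an assumption.
func pickStrategy(names []string) string {
	has := make(map[string]bool, len(names))
	for _, n := range names {
		has[n] = true
	}
	switch {
	case has[previewPDF]:
		return "pdf"
	case has[indexXML]:
		return "xml"
	case has[documentIWA]:
		return "iwa"
	}
	return "unsupported"
}

func main() {
	fmt.Println(pickStrategy([]string{"Index/Document.iwa", "Data/image.png"}))
}
```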

Does any of this work for iWork '14 files?

Jan 31 '18 13:01 gonedjur

Best thing to do is to test it and see. The Pages format is pretty hacky.

Jan 31 '18 21:01 mish15

Looks like a no.

2018/02/01 14:39:28 Received file: t.pages (application/vnd.apple.pages)
archiveInfo:
2018/02/01 14:39:28 {"body":"","meta":{},"msecs":2}

Edit:

I wonder how these guys do it. https://cloudconvert.com/formats/document/pages

They somehow manage 5.5. They're the only ones I've seen do it...

Feb 01 '18 15:02 gonedjur

We welcome pull requests! :)

Feb 01 '18 20:02 mish15

It’s definitely possible, we just need to play with the encoding. From memory it wasn’t well documented anywhere, but it may be these days.

Feb 01 '18 20:02 mish15
