PDF-Writer Flate stream decoding issues

Hi,

I'm running into another issue decoding a flate stream. First and foremost, the decoded data doesn't appear to be right; I have attached what I'm getting and what I expect. Also, with the current buffer size I don't seem to read the full length I expect from this stream. If I change the buffer size to something that divides 120000 evenly, like 1000, then it reads the expected 120000 bytes.

Below is a very trimmed down version of what I'm doing. The PDF file causing the problem is also attached.

files.zip

#include <iostream>
#include <fstream>
#include <string>
#include "PDFHummus/PDFParser.h"
#include "PDFHummus/InputFile.h"
#include "PDFHummus/PDFStreamInput.h"
#include "PDFHummus/IByteReader.h"
#include "PDFHummus/EStatusCode.h"

using namespace std;
using namespace PDFHummus;

void decodeStream(char *path);

int main(int count, char* args[]) {
    if (count < 2) {
        cerr << "PDF file required" << endl;
        return 1;
    }

    if (count == 2) {
        decodeStream(args[1]);
    }

    return 0;
}

void decodeStream(char *path) {
    PDFParser parser;
    InputFile pdfFile;
    EStatusCode status = pdfFile.OpenFile(path);
    if(status == eSuccess) {
        status = parser.StartPDFParsing(pdfFile.GetInputStream());
        if(status == eSuccess) {
            // Parse image object
            PDFObject* streamObj = parser.ParseNewObject(7);
            if (streamObj != NULL
                && streamObj->GetType() == PDFObject::ePDFObjectStream) {
                PDFStreamInput* stream = ((PDFStreamInput*)streamObj);
                IByteReader* reader = parser.StartReadingFromStream(stream);
                if (!reader) {
                    cout << "Couldn't create reader\n";
                    return; // no reader, so nothing to decode; avoid dereferencing NULL below
                }

                ofstream os("image.data", ofstream::binary);
                Byte buffer[1024];
                LongBufferSizeType total = 0;
                while(reader->NotEnded()) {
                    LongBufferSizeType readAmount = reader->Read(buffer,1024);
                    os.write((char*)buffer, readAmount);
                    total += readAmount;
                }

                os.close();

                cout << "Total read: " << total << "\n";
                cout << "Expected read: " << 120000 << "\n";
            }
        }
    }
}

Feb 19 '17 06:02 tiliasagen

oh, this was lovely! This found two issues:

Several issues with the png predictors, but mostly predictor 15, which read the filter function from the wrong byte. most of them also zeroed the wrong starter byte. wow. for 6 years with no interruption.
Flate decode may wrongly report NotEnded(): there's sometimes some garbage after the flate stream has ended, which translates to reading nothing. i should prep the whole code to just live with it (i don't think i can fix flate decode without a performance penalty). let's start here.

anyways - both issues are corrected with commit f179eea69c1b3970336e8a989142246a56f12f02.
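
For callers who want to be tolerant of the second issue on their side, here is a minimal sketch of a read loop that treats a zero-length read as end of data, so it cannot spin when NotEnded() stays true after the payload. SketchReader and readAll are hypothetical names made up for this illustration; SketchReader only mimics the Read/NotEnded shape used in the code above and is not the PDFHummus IByteReader, and this is not the change made in the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef unsigned char Byte;

// Hypothetical stand-in for a decoded-stream reader (NOT the PDFHummus class).
// With trailingGarbage set, NotEnded() keeps returning true after the payload
// is exhausted, imitating the behaviour described above.
struct SketchReader {
    std::vector<Byte> data;
    std::size_t pos;
    bool trailingGarbage;

    SketchReader(const std::vector<Byte>& d, bool garbage)
        : data(d), pos(0), trailingGarbage(garbage) {}

    // Copies up to `size` bytes into `buffer`, returns how many were copied.
    std::size_t Read(Byte* buffer, std::size_t size) {
        std::size_t n = std::min(size, data.size() - pos);
        if (n > 0) std::memcpy(buffer, data.data() + pos, n);
        pos += n;
        return n;
    }

    bool NotEnded() const { return trailingGarbage || pos < data.size(); }
};

// Drains any reader exposing Read/NotEnded. A read that yields no bytes is
// taken as end of data, even if NotEnded() still claims there is more.
template <typename Reader>
std::vector<Byte> readAll(Reader& reader, std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<Byte> out;
    std::vector<Byte> buffer(chunkSize);
    while (reader.NotEnded()) {
        std::size_t got = reader.Read(buffer.data(), buffer.size());
        if (got == 0) break;  // nothing came back: stop instead of looping
        out.insert(out.end(), buffer.begin(), buffer.begin() + got);
    }
    return out;
}
```

With a 120000-byte payload and a 1024-byte chunk, the loop does 117 full reads plus one 192-byte read, then stops on the first empty read.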

Feb 25 '17 13:02 galkahana

oh, by the way, i hope you don't mind, but i added your code and file as a new test in the code. if you do - let me know and i'll remove them.

Feb 25 '17 13:02 galkahana

You rock Gal, works great!

You're gonna hate me, but I found a few more. Using the same code as above but with different objects, here are two sets that don't seem to decode correctly either. Each zip contains a "decoded" folder with the results from PDF-Writer and a "correct" folder with the expected results. I named each data file as pageXX-objYY.

In the case of set2, I can't seem to decode obj731. It seems to get stuck in a loop and never finishes.

set1.zip set2.zip

Feb 25 '17 15:02 tiliasagen

Btw thanks so much for taking a look at this so quickly. Really awesome library you've created!

Feb 25 '17 15:02 tiliasagen

look. how about you help me here?

Feb 25 '17 15:02 galkahana

I would be happy to help, but although I've spent a lot of time with other parts of the PDF spec, I've spent very little with filters and PNG predictors, so it will take me quite a while to get up to speed.

Feb 25 '17 16:02 tiliasagen

same here

Feb 25 '17 16:02 galkahana

Ok, set2 actually seems to have a problem with ASCII85 decoding, so I'm looking into that, and if I can fix it I'll post the results. For set1 I can probably use libpng to do what I need, so I probably won't get to that right now.

Feb 26 '17 11:02 tiliasagen

Ok, fixed the issue with ASCII85 decoding and made a pull request.

Feb 26 '17 12:02 tiliasagen

thanks man. RE set1: no need for libpng here. i implemented the predictors, it just seems like i've got some bugs. i see that they normally use 15, so this is probably where the bugs are: https://github.com/galkahana/PDF-Writer/blob/master/PDFWriter/InputPredictorPNGOptimumStream.cpp#L100

i'll take a look sometime.

Gal.

Feb 26 '17 13:02 galkahana

No problem. I think I have a fix for one case with the PNG predictors now too, which I created a pull request for. If it's right, a similar change is needed for InputPredictorPNGSubStream and probably the others too, but I don't have any samples to test with yet.

Feb 26 '17 18:02 tiliasagen

thanks. i reckon we could probably reuse InputPredictorPNGOptimumStream in all of them, as png requires the first byte to state the function type anyway. i'll look into it sometime and, if so, reuse it. thanks
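
To illustrate why a single decoder could serve all the PNG predictors, here is a minimal sketch of PNG row unfiltering driven only by the per-row tag byte (0 None, 1 Sub, 2 Up, 3 Average, 4 Paeth, as in the PNG specification). pngUnfilter and paeth are hypothetical names for this example; this is not the code in InputPredictorPNGOptimumStream, and it assumes every row is stored as one tag byte followed by rowBytes of filtered data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

typedef unsigned char Byte;

// Paeth predictor as defined by the PNG specification:
// a = left, b = up, c = upper left.
static Byte paeth(Byte a, Byte b, Byte c) {
    int p = int(a) + int(b) - int(c);
    int pa = std::abs(p - int(a));
    int pb = std::abs(p - int(b));
    int pc = std::abs(p - int(c));
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// Reverses PNG filtering. `filtered` holds rows of (1 + rowBytes) bytes:
// a filter-type tag, then the filtered row. bytesPerPixel is the pixel size
// in bytes, rounded up to at least 1. Incomplete trailing rows are ignored.
std::vector<Byte> pngUnfilter(const std::vector<Byte>& filtered,
                              std::size_t rowBytes,
                              std::size_t bytesPerPixel) {
    assert(rowBytes > 0 && bytesPerPixel > 0);
    std::vector<Byte> out;
    std::vector<Byte> prev(rowBytes, 0);  // the row above the first is all zeros
    std::vector<Byte> cur(rowBytes);
    const std::size_t stride = rowBytes + 1;
    for (std::size_t start = 0; start + stride <= filtered.size(); start += stride) {
        const Byte type = filtered[start];  // the tag is the FIRST byte of the row
        for (std::size_t i = 0; i < rowBytes; ++i) {
            const Byte x = filtered[start + 1 + i];
            const Byte a = i >= bytesPerPixel ? cur[i - bytesPerPixel] : 0;
            const Byte b = prev[i];
            const Byte c = i >= bytesPerPixel ? prev[i - bytesPerPixel] : 0;
            switch (type) {
                case 1: cur[i] = Byte(x + a); break;                       // Sub
                case 2: cur[i] = Byte(x + b); break;                       // Up
                case 3: cur[i] = Byte(x + (int(a) + int(b)) / 2); break;   // Average
                case 4: cur[i] = Byte(x + paeth(a, b, c)); break;          // Paeth
                default: cur[i] = x; break;  // 0 = None; unknown tags pass through (sketch choice)
            }
        }
        out.insert(out.end(), cur.begin(), cur.end());
        prev = cur;
    }
    return out;
}
```

Because the tag is read per row, the same loop handles a stream where every row uses Sub, every row uses Up, or the filter changes from row to row.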

Mar 04 '17 09:03 galkahana

I agree, that makes a lot of sense.

Mar 04 '17 16:03 tiliasagen