Flate stream decoding issues
Hi,
I'm running into another issue decoding a flate stream. First and foremost, the decoded data doesn't appear to be right; I have attached what I'm getting and what I expect. Also, with the current buffer size (1024) I don't seem to be reading the full length I expect from this stream. If I change the buffer size to something that divides 120000 evenly, like 1000, then it reads the expected 120000 bytes.
Below is a very trimmed down version of what I'm doing. The PDF file causing the problem is also attached.
#include <iostream>
#include <fstream>
#include <string>
#include "PDFHummus/PDFParser.h"
#include "PDFHummus/InputFile.h"
#include "PDFHummus/PDFStreamInput.h"
#include "PDFHummus/IByteReader.h"
#include "PDFHummus/EStatusCode.h"
using namespace std;
using namespace PDFHummus;
void decodeStream(char *path);
int main(int count, char* args[]) {
    if (count < 2) {
        cerr << "PDF file required" << endl;
        return 1;
    }
    if (count == 2) {
        decodeStream(args[1]);
    }
    return 0;
}
// Decodes object 7 of the given PDF and dumps the bytes to image.data
void decodeStream(char *path) {
    PDFParser parser;
    InputFile pdfFile;
    EStatusCode status = pdfFile.OpenFile(path);
    if (status == eSuccess) {
        status = parser.StartPDFParsing(pdfFile.GetInputStream());
        if (status == eSuccess) {
            // Parse image object
            PDFObject* streamObj = parser.ParseNewObject(7);
            if (streamObj != NULL
                && streamObj->GetType() == PDFObject::ePDFObjectStream) {
                PDFStreamInput* stream = ((PDFStreamInput*)streamObj);
                IByteReader* reader = parser.StartReadingFromStream(stream);
                if (!reader) {
                    cout << "Couldn't create reader\n";
                    return; // don't dereference a null reader below
                }
                ofstream os("image.data", ofstream::binary);
                Byte buffer[1024];
                LongBufferSizeType total = 0;
                while (reader->NotEnded()) {
                    LongBufferSizeType readAmount = reader->Read(buffer, 1024);
                    os.write((char*)buffer, readAmount);
                    total += readAmount;
                }
                os.close();
                cout << "Total read: " << total << "\n";
                cout << "Expected read: " << 120000 << "\n";
            }
        }
    }
}
Oh, this was lovely! It turned up two issues:
- Several issues with the PNG predictors, but mostly with predictor 15, which read the function type from the wrong byte. Most of them also zeroed the wrong starting byte. Wow. It's been like that for 6 years.
- Flate decode may report NotEnded() as true when the data has in fact ended: there's sometimes some garbage after the end of the flate stream, which translates to a read that returns nothing. I should prep the whole code to just live with it (I don't think I can fix flate decode without a performance penalty). Let's start here.
Anyway, both issues are corrected in commit f179eea69c1b3970336e8a989142246a56f12f02.
Oh, by the way, I hope you don't mind, but I added your code and file to the code base as a new test. If you do mind, let me know and I'll remove them.
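For anyone who wants their own read loop to tolerate the trailing-garbage case described above, one option is to also stop when a read returns zero bytes. The following is only a minimal sketch of that idea: `FakeReader` and `readAll` are hypothetical names made up for illustration, and `FakeReader` merely imitates a reader with `Read()`/`NotEnded()` methods. It is not PDF-Writer's actual class or fix.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for a byte reader with Read()/NotEnded() semantics.
// When stuckOnGarbage is true, NotEnded() keeps returning true after the
// payload is exhausted, imitating the quirk described in the thread.
class FakeReader {
public:
    FakeReader(std::vector<unsigned char> data, bool stuckOnGarbage)
        : mData(std::move(data)), mPos(0), mStuck(stuckOnGarbage) {}

    // Copies up to maxBytes into out and returns the number of bytes copied.
    std::size_t Read(unsigned char* out, std::size_t maxBytes) {
        std::size_t remaining = mData.size() - mPos;
        std::size_t n = remaining < maxBytes ? remaining : maxBytes;
        if (n > 0) {
            std::memcpy(out, mData.data() + mPos, n);
            mPos += n;
        }
        return n;
    }

    bool NotEnded() const { return mPos < mData.size() || mStuck; }

private:
    std::vector<unsigned char> mData;
    std::size_t mPos;
    bool mStuck;
};

// Reads everything the reader can deliver. A zero-length read is treated as
// end of data, so the loop terminates even if NotEnded() never turns false.
template <typename Reader>
std::size_t readAll(Reader& reader, std::vector<unsigned char>& out) {
    unsigned char buffer[1024];
    std::size_t total = 0;
    while (reader.NotEnded()) {
        std::size_t n = reader.Read(buffer, sizeof(buffer));
        if (n == 0) {
            break;  // nothing decoded: stop instead of spinning
        }
        out.insert(out.end(), buffer, buffer + n);
        total += n;
    }
    return total;
}
```

The buffer size no longer matters with this loop: a final partial read (120000 is not a multiple of 1024) is appended like any other, and the first empty read ends the loop.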
You rock Gal, works great!
You're gonna hate me, but I found a few more. Using the same code as above but with different objects, here are two sets that don't seem to be decoding correctly either. Each zip contains a "decoded" folder with the results from PDF-Writer and a "correct" folder with the expected results. I named each data file pageXX-objYY.
In the case of set2, I can't seem to decode obj731 at all. It seems to get stuck in a loop and never finishes.
Btw thanks so much for taking a look at this so quickly. Really awesome library you've created!
look. how about you help me here?
I would be happy to help, but although I've spent a lot of time with other parts of the PDF spec, I've spent very little with filters and PNG predictors, so it will take me quite a while to get up to speed.
same here
OK, set2 actually seems to have a problem with ASCII85 decoding, so I'm looking into that, and if I can fix it I'll post the results. For set1 I can probably use libpng to do what I need, so I probably won't get to that right now.
OK, I fixed the issue with ASCII85 decoding and made a pull request.
Thanks, man. Re set1: no need for libpng here. I implemented the predictors; it just seems like I've got some bugs. I see that they normally use 15, so this is probably where the bugs are: https://github.com/galkahana/PDF-Writer/blob/master/PDFWriter/InputPredictorPNGOptimumStream.cpp#L100
I'll take a look sometime.
Gal.
No problem. I think I have a fix for one case with the PNG predictors now too, which I created a pull request for. If it's right, a similar change is needed for InputPredictorPNGSubStream and probably the others too, but I don't have any samples to test with yet.
Thanks. I reckon we could probably reuse InputPredictorPNGOptimumStream in all of them, as PNG requires the first byte of each row to state the function type anyway. I'll look into it sometime and, if so, reuse it. Thanks.
I agree, that makes a lot of sense.
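To illustrate why a single decoder could cover all the PNG predictors: in PNG-filtered data every row starts with one byte naming the filter used for that row (0 None, 1 Sub, 2 Up, 3 Average, 4 Paeth, per the PNG specification), so the decoder can simply dispatch on that byte. The sketch below is a standalone illustration of that row-by-row scheme, not the code in InputPredictorPNGOptimumStream; the names `pngUnfilter` and `paeth` and the whole-buffer interface are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <vector>

// Paeth predictor as defined by the PNG specification.
static std::uint8_t paeth(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    int p = int(a) + int(b) - int(c);
    int pa = std::abs(p - int(a));
    int pb = std::abs(p - int(b));
    int pc = std::abs(p - int(c));
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// Undoes PNG row filtering. `in` holds rows of (1 + rowBytes) bytes each:
// a filter-type byte followed by the filtered row. `bpp` is the number of
// bytes per pixel, rounded up to at least 1.
std::vector<std::uint8_t> pngUnfilter(const std::vector<std::uint8_t>& in,
                                      std::size_t rowBytes, std::size_t bpp) {
    if (rowBytes == 0 || bpp == 0 || in.size() % (rowBytes + 1) != 0) {
        throw std::invalid_argument("bad row layout");
    }
    const std::size_t rows = in.size() / (rowBytes + 1);
    std::vector<std::uint8_t> out(rows * rowBytes);
    const std::vector<std::uint8_t> zeroRow(rowBytes, 0);  // "row above" for row 0

    for (std::size_t r = 0; r < rows; ++r) {
        const std::uint8_t* src = &in[r * (rowBytes + 1)];
        const std::uint8_t type = src[0];  // filter type is the row's first byte
        ++src;                             // filtered data starts right after it
        std::uint8_t* cur = &out[r * rowBytes];
        const std::uint8_t* up = r ? &out[(r - 1) * rowBytes] : zeroRow.data();

        for (std::size_t i = 0; i < rowBytes; ++i) {
            const std::uint8_t a = i >= bpp ? cur[i - bpp] : 0;  // left
            const std::uint8_t b = up[i];                        // above
            const std::uint8_t c = i >= bpp ? up[i - bpp] : 0;   // above-left
            std::uint8_t pred = 0;
            switch (type) {
                case 0: pred = 0; break;
                case 1: pred = a; break;
                case 2: pred = b; break;
                case 3: pred = static_cast<std::uint8_t>((a + b) / 2); break;
                case 4: pred = paeth(a, b, c); break;
                default: throw std::invalid_argument("unknown filter type");
            }
            cur[i] = static_cast<std::uint8_t>(src[i] + pred);  // wraps mod 256
        }
    }
    return out;
}
```

For example, with `rowBytes = 3` and `bpp = 1`, the input rows `{1, 10, 5, 5}`, `{2, 1, 1, 1}` and `{3, 1, 1, 1}` (Sub, Up, Average) decode to `10 15 20`, `11 16 21` and `6 12 17`.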