pdfparser
PNG Images with FlateDecode are corrupt
I am trying to extract all XObjects (images) from the test PDF test.pdf.
Only the JPEG images, which do not use FlateDecode, are decoded correctly (raw JPEG data).
The other images are just 0x00... or 0xFF... byte garbage. I think the plain gzuncompress call may not be enough and the DecodeParms
/DecodeParms << /Predictor 15 /Colors 1 /Columns 1200 /BitsPerComponent 8>>
must be respected too.
I found this old piece of code: https://github.com/frapi/frapi/blob/ef50192b6cf336ef2c4c0fc3ad122194e3d0ecde/src/frapi/library/Zend/Pdf/Filter/Compression.php
but had no success with it.
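For readers who want to see what respecting /Predictor 15 involves, here is a minimal sketch of reversing the PNG row filters after the stream has been inflated. The function name undoPngPredictor is hypothetical and is not part of pdfparser or Zend_Pdf. It assumes a PNG predictor (values 10 to 15), where every row starts with one filter-type byte, and it does not handle the TIFF predictor (value 2).

```php
<?php
declare(strict_types=1);

/**
 * Reverse the PNG row filters on already inflated FlateDecode data.
 * Hypothetical helper: a sketch, not the library's implementation.
 */
function undoPngPredictor(string $data, int $columns, int $colors = 1, int $bitsPerComponent = 8): string
{
    // Bytes per complete pixel (at least 1) and bytes per row without the filter byte.
    $bpp = max(1, intdiv($colors * $bitsPerComponent + 7, 8));
    $rowLen = intdiv($columns * $colors * $bitsPerComponent + 7, 8);

    $prev = array_fill(0, $rowLen, 0); // the row above the first row counts as all zeros
    $out = '';
    $len = strlen($data);

    for ($pos = 0; $pos + 1 + $rowLen <= $len; $pos += 1 + $rowLen) {
        $filter = ord($data[$pos]);
        $row = array_values(unpack('C*', substr($data, $pos + 1, $rowLen)));

        for ($i = 0; $i < $rowLen; $i++) {
            $a = $i >= $bpp ? $row[$i - $bpp] : 0;  // left neighbour (already reconstructed)
            $b = $prev[$i];                         // byte above
            $c = $i >= $bpp ? $prev[$i - $bpp] : 0; // upper-left neighbour

            switch ($filter) {
                case 0: $pred = 0; break;                 // None
                case 1: $pred = $a; break;                // Sub
                case 2: $pred = $b; break;                // Up
                case 3: $pred = intdiv($a + $b, 2); break; // Average
                case 4:                                   // Paeth
                    $p = $a + $b - $c;
                    $pa = abs($p - $a);
                    $pb = abs($p - $b);
                    $pc = abs($p - $c);
                    $pred = ($pa <= $pb && $pa <= $pc) ? $a : ($pb <= $pc ? $b : $c);
                    break;
                default:
                    throw new Exception("Unknown PNG filter type $filter");
            }
            $row[$i] = ($row[$i] + $pred) & 0xFF;
        }

        $out .= pack('C*', ...$row);
        $prev = $row;
    }

    return $out;
}
```

With the DecodeParms from this report, the call would look like undoPngPredictor($inflated, 1200, 1, 8), where $inflated is assumed to be the result of inflating the stream content. The result is raw pixel data without any image file header.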
Thanks for your analysis @k00ni, you were right (except for gzuncompress) and I found a way to decode with the frapi library source code. I'll try to write a PR when I have time. If you still need help, don't hesitate to ask.
"the other images are just 0x00... or 0xFF... byte garbage"
This was not garbage but the Netpbm image format (http://davis.lbl.gov/Manuals/NETPBM/doc/index.html).
Can you share the code for decoding with the frapi library?
@r4zielrc in short:
- Take the Frapi (framework) Compression.php file mentioned by @brannow.
- In the class definition, remove the abstract keyword and implements Zend_Pdf_Filter_Interface.
- Replace all Zend_Pdf_Exception by the standard Exception.
- Change _applyDecodeParams from protected to public.
- Where the $params values are read in each method (lines 73, 97, 119, 142), cast the variable to int and add ->getContent() at the end.
- Call Zend_Pdf_Filter_Compression::_applyDecodeParams with
  - $imageObject->getContent() as $data
  - $imageObject->getHeaders()['DecodeParms']->getElements() as $params
- Do this only if DecodeParms exists; otherwise it may be a plain JPEG image, which does not need this transformation.
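The int casts and the "only if DecodeParms exists" rule from the steps above can be sketched as follows. This is an illustration under assumptions: the functions normalizeDecodeParms and needsPredictorDecoding are hypothetical, and the plain array stands in for whatever the PDF library returns for the dictionary. The default values are the ones the PDF specification gives for FlateDecode parameters.

```php
<?php
declare(strict_types=1);

/**
 * Cast every known /DecodeParms entry to int and fill in the PDF defaults.
 * $params is a plain array (or null when the dictionary is absent); this
 * input shape is an assumption made for the sketch.
 */
function normalizeDecodeParms(?array $params): array
{
    $defaults = [
        'Predictor' => 1,        // 1 means "no prediction"
        'Colors' => 1,
        'BitsPerComponent' => 8,
        'Columns' => 1,
    ];

    $result = [];
    foreach ($defaults as $key => $default) {
        $result[$key] = isset($params[$key]) ? (int) $params[$key] : $default;
    }

    return $result;
}

/**
 * Predictor decoding is only needed when a predictor other than 1 is set.
 */
function needsPredictorDecoding(?array $params): bool
{
    return normalizeDecodeParms($params)['Predictor'] > 1;
}
```

A caller would pass null when the image header has no DecodeParms entry, in which case needsPredictorDecoding returns false and the stream content can be used without the extra transformation.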
@ajira86 I got Compression.php working and I convert the raw PPM/PGM data to a GdImage, but do you know how to detect PPM vs. PGM and the right format? For now I map the input to the same output formats that are created by the Linux command-line tool pdfimages from the xpdf package:
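// Pick the Netpbm magic number and file extension for 8-bit images.
if ($bitsPerComponent === 8) {
    if ($colors === 3) {
        // Three components per pixel: colour image (PPM)
        $magic = 'P6';
        $extension = 'ppm';
    } elseif ($colors === 1) {
        // One component per pixel: grayscale image (PGM)
        $magic = 'P5';
        $extension = 'pgm';
    }
}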
For P5, I think the relevant part could also be /DeviceGray:
<</Type /XObject
/Subtype /Image
/Width 200
/Height 200
/ColorSpace /DeviceGray
/BitsPerComponent 8
/Filter /FlateDecode
/DecodeParms <</Predictor 15 /Colors 1 /BitsPerComponent 8 /Columns 200>>
/Length 242>>
This is my P6 data:
<</Type /XObject
/Subtype /Image
/Width 200
/Height 200
/SMask 28 0 R
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Filter /FlateDecode
/DecodeParms <</Predictor 15 /Colors 3 /BitsPerComponent 8 /Columns 200>>
/Length 1090>>
stream
P5 is a mask (/SMask 28 0 R) for P6, but this is also not supported by pdfimages and is currently not relevant for my use case.
If the detection of the PPM format is confirmed, I can provide a patch for the library. Currently I only have a hack that fixes the output of the library.
@aheissenberger in my case I was first using PPM (P6), which didn't need to be decoded. The BitsPerComponent was present, but only in the general header. The DecodeParms object didn't exist in my case, so the Colors attribute was not present in the PDF data.
So it seems that DecodeParms is optional. If it is not present, you don't need any decoding operation and only have to prepend the missing header to the raw data.
My P6 data:
<</Type /XObject
/Subtype /Image
/Width 1090
/Height 1090
/ColorSpace /DeviceGray
/BitsPerComponent 8
/Filter /FlateDecode
/Length 1174>>
My P4 data:
<</Type /XObject
/Subtype /Image
/Width 1090
/Height 1090
/ColorSpace /DeviceGray
/BitsPerComponent 1
/Filter /FlateDecode
/DecodeParms <</Columns 1090 /Colors 1 /Predictor 15 /BitsPerComponent 1>>
/Length 2028>>
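Prepending the missing header, as described above, could look like the sketch below. Both function names are hypothetical. The mapping follows the P6/P5 choice from the earlier comment and adds P4 for 1-bit data. Deriving the component count from /ColorSpace when DecodeParms is absent is an assumption, as is the default /Decode array. For P4 the bits are inverted because PBM treats 1 as black, while 1-bit DeviceGray treats 0 as black.

```php
<?php
declare(strict_types=1);

/**
 * Number of colour components for the two device colour spaces seen in this
 * thread. Other colour spaces are deliberately not covered by this sketch.
 */
function colorsFromColorSpace(string $colorSpace): int
{
    switch ($colorSpace) {
        case 'DeviceGray': return 1;
        case 'DeviceRGB':  return 3;
        default:
            throw new InvalidArgumentException("Unsupported colour space $colorSpace");
    }
}

/**
 * Prepend a raw Netpbm header to decoded pixel data.
 * Returns [file extension, file content].
 */
function wrapAsNetpbm(string $pixels, int $width, int $height, int $colors, int $bitsPerComponent): array
{
    if ($bitsPerComponent === 8 && $colors === 3) {
        return ['ppm', "P6\n$width $height\n255\n" . $pixels];
    }
    if ($bitsPerComponent === 8 && $colors === 1) {
        return ['pgm', "P5\n$width $height\n255\n" . $pixels];
    }
    if ($bitsPerComponent === 1 && $colors === 1) {
        // PBM has no maxval line; invert every byte so that black stays black.
        return ['pbm', "P4\n$width $height\n" . ~$pixels];
    }

    throw new InvalidArgumentException('No Netpbm mapping for this combination');
}
```

For the 1090 x 1090 DeviceGray image with 8 bits per component, the call would be wrapAsNetpbm($pixels, 1090, 1090, colorsFromColorSpace('DeviceGray'), 8), with $pixels assumed to hold the decoded stream content.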
@aheissenberger do you need any help with your PR?
@ajira86 I need to find the time ;-) and will ask for help if I have a problem - Thanks :-)