JBIG2 extraction can be significantly simplified
Feature request
The current code for extracting a JBIG2 file decodes the data stream into segments, decodes the segment headers, and then re-encodes the segments back into a data stream. The output of JBIG2StreamWriter.write_segments should be identical to the original input stream.
JBIG2StreamWriter(output_stream).write_segments(JBIG2StreamReader(input_stream).get_segments(), fix_last_page=False)
output_stream.seek(0)
input_stream.seek(0)
assert output_stream.read() == input_stream.read()
Given that, the only changes JBIG2StreamWriter.write_file makes to the input are to prepend a header and append the tail material. The header does not need to know about the contents of the input stream at all, and the tail material only needs to know how many segments were in the input stream and nothing else.
Given that's all we need to know, we could dramatically reduce the code footprint of JBIG2 extraction with something like this.
output_stream.write(header)
num_segments = count_segments(input_stream)
input_stream.seek(0)
output_stream.write(input_stream.read())  # copy the segment bytes through unchanged
tail = tail_material(num_segments)
output_stream.write(tail)
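For illustration, here is a rough sketch of what the `header` and `tail_material` placeholders above could look like. The byte layout follows my reading of the JBIG2 file header and segment header format (8-byte ID string, a flags byte, a 4-byte page count; segment types 49 and 51 for end of page and end of file). The names `file_header` and `empty_segment` are hypothetical, and the sketch assumes the input segments are numbered sequentially from 0, which may not hold for every stream.

```python
import struct

# 8-byte JBIG2 file header ID string.
FILE_HEADER_ID = b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a"
SEG_TYPE_END_OF_PAGE = 49
SEG_TYPE_END_OF_FILE = 51


def file_header(num_pages: int = 1) -> bytes:
    """ID string, flags byte (bit 0 = sequential organisation), page count."""
    return FILE_HEADER_ID + struct.pack(">BL", 0x01, num_pages)


def empty_segment(number: int, seg_type: int, page: int) -> bytes:
    """Segment header with no referred-to segments and no data."""
    # number (4 bytes), flags/type (1), referred-to count (1),
    # page association (1), data length (4)
    return struct.pack(">LBBBL", number, seg_type, 0, page, 0)


def tail_material(num_segments: int, add_end_of_page: bool = True) -> bytes:
    """Optional end-of-page segment followed by an end-of-file segment."""
    # ASSUMPTION: input segments are numbered 0 .. num_segments - 1,
    # so the next free segment number is num_segments.
    next_number = num_segments
    tail = b""
    if add_end_of_page:
        tail += empty_segment(next_number, SEG_TYPE_END_OF_PAGE, page=1)
        next_number += 1
    tail += empty_segment(next_number, SEG_TYPE_END_OF_FILE, page=0)
    return tail
```

With these, `header = file_header()` is 13 bytes and does not depend on the input at all, and the tail is at most two 11-byte segment headers.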
This is ~exactly right~ mostly correct (see below). In fact you don't even need to count the segments: nobody cares what number the final segment has, because it's an EOF and couldn't possibly be anything else. See https://github.com/dhdaines/playa/pull/136 for how to do it.
Note, however:
- There are quite often global segments shared between all the JBIG2 images in a PDF, and you need to tack them on to the start of the output. I don't know if the segment numbers are supposed to get renumbered because of this, but pdfminer.six doesn't do that (but it does include the global segments).
- I don't know (is JBIG2 an open standard?) if segment numbers are supposed to be strictly sequential, so simply counting the segments as you are proposing to do might not give you the correct segment number for the end of file segment.
- pdfminer.six is also trying to make things strictly compliant by adding an end of page segment if one doesn't exist (which it probably doesn't) before the end of file segment. I don't think this is necessary, but...
The PLAYA approach is good enough to satisfy jbig2dec, at least, and is much less susceptible to bugs, security holes, etc.
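Putting the notes above together, a minimal sketch of the "don't parse anything" approach might look like the following. This is not the PLAYA or pdfminer.six code: `write_jbig2_file` and its parameters are hypothetical names, the global segments are assumed to be available as raw bytes, and the end-of-file segment number is left to the caller because it is unclear what value (if any) decoders require. No end-of-page segment is added.

```python
import io
import shutil
import struct
from typing import BinaryIO

# File header: 8-byte ID string, flags 0x01 (sequential), page count 1.
FILE_HEADER = b"\x97\x4a\x42\x32\x0d\x0a\x1a\x0a" + struct.pack(">BL", 0x01, 1)
SEG_TYPE_END_OF_FILE = 51


def write_jbig2_file(
    out: BinaryIO,
    image_stream: BinaryIO,
    eof_segment_number: int,
    globals_data: bytes = b"",
) -> None:
    """Wrap an embedded JBIG2 stream as a standalone file without parsing it."""
    out.write(FILE_HEADER)
    # Shared global segments, if any, go before the image's own segments.
    out.write(globals_data)
    # The image segments are copied through verbatim.
    shutil.copyfileobj(image_stream, out)
    # End-of-file segment: no referred-to segments, page 0, zero data length.
    out.write(struct.pack(">LBBBL", eof_segment_number, SEG_TYPE_END_OF_FILE, 0, 0, 0))


# Usage with placeholder bytes standing in for real segment data.
out = io.BytesIO()
write_jbig2_file(out, io.BytesIO(b"IMG"), 7, globals_data=b"GLOB")
```

Since nothing is decoded or re-encoded, the only bytes this code is responsible for are the 13-byte header and the 11-byte end-of-file segment.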