pdfrw

Use bytestrings for streams

Open bshillingford opened this issue 6 years ago • 4 comments

Currently, streams are stored as unicode strings (Python 3.5.2, latest pip version of pdfrw), and pdfrw's utils convert to/from Latin-1 on the fly, which is quite fragile for binary data.

bshillingford avatar Sep 17 '18 00:09 bshillingford
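For readers following along, here is a minimal sketch of what holding binary data in a Latin-1 string implies. It uses only the Python 3 standard library, not pdfrw itself, and `to_stream_bytes` is a hypothetical helper named for illustration. Every byte value survives the round trip; the conversion can only fail when the string contains a character above U+00FF.

```python
# All 256 byte values map one-to-one onto code points U+0000..U+00FF
# under Latin-1, so bytes -> str -> bytes is lossless.
raw = bytes(range(256))
as_text = raw.decode("latin-1")          # bytes -> str, cannot fail
assert as_text.encode("latin-1") == raw  # str -> bytes, same data back


def to_stream_bytes(text):
    """Hypothetical helper: encode stream text to bytes, or raise."""
    return text.encode("latin-1")


# Encoding fails only if the text holds a character outside Latin-1,
# for example a euro sign assigned into a stream by mistake.
try:
    to_stream_bytes("price: 5\u20ac")
    failed = False
except UnicodeEncodeError:
    failed = True

print("non-Latin-1 text rejected:", failed)
```

So the representation is safe for data that came from bytes in the first place; the risk sits with text that a user builds by hand and assigns to a stream.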

You're welcome to your opinion, but my opinion is that needing to write code like if a == b'>': is much more fragile than code like if a == '>':

The brain-dead half-assed attempt to completely separate strings from bytes in Python 3.0 went completely awry, and it wasn't until 3.3 that that shit was even supportable again.

Yes, it still chaps my ass that 'a' != b'a'. There was never a good reason for this. It was told to the maintainers by many people at the time that it was beyond stupid, but at that time the maintenance cartel was all-web, all the time. It's the most major fuck up in the language, even worse than making division different than any other language and repurposing the C++ line comment character.

tl;dr a) don't get me started; b) the usage of strings internal to pdfrw that requires assignment from, or comparison to, string literals completely swamps the number of I/O touchpoints; c) remembering to prefix string literals with 'b' everywhere is quite stupid and error-prone; and d) don't use vague phrases like "quite fragile" unless you can prove with a unit test that something doesn't work.

pmaupin avatar Sep 17 '18 04:09 pmaupin

Whoa, didn't mean to open this can of worms. I agree with you about the shortcomings of bytestrings. I typed that hastily at night and probably wasn't clear: as a library user it's unexpected to find a string used for binary data, but data representation inside the library can be lists of ints for all I care.

Perhaps a documentation line in the README mentioning the encoding, or a bytestream property that translates to bytes on demand? Happy to send either as a PR, let me know.

Great library by the way. I'm a fan of the idea of making the library handle the low-level file type stuff, and making its responsibility stop there.

bshillingford avatar Sep 17 '18 09:09 bshillingford

Ahh, that makes sense. Sorry for being so snappish, but the what-kind-of-string thing has come up in multiple contexts too many times before in issues and PRs. I agree that some doc might be useful for users who are formatting their own stream data, and I like the idea of a bytestream property, so a PR to do that would be useful. It should use the convert_load and convert_store functions from the py23_diffs module.

Thanks, Pat

pmaupin avatar Sep 17 '18 16:09 pmaupin
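One possible shape for the bytestream property discussed above, as a rough sketch only. `StreamHolder` is a hypothetical stand-in class, not a pdfrw type, and the plain Latin-1 `encode`/`decode` calls stand in for `convert_store` and `convert_load`, whose exact behavior is not shown in this thread. A real PR would attach the property to whichever pdfrw object owns `stream`.

```python
class StreamHolder:
    """Hypothetical stand-in for an object with a text `stream` attribute."""

    def __init__(self, stream=""):
        # Stored as text, one character per byte (Latin-1), as the
        # thread describes for pdfrw streams.
        self.stream = stream

    @property
    def bytestream(self):
        # Translate to bytes on demand; nothing extra is stored.
        return self.stream.encode("latin-1")

    @bytestream.setter
    def bytestream(self, data):
        # Accept bytes from the caller and keep the internal text form.
        self.stream = data.decode("latin-1")


obj = StreamHolder()
obj.bytestream = b"\x78\x9c\x00\xff"   # caller works in bytes
print(type(obj.stream).__name__)        # internal form is still str
print(obj.bytestream == b"\x78\x9c\x00\xff")
```

Keeping `stream` as the single source of truth means the existing string comparisons inside the library are untouched, while callers who work with binary data never see the encoding.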

For the record, a workaround would be to apply .encode('latin-1') to the string. In my case I do something like this to inflate a compressed stream:

zlib.decompress(x.Root.Pages.Kids[0].Contents.stream.encode('latin-1'))

aspotashev avatar Oct 24 '18 14:10 aspotashev
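A self-contained sketch of the same workaround, runnable without a PDF file. The compressed stream is simulated here with `zlib.compress` and a Latin-1 decode, which is an assumption about how the text form arises; the content bytes are made-up sample data.

```python
import zlib

# Sample content-stream bytes, compressed the way a Flate stream would be.
original = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(original)

# Simulate the situation in the thread: binary data held in a str.
stream_text = compressed.decode("latin-1")

# Passing the str straight to zlib is rejected in Python 3.
try:
    zlib.decompress(stream_text)
    rejected = False
except TypeError:
    rejected = True

# The workaround: encode back to bytes before inflating.
inflated = zlib.decompress(stream_text.encode("latin-1"))

print("str rejected by zlib:", rejected)
print("round trip intact:", inflated == original)
```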