pypdf BUG: NameObject doesn't handle all values correctly

Name object encodes all characters from 1 to 255, so it's not correct to use utf-8.

7.3.5 Name objects
Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0).

Apr 16 '24 20:04 Rak424

@Rak424 on page 18 of PDF32000-1:2008 you will see

The 8bits only encoding will be done with Latin1

Apr 16 '24 21:04 pubpub-zz

@Rak424 please provide a pdf with the issue you are trying to fix. without I can not consider your PR as acceptable

Apr 17 '24 19:04 pubpub-zz

Hey @Rak424 and @pubpub-zz, We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance. I think it's worth considering in merging this or not.

To obtain those, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.

Apr 17 '24 19:04 art049

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.92%. Comparing base (6d9a7ec) to head (508c82c). Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2601      +/-   ##
==========================================
- Coverage   94.92%   94.92%   -0.01%     
==========================================
  Files          50       50              
  Lines        8316     8309       -7     
  Branches     1667     1665       -2     
==========================================
- Hits         7894     7887       -7     
  Misses        261      261              
  Partials      161      161

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Apr 17 '24 20:04 codecov[bot]

We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance.

If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?

Apr 18 '24 07:04 stefan6419846

Hey @Rak424 and @pubpub-zz, We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance. I think it's worth considering in merging this or not.

To obtain those, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.

The title identifies that this PR is dealing with an invalid handling. I'm concerned about getting the good result and preventing regressions. Are you dealing with decoding? Encoding?

Apr 18 '24 11:04 pubpub-zz

If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?

It may seem small, but with over 4.5M downloads per month, I think the cumulated impact of such a change is significant.

The title identifies that this PR is dealing with an invalid handling. I'm concerned about getting the good result and preventing regressions. Are you dealing with decoding? Encoding?

Yes maybe this speedup means there is a regression somewhere. Overall the bench includes encoding afaik:

Apr 18 '24 16:04 art049

Sorry to not explain directly what I've done, but I've seen different problems to solve in that part of code and some cases where clearly badly handled, here is a simple example, but not the only one:

bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/#FF"), None).write_to_stream(bio)
print(bio.getvalue)
> b'/#C3#BF'

The way it was done was not optimal, so I've decided to try to refactor it and I'm not surprised about perf improvement.

A correct implementation should be able to do something like this, since from specs encoding is not defined on Name object itself, but I know that's not easy to make that change:

NameObject(b"hello")
NameObject(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c (%)')

utf-8 is only for the specific case of creating a name from a text, but doesn't allow to handle all possible values, so I've added a fallback with latin1 who handle all values on a byte.

I often need some days/week and multiple tries to reach an optimal solution, tell me if you see some problems or changes to do.

Apr 18 '24 18:04 Rak424

Sorry to not explain directly what I've done, but I've seen different problems to solve in that part of code and some cases where clearly badly handled, here is a simple example, but not the only one:
bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/#FF"), None).write_to_stream(bio)
print(bio.getvalue)
> b'/#C3#BF'

b'/\xc3\xbf'.decode("utf-8") returns properly ÿ which is equivalent to b'/\xff'.decode("latin-1")

for me there is no error.

The way it was done was not optimal, so I've decided to try to refactor it and I'm not surprised about perf improvement.

A correct implementation should be able to do something like this, since from specs encoding is not defined on Name object itself, but I know that's not easy to make that change:
NameObject(b"hello")
NameObject(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c (%)')
utf-8 is only for the specific case of creating a name from a text, but doesn't allow to handle all possible values, so I've added a fallback with latin1 who handle all values on a byte.

I often need some days/week and multiple tries to reach an optimal solution, tell me if you see some problems or changes to do.

remember thatthe current code provides with NameObject.CHARSETS a way to add / remove charsets from the possible solutions and the default, based on other PRs uses the utf-8, gbk and then latin-1. all these cases come from actual PDFs. have you experienced any actual issues with some PDF?

Apr 18 '24 19:04 pubpub-zz

b"/#FF" corresponds to value b'\xff' and b'/#C3#BF' to value b'\xc3\xbf', that s not the same value and there is no encoding notion here, specs only says that in the specific case of when you try to use string value, you have to encode it with utf8, but utf8 doesn't allow to use all chars from 1 to 255, and current implementation of name object only allows str and not bytes in constructor, so my workaround is to fallback to an encoding who allows all values in one byte, latin1 seems a good choice for me. I don t know why gbk is present here but it seems useless. I don't have an example to show but the current implementation doesn't respect specifications, I'm not talking about handling wrong pdf. Don't hesitate to tell me if I've misunderstood something, pdf format is very very very bad.

Apr 19 '24 22:04 Rak424

b"/#FF" corresponds to value b'\xff' and b'/#C3#BF' to value b'\xc3\xbf', that s not the same value and there is no encoding notion here, specs only says that in the specific case of when you try to use string value,

you have to encode it with utf8, but utf8 doesn't allow to use all chars from 1 to 255and current implementation of name object only allows str and not bytes in constructor.

NameObjects are actually string like : the proof is about using the encoding with the '#' character. The byte encoding is just a matter of file encoding.

so my workaround is to fallback to an encoding who allows all values in one byte, latin1 seems a good choice for me. I don t know why gbk is present here but it seems useless.

This fix was introduced years ago. I' trying some archeology to find it

I don't have an example to show but the current implementation doesn't respect specifications, I'm not talking about handling wrong pdf.

Don't hesitate to tell me if I've misunderstood something, pdf format is very very very bad.

I agree that the current implementation does not match the specification:

The historical approach in pypdf is more permissive and as we are rewriting all the names with PdfWriter, there is no issue with readers. Moving to a full compliant solution may be introduce lots of regression that will be very tricky to detect (NameObjects are a sort of lazy linking) As long as this PR does not fix an existing issue I'm not fond of it.

@MartinThoma / @MasterOdin / @stefan6419846 can you give your opinions?

Apr 20 '24 07:04 pubpub-zz

I have to admit that it is hard to follow this for me. If I understand it correctly, there are some different interpretations of the PDF specs regarding name objects.

In my understanding, names are just an arbitrary list of bytes. An application has to ensure a consistent handling of them for matching purposes, but as they are mostly irrelevant to the user and primarily about the internal structure, there are no further rules except for predefined sequences, for example as part of dictionary keys.

The newly introduced test sequences from ISO 32000-2:2020, Table 4 pass with the old implementation as well, thus I do not see any real need for changing anything about this unless there is a real example where the current implementation fails and the new one does not.

Apr 20 '24 08:04 stefan6419846

to be more clear, here is two examples that can lead to problems hard to detect:

name value is modified after reading/writing operations:

bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/DocuSign#AE"), None).write_to_stream(bio)
print(bio.getvalue())  # > b'/DocuSign#C2#AE' != b"/DocuSign#AE"

all delimiter characters should be escaped, but it's not the case:

bio = BytesIO()
NameObject("/#()<>[]{}/%").write_to_stream(bio)
print(bio.getvalue())  # > b'/#23#28#29<>[]{}#2F#25'

Apr 29 '24 19:04 Rak424

@Rak424 can you provide actual PDF and code that shows some issues and not unitary. About the two examples you are providing,

even if the binary encoding are different, all names will be encoded in the same way and should match
the PDF specification states:

"Regular characters that are outside the range EXCLAMATION MARK(21h) (!) to TILDE (7Eh) (~) should be written using the hexadecimal notation." the current implementation is valid

Apr 30 '24 20:04 pubpub-zz

hi

about example 2, from spec:

c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.

All characters except the white-space characters and delimiters are referred to as regular characters.

It's not a huge problem since it's easy to any parser to handle it, but could be a problem with an extremely strict parser.

The main problem is on example 1: The original value is b'DocuSign\xae' who has no UTF8 representation and corresponds to 'DocuSign®' in latin1 The final value is b'DocuSign\xc2x\ae' who corresponds to 'DocuSign®' in utf8 and 'DocuSignÂ®' in latin1

So, it means that if you append a pdf, it will creates a modification on these kinds of values. On a signed pdf where you add revocation infos, it will create a modification even if you don't modify the document.

May 05 '24 07:05 Rak424

@Rak424 I see your example. However both b'DocuSign\xc2\xae' and b'DocuSign\xae' are translated into 'DocuSign®' NameObject. Again I agree that the 2 forms are binary different so could be considered as different whereas pypdf will consider them as identical, but I do not see actual use cases where both versions will be used in the same PDF. If a document uses version 2, all instances will be translated for version 1 and the document will be properly read.

If we do a change to keep the bytes internally, b'DocuSign\xae' will be stored, but without changing the arguments, NameObject('DocuSign®') will get internally b'DocuSign\xc2\xae' and matching will not work anymore. This will require lots of change and will break the API and induce lots of side effects in all programs. This is not acceptable

up to now you have not reported high level PDF that are failing. without I would consider your report as a acceptable deviation of pypdf to the PDF specification

May 05 '24 17:05 pubpub-zz

do you understand that b'DocuSign\xc2\xae' is not b'DocuSignx\ae' ?

May 05 '24 18:05 Rak424

I do if you consider a full compliance to the spec. however as said I do not consider that both can be met at the same time. Also we need a way to build a NameObject from an str that can match one or the other, or both.

Please provide a PDF file showing some issue

May 06 '24 20:05 pubpub-zz

pypdf pypdf copied to clipboard

BUG: NameObject doesn't handle all values correctly

Codecov Report

pypdf
pypdf copied to clipboard