pypdf
pypdf copied to clipboard
BUG: NameObject doesn't handle all values correctly
Name object encodes all characters from 1 to 255, so it's not correct to use utf-8.
7.3.5 Name objects
Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0).
@Rak424
on page 18 of PDF32000-1:2008 you will see
The 8bits only encoding will be done with Latin1
@Rak424 please provide a pdf with the issue you are trying to fix. without I can not consider your PR as acceptable
Hey @Rak424 and @pubpub-zz, We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter
performance. I think it's worth considering in merging this or not.
To obtain those, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 94.92%. Comparing base (
6d9a7ec
) to head (508c82c
). Report is 1 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #2601 +/- ##
==========================================
- Coverage 94.92% 94.92% -0.01%
==========================================
Files 50 50
Lines 8316 8309 -7
Branches 1667 1665 -2
==========================================
- Hits 7894 7887 -7
Misses 261 261
Partials 161 161
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance.
If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?
Hey @Rak424 and @pubpub-zz, We've been running the benches on this branch, and it seems these changes are significantly improving the
PDFWriter
performance. I think it's worth considering in merging this or not.
To obtain those, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.
The title identifies that this PR is dealing with an invalid handling. I'm concerned about getting the good result and preventing regressions. Are you dealing with decoding? Encoding?
If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?
It may seem small, but with over 4.5M downloads per month, I think the cumulated impact of such a change is significant.
The title identifies that this PR is dealing with an invalid handling. I'm concerned about getting the good result and preventing regressions. Are you dealing with decoding? Encoding?
Yes maybe this speedup means there is a regression somewhere. Overall the bench includes encoding afaik:
Sorry to not explain directly what I've done, but I've seen different problems to solve in that part of code and some cases where clearly badly handled, here is a simple example, but not the only one:
bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/#FF"), None).write_to_stream(bio)
print(bio.getvalue)
> b'/#C3#BF'
The way it was done was not optimal, so I've decided to try to refactor it and I'm not surprised about perf improvement.
A correct implementation should be able to do something like this, since from specs encoding is not defined on Name object itself, but I know that's not easy to make that change:
NameObject(b"hello")
NameObject(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c (%)')
utf-8 is only for the specific case of creating a name from a text, but doesn't allow to handle all possible values, so I've added a fallback with latin1 who handle all values on a byte.
I often need some days/week and multiple tries to reach an optimal solution, tell me if you see some problems or changes to do.
Sorry to not explain directly what I've done, but I've seen different problems to solve in that part of code and some cases where clearly badly handled, here is a simple example, but not the only one:
bio = BytesIO() NameObject.read_from_stream(BytesIO(b"/#FF"), None).write_to_stream(bio) print(bio.getvalue) > b'/#C3#BF'
b'/\xc3\xbf'.decode("utf-8")
returns properly ÿ which is equivalent to b'/\xff'.decode("latin-1")
for me there is no error.
The way it was done was not optimal, so I've decided to try to refactor it and I'm not surprised about perf improvement.
A correct implementation should be able to do something like this, since from specs encoding is not defined on Name object itself, but I know that's not easy to make that change:
NameObject(b"hello") NameObject(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c (%)')
utf-8 is only for the specific case of creating a name from a text, but doesn't allow to handle all possible values, so I've added a fallback with latin1 who handle all values on a byte.
I often need some days/week and multiple tries to reach an optimal solution, tell me if you see some problems or changes to do.
remember thatthe current code provides with NameObject.CHARSETS a way to add / remove charsets from the possible solutions and the default, based on other PRs uses the utf-8, gbk and then latin-1. all these cases come from actual PDFs. have you experienced any actual issues with some PDF?
b"/#FF"
corresponds to value b'\xff'
and b'/#C3#BF'
to value b'\xc3\xbf'
, that s not the same value and there is no encoding notion here, specs only says that in the specific case of when you try to use string value, you have to encode it with utf8, but utf8 doesn't allow to use all chars from 1 to 255, and current implementation of name object only allows str and not bytes in constructor, so my workaround is to fallback to an encoding who allows all values in one byte, latin1 seems a good choice for me. I don t know why gbk is present here but it seems useless. I don't have an example to show but the current implementation doesn't respect specifications, I'm not talking about handling wrong pdf. Don't hesitate to tell me if I've misunderstood something, pdf format is very very very bad.
b"/#FF"
corresponds to valueb'\xff'
andb'/#C3#BF'
to valueb'\xc3\xbf'
, that s not the same value and there is no encoding notion here, specs only says that in the specific case of when you try to use string value,
you have to encode it with utf8, but utf8 doesn't allow to use all chars from 1 to 255and current implementation of name object only allows str and not bytes in constructor.
NameObjects are actually string like : the proof is about using the encoding with the '#' character. The byte encoding is just a matter of file encoding.
so my workaround is to fallback to an encoding who allows all values in one byte, latin1 seems a good choice for me. I don t know why gbk is present here but it seems useless.
This fix was introduced years ago. I' trying some archeology to find it
I don't have an example to show but the current implementation doesn't respect specifications, I'm not talking about handling wrong pdf.
Don't hesitate to tell me if I've misunderstood something, pdf format is very very very bad.
I agree that the current implementation does not match the specification:
The historical approach in pypdf is more permissive and as we are rewriting all the names with PdfWriter, there is no issue with readers. Moving to a full compliant solution may be introduce lots of regression that will be very tricky to detect (NameObjects are a sort of lazy linking) As long as this PR does not fix an existing issue I'm not fond of it.
@MartinThoma / @MasterOdin / @stefan6419846 can you give your opinions?
I have to admit that it is hard to follow this for me. If I understand it correctly, there are some different interpretations of the PDF specs regarding name objects.
In my understanding, names are just an arbitrary list of bytes. An application has to ensure a consistent handling of them for matching purposes, but as they are mostly irrelevant to the user and primarily about the internal structure, there are no further rules except for predefined sequences, for example as part of dictionary keys.
The newly introduced test sequences from ISO 32000-2:2020, Table 4 pass with the old implementation as well, thus I do not see any real need for changing anything about this unless there is a real example where the current implementation fails and the new one does not.
to be more clear, here is two examples that can lead to problems hard to detect:
- name value is modified after reading/writing operations:
bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/DocuSign#AE"), None).write_to_stream(bio)
print(bio.getvalue()) # > b'/DocuSign#C2#AE' != b"/DocuSign#AE"
- all delimiter characters should be escaped, but it's not the case:
bio = BytesIO()
NameObject("/#()<>[]{}/%").write_to_stream(bio)
print(bio.getvalue()) # > b'/#23#28#29<>[]{}#2F#25'
@Rak424 can you provide actual PDF and code that shows some issues and not unitary. About the two examples you are providing,
- even if the binary encoding are different, all names will be encoded in the same way and should match
- the PDF specification states:
"Regular characters that are outside the range EXCLAMATION MARK(21h) (!) to TILDE (7Eh) (~) should be written using the hexadecimal notation." the current implementation is valid
hi
about example 2, from spec:
c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.
All characters except the white-space characters and delimiters are referred to as regular characters.
It's not a huge problem since it's easy to any parser to handle it, but could be a problem with an extremely strict parser.
The main problem is on example 1: The original value is b'DocuSign\xae' who has no UTF8 representation and corresponds to 'DocuSign®' in latin1 The final value is b'DocuSign\xc2x\ae' who corresponds to 'DocuSign®' in utf8 and 'DocuSign®' in latin1
So, it means that if you append a pdf, it will creates a modification on these kinds of values. On a signed pdf where you add revocation infos, it will create a modification even if you don't modify the document.
@Rak424 I see your example. However both b'DocuSign\xc2\xae' and b'DocuSign\xae' are translated into 'DocuSign®' NameObject. Again I agree that the 2 forms are binary different so could be considered as different whereas pypdf will consider them as identical, but I do not see actual use cases where both versions will be used in the same PDF. If a document uses version 2, all instances will be translated for version 1 and the document will be properly read.
If we do a change to keep the bytes internally, b'DocuSign\xae' will be stored, but without changing the arguments, NameObject('DocuSign®') will get internally b'DocuSign\xc2\xae' and matching will not work anymore. This will require lots of change and will break the API and induce lots of side effects in all programs. This is not acceptable
up to now you have not reported high level PDF that are failing. without I would consider your report as a acceptable deviation of pypdf to the PDF specification
do you understand that b'DocuSign\xc2\xae' is not b'DocuSignx\ae' ?
I do if you consider a full compliance to the spec. however as said I do not consider that both can be met at the same time. Also we need a way to build a NameObject from an str
that can match one or the other, or both.
Please provide a PDF file showing some issue