Incorrect parsing of REVERSE SOLIDUS in literal string

Open Greybird opened this issue 1 year ago • 0 comments

Reporting an Issue Here

When parsing some files, I noticed that some Info elements show incorrect values. For example, for this file, the Producer entry:

  • is shown by Acrobat as C48x Series (PDF - 300X300 dpi).
  • is parsed by PDFsharp as C48x Series (DF - 300X300 dpi) (the P is missing)

Expected Behavior

When parsing a literal string, if a REVERSE SOLIDUS is immediately followed by a character that is not listed in Table 3 of section 7.3.4.2 of ISO/DIS 32000-2, the REVERSE SOLIDUS should be ignored, but the following character should be kept.
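To make the expected rule concrete, here is a minimal sketch of a literal-string unescaper. It is not PDFsharp's Lexer code: the class `LiteralStringSketch` and its `Unescape` method are hypothetical names used for illustration, and the raw input in the usage note assumes the file stores the value with a `\P` sequence, which this report implies but does not show.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; this is NOT PDFsharp's Lexer.
public static class LiteralStringSketch
{
    // Decodes the body of a PDF literal string (the text between the outer parentheses).
    public static string Unescape(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c != '\\') { sb.Append(c); continue; }
            if (i + 1 >= raw.Length) break;   // trailing backslash: nothing left to escape

            char next = raw[++i];
            switch (next)
            {
                // Escape sequences listed in Table 3.
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '(': case ')': case '\\': sb.Append(next); break;

                // Backslash before an end-of-line marker: line continuation, emit nothing.
                case '\r':
                    if (i + 1 < raw.Length && raw[i + 1] == '\n') i++;
                    break;
                case '\n': break;

                default:
                    if (next >= '0' && next <= '7')
                    {
                        // Octal character code with one to three digits.
                        int value = next - '0';
                        int digits = 1;
                        while (digits < 3 && i + 1 < raw.Length
                               && raw[i + 1] >= '0' && raw[i + 1] <= '7')
                        {
                            value = value * 8 + (raw[++i] - '0');
                            digits++;
                        }
                        sb.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        // Not in Table 3: drop the backslash, keep the character.
                        sb.Append(next);
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, `LiteralStringSketch.Unescape(@"C48x Series (\PDF - 300X300 dpi)")` keeps the `P` and only drops the backslash. The behaviour reported in this issue corresponds to the final `else` branch appending nothing.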

Actual Behavior

When parsing a literal string, if a REVERSE SOLIDUS is immediately followed by a character that is not listed in Table 3 of section 7.3.4.2 of ISO/DIS 32000-2, the REVERSE SOLIDUS is ignored, and so is the following character.

Steps to Reproduce the Behavior

[Fact]
public void ReverseSolidus_with_invalid_following_character_should_be_ignored()
{
    using var doc = PdfReader.Open(@"Cover-letter-4098208.pdf");
    var producer = doc.Info.Producer;
    producer.Should().Be("C48x Series (PDF - 300X300 dpi)");
}

Expected producer to be "C48x Series (PDF - 300X300 dpi)" with a length of 31, but "C48x Series (DF - 300X300 dpi)" has a length of 30, differs near "DF " (index 13).

The issue is most probably linked to an open question about how to interpret the specification, as explained in this comment in Lexer.cs.

Greybird, Aug 20 '24 08:08