Not displaying some text fields in tax document

Open sshock opened this issue 10 months ago • 22 comments

SumatraPDF version

  • 3.5.2 and pre-release

Describe the bug

Several fields in the attached document do not show up.

To Reproduce

Steps to reproduce the behavior:

  1. Open the attached document.
  2. Notice several boxes are empty, including TRUSTEE, RECIPIENT'S TIN, Gross distribution, Distribution code, and Account number.

Expected behavior

All the boxes mentioned above should have values in them, e.g., Gross distribution should have a 1234.56. These fields all show up fine in every other PDF viewer I have tested, including Firefox and Chrome.

File that reproduces the problem

See attached f1099sa.pdf.

Screenshots

This screenshot shows Firefox on the left and SumatraPDF on the right, with areas highlighted in red to show where fields are missing.

Image

Additional context

It appears this document is using Widget annotations, and the ones that do not show up are the ones with an /AP (appearance stream).

f1099sa.pdf

sshock avatar Feb 01 '25 21:02 sshock

A) SumatraPDF does not handle Adobe's proprietary JavaScript-enhanced JetForms (XFA etc.), and many .gov forms are designed to be used only in Adobe-served readers, where results can be monitored and verified using scripts and security hashes. See the reasons below.

B) There are errors in that file that clear when the FDF is exported and reimported, so we then see what is valid AcroForm (not XFA form) data.

UNOFFICIAL COPY

PDF Producer: PDFsharp 6.1.1 (Original: Acrobat Distiller 21.0 (Windows))
PDF Version: 1.6

Image

OFFICIAL COPY

Modified: 2025-02-01 22:25:21
PDF Producer: Adobe LiveCycle Designer ES 9.0
PDF Version: 1.7
Fast Web View: Yes
PDF Optimizations: Tagged PDF
Number of Pages: 5

Image

Image

Image

b) Filled on an Adobe scripted platform

OFFICIAL COPY

Modified: 2025-02-01 22:25:21
PDF Producer: Adobe LiveCycle Designer ES 9.0
PDF Version: 1.7
Fast Web View: Yes
PDF Optimizations: Tagged PDF
Number of Pages: 5

Image

ALL DATA ACCOUNTED FOR: no issue displaying contents written by an Adobe server.

Image

GitHubRulesOK avatar Feb 01 '25 22:02 GitHubRulesOK

Interesting. I didn't notice any JavaScript when I analyzed it, even after extracting and uncompressing all the streams.

Also, I've discovered the fields do show up if I remove the /AP (appearance stream) dictionaries. See attached document where I blanked out the contents of the /AP dictionaries and how SumatraPDF is now able to view all the fields.

So it seems like this could be relatively easy to fix. Why give up so quickly?

Image

f1099s - fixed.pdf

sshock avatar Feb 01 '25 23:02 sshock

Hints on how to bulk fill Government XFA

Open the FDF, which loads the PDF for trusting now and in bulk.

Image

This should trigger the XFA extended features, which for this form seem to be XFDF only!

Image

iText (before their takeover) advised "flattening" fields in private forms using their commercial products, which is probably why PDFsharp was used in 2021 to break that file. However, as "collateral damage", such files are not official, so you risk departmental scrutiny as to why they are now a poor submission.

When you import the XFDF it will AutoFill the form and you can programmatically save the new PDF or transmit the XFDF without a user necessarily knowing.

Image

GitHubRulesOK avatar Feb 01 '25 23:02 GitHubRulesOK

Try it: here is the XFDF, so place it in the same folder as an approved blank 1099-SA.

f1099sa.zip

For FDF it would be a simple drag and drop; for XML/XMP you need to import into the Acrobat Reader form. There are easier methods, but MuPDF does not attempt to support XFA, as it is deprecated and Adobe-licensed rather than open ISO PDF. To be avoided at all costs.

GitHubRulesOK avatar Feb 01 '25 23:02 GitHubRulesOK

This wasn't a form I made. I downloaded it from my HSA provider. (I just modified it to replace my personal info with fake data before attaching it to this bug.)

The goal is that I and others be able to view tax forms provided by this major HSA provider.

sshock avatar Feb 02 '25 00:02 sshock

I have been through this hoop elsewhere: the forms were not open-source PDF but Adobe-licensed XFA. Some less scrupulous "template" sites (and there are hundreds of them) will offer crippled copies to service providers as if they were official, when it is very clear they are not likely to be sanctioned. So first ask the IRS / .gov etc. whether the format is acceptable.

If you are required to file Form 5498-SA, you must provide a statement to the participant (generally Copy B) by June 2, 2025. You may, but you are not required to, provide the participant with a statement of the December 31, 2024, FMV of the participant's account by January 31, 2025. For more information about statements to participants, see part M in the 2024 General Instructions for Certain Information Returns.

The Taxpayer First Act of 2019 authorized the Department of the Treasury and the IRS to issue regulations that reduce the 250-return e-file threshold. T.D. 9972, published February 23, 2023, lowered the e-file threshold to 10 (calculated by aggregating all information returns), effective for information returns required to be filed on or after January 1, 2024. Go to IRS.gov/InfoReturn for e-file options.

Since it is the provider who must comply, the question is then: are you acting on behalf of 10 or more recipients?

They will possibly say to use a printer and submit it as a paper print, not as a recognised form. You can read the instructions and download a valid PDF at https://www.irs.gov/instructions/i1099sa#en_US_2024_publink100044357

GitHubRulesOK avatar Feb 02 '25 00:02 GitHubRulesOK

I just want to be able to view and maybe print these forms, and SumatraPDF has been my favorite PDF reader for many years now.

I'm not acting on behalf of anyone else. I only mention this form comes from a major HSA provider to indicate that likely many other SumatraPDF users will have trouble viewing their tax documents.

If SumatraPDF is able to view most Widget annotations, just not ones with an /AP appearance stream, why is that not something we want to look into fixing?

sshock avatar Feb 02 '25 01:02 sshock

There is a lot more to the problems in that part of the standards. As stated, XFA-sourced PDFs are not valid against Adobe's published standard but serve their in-house commercial users. When XFA is flattened to the "ISO Public Open Standard", some AcroForm components (not all) may have been correctly edited (as you say, remove the non-working appearance and the fallback is seen). However, those are just the tip of an iceberg of related issues about rich text, which MuPDF generally does not support at present.

SumatraPDF cannot strictly "edit" an existing entry, but if it is a standard, valid "Comments" (not form) widget, it may be able to replace it with one that has a new appearance; hence it can move comment text boxes etc. (but may degrade some that include Unicode).

GitHubRulesOK avatar Feb 02 '25 01:02 GitHubRulesOK

Interestingly, this document shows up fine in older versions of SumatraPDF (3.1.2 and earlier).

sshock avatar Feb 02 '25 04:02 sshock

It seems most of your hesitation about looking into this stems from believing this document uses XFA; however, I'm pretty confident it's using AcroForm.

You can see the document catalog has an /AcroForm object 44 0, which contains the /Fields array with all the field object references.

And of particular importance, the AcroForm dictionary contains this entry:

/NeedAppearances true

If I remove that (it defaults to false), all PDF readers exhibit the same problem as SumatraPDF.

The spec describes NeedAppearances as:

A flag specifying whether to construct appearance streams and appearance dictionaries for all widget annotations in the document (see 12.7.3.3, “Variable Text”).

It's not surprising the fields show up empty if NeedAppearances is false or missing (or unsupported), because these fields' appearance streams are practically empty (they contain little more than /Tx BMC EMC).
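To illustrate what "practically empty" means here, below is a minimal sketch in C that checks whether an already-decompressed content stream consists of nothing but an empty /Tx marked-content section. The helper names (skip_ws, eat_token, ap_stream_draws_nothing) are hypothetical and are not part of MuPDF; the sketch also assumes tokens are separated by ASCII whitespace, which is a simplification of real PDF tokenizing.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Return a pointer past any leading ASCII whitespace. */
static const char *skip_ws(const char *s)
{
	while (*s && isspace((unsigned char)*s))
		s++;
	return s;
}

/* If the next whitespace-delimited token equals tok, advance *s past it
   and return 1; otherwise leave *s untouched and return 0. */
static int eat_token(const char **s, const char *tok)
{
	size_t n = strlen(tok);
	const char *p = skip_ws(*s);
	if (strncmp(p, tok, n) != 0)
		return 0;
	/* The token must end at whitespace or at the end of the input. */
	if (p[n] != '\0' && !isspace((unsigned char)p[n]))
		return 0;
	*s = p + n;
	return 1;
}

/* Hypothetical helper: return 1 if the stream is exactly "/Tx BMC EMC"
   (an empty marked-content section that paints nothing), else 0. */
int ap_stream_draws_nothing(const char *content)
{
	assert(content != NULL);
	const char *p = content;
	if (!eat_token(&p, "/Tx") || !eat_token(&p, "BMC") || !eat_token(&p, "EMC"))
		return 0;
	return *skip_ws(p) == '\0';
}
```

A stream such as "/Tx BMC BT (1234.56) Tj ET EMC" would not match, since it has text operators between BMC and EMC.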

So my conclusion is:

  1. This has nothing to do with XFA; it's an AcroForm.
  2. The AcroForm dictionary has /NeedAppearances true
  3. MuPDF currently lacks support for NeedAppearances.

sshock avatar Feb 03 '25 00:02 sshock

OK, let's presume it counts as a regression from 3.1.2, when MuPDF's behaviour was different. I can reopen on the basis that your own copy is a PDF from a crippled XFA, but that does not mean MuPDF has to work with any such content.

I will tag it as a MuPDF difference, but it may end up as a "won't fix".

@kjk over to you !

Image

The misbehaving form clearly works when saved out from Acrobat software such as Adobe Reader.

Image

GitHubRulesOK avatar Feb 03 '25 00:02 GitHubRulesOK

I see that NeedAppearances is deprecated in PDF 2.0.

Image

However, there are still a lot of %PDF-1.x documents out there, so we probably want to keep supporting it and fix this regression.

I have verified that the SumatraPDF 3.1.2 code supported this flag; maybe it just got removed by accident due to the amount of refactoring that happened then. Perhaps adding it back in would be really easy.

sshock avatar Feb 03 '25 06:02 sshock

This patch fixes the regression, though I don't know if I did it in the appropriate way.

diff --git a/mupdf/include/mupdf/pdf/name-table.h b/mupdf/include/mupdf/pdf/name-table.h
index 598f58d87..8da932952 100644
--- a/mupdf/include/mupdf/pdf/name-table.h
+++ b/mupdf/include/mupdf/pdf/name-table.h
@@ -367,6 +367,7 @@ PDF_MAKE_NAME("N", N)
 PDF_MAKE_NAME("Name", Name)
 PDF_MAKE_NAME("Named", Named)
 PDF_MAKE_NAME("Names", Names)
+PDF_MAKE_NAME("NeedAppearances", NeedAppearances)
 PDF_MAKE_NAME("NewWindow", NewWindow)
 PDF_MAKE_NAME("Next", Next)
 PDF_MAKE_NAME("NextPage", NextPage)
diff --git a/mupdf/source/pdf/pdf-appearance.c b/mupdf/source/pdf/pdf-appearance.c
index ab075994a..d4c7c77ff 100644
--- a/mupdf/source/pdf/pdf-appearance.c
+++ b/mupdf/source/pdf/pdf-appearance.c
@@ -3559,6 +3559,13 @@ retry_after_repair:
 				local_synthesis = 1;
 		}
 
+		/* Need to reconstruct appearance streams on all widgets if NeedAppearances is true */
+		if (subtype == PDF_NAME(Widget))
+		{
+			if (ap_n && pdf_to_bool(ctx, pdf_dict_getl(ctx, pdf_trailer(ctx, annot->page->doc), PDF_NAME(Root), PDF_NAME(AcroForm), PDF_NAME(NeedAppearances), NULL)))
+				local_synthesis = 1;
+		}
+
 		/* We need to put this appearance stream back into the document. */
 		needs_resynth = pdf_annot_needs_resynthesis(ctx, annot);
 		if (needs_resynth)

sshock avatar Feb 03 '25 06:02 sshock

With the fix in place, all the fields show up and look great, except for the TRUSTEE company name and address, which has extra line spacing that shouldn't be there:

SumatraPDF 3.1.2 and all other PDF viewers display it correctly. I think the problem with current SumatraPDF is that it treats \r\n as two newlines instead of one.

Image

sshock avatar Feb 03 '25 06:02 sshock

For this other regression with the extra newlines, I can see that the old MuPDF code in SumatraPDF 3.1.2 had logic to treat \r as a newline only when not followed by a \n as seen with this code from pdf_append_line():

			if (*end == '\n' || *end == '\r' && *(end + 1) != '\n')
				break;

In contrast, looking at the current code, I see no such logic in the break_string() method, or the write_string_with_quadding() that calls it.

I'm not exactly sure how to fix this one but if I get time I may take a stab at it...

sshock avatar Feb 03 '25 07:02 sshock

Here's a new patch that fixes both issues:

diff --git a/mupdf/include/mupdf/pdf/name-table.h b/mupdf/include/mupdf/pdf/name-table.h
index 598f58d87..a1b447a9f 100644
--- a/mupdf/include/mupdf/pdf/name-table.h
+++ b/mupdf/include/mupdf/pdf/name-table.h
@@ -367,6 +367,7 @@ PDF_MAKE_NAME("N", N)
 PDF_MAKE_NAME("Name", Name)
 PDF_MAKE_NAME("Named", Named)
 PDF_MAKE_NAME("Names", Names)
+PDF_MAKE_NAME("NeedAppearances", NeedAppearances)
 PDF_MAKE_NAME("NewWindow", NewWindow)
 PDF_MAKE_NAME("Next", Next)
 PDF_MAKE_NAME("NextPage", NextPage)
diff --git a/mupdf/source/pdf/pdf-appearance.c b/mupdf/source/pdf/pdf-appearance.c
index ab075994a..ea9482c5e 100644
--- a/mupdf/source/pdf/pdf-appearance.c
+++ b/mupdf/source/pdf/pdf-appearance.c
@@ -1906,6 +1906,9 @@ write_string_with_quadding(fz_context *ctx, fz_buffer *buf,
 				write_string(ctx, buf, lang, font, fontname, size, a, b-1);
 			else
 				write_string(ctx, buf, lang, font, fontname, size, a, b);
+			// If \r followed by \n, skip the \n; \r\n is a single newline not two.
+			if (b[-1] == '\r' && b[0] == '\n')
+				++b;
 			a = b;
 			px = x;
 		}
@@ -2043,6 +2046,9 @@ layout_string_with_quadding(fz_context *ctx, fz_layout_block *out,
 				layout_string(ctx, out, lang, font, size, xorig+x, y, a, b);
 				add_line_at_end = 0;
 			}
+			// If \r followed by \n, skip the \n; \r\n is a single newline not two.
+			if (b[-1] == '\r' && b[0] == '\n')
+				++b;
 			a = b;
 			y -= lineheight;
 		}
@@ -3559,6 +3565,13 @@ retry_after_repair:
 				local_synthesis = 1;
 		}
 
+		/* Need to reconstruct appearance streams on all widgets if NeedAppearances is true */
+		if (subtype == PDF_NAME(Widget))
+		{
+			if (ap_n && pdf_to_bool(ctx, pdf_dict_getl(ctx, pdf_trailer(ctx, annot->page->doc), PDF_NAME(Root), PDF_NAME(AcroForm), PDF_NAME(NeedAppearances), NULL)))
+				local_synthesis = 1;
+		}
+
 		/* We need to put this appearance stream back into the document. */
 		needs_resynth = pdf_annot_needs_resynthesis(ctx, annot);
 		if (needs_resynth)

sshock avatar Feb 03 '25 07:02 sshock

Should I submit a PR?

sshock avatar Mar 01 '25 22:03 sshock

@sshock

Great work, and it could well be of value as a PR, but I suspect it may be a while before review.

Fixing the \r\n issue alone would be useful for text annotations, as that is a long-standing bug I opened.

I can't do anything to authorise it, but I could probably test that it works well on a 32-bit recompile. However, @kjk currently seems to be otherwise well occupied, so an updated pre-release will be slow to appear.

GitHubRulesOK avatar Mar 02 '25 01:03 GitHubRulesOK

@sshock

P.S. I should have said: if these are core fixes to current MuPDF code, they should ideally be raised upstream with Artifex so they cascade down to SumatraPDF when it updates from MuPDF.

GitHubRulesOK avatar Mar 02 '25 02:03 GitHubRulesOK

@sshock Looking closer, I suggest you treat these as 2 separate MuPDF issues.

The MuPDF editor will replace the Enter key on Windows as (This is a text...\nThis is a newline); in other words, they use a literal \n and adapt the text appearance from one line to 2, so it may not be as simple as skipping the \n.

I also seem to remember that, of the 3 types, it is oddly \n (an illogical vertical shift without a return) that represents the LF line break, or 0x0A in hexadecimal notation. Adobe mandates the Linux \n, whilst traditionally Mac used \r in many places, but macOS, starting with Mac OS X 10.0, only uses LF.

Secondly, the PDF/A overview about appearances is, I think, that they should have been supplied before render! Thus perhaps no longer the reader's responsibility? (I would need to check that out.) So that may also be a potential MuPDF "wontfix" response. The writer of the fields should have done its task correctly?

GitHubRulesOK avatar Mar 03 '25 16:03 GitHubRulesOK

> @sshock Looking closer, I suggest you treat these as 2 separate MuPDF issues.

Sounds good. I will try to create a couple PRs when I get a chance later.

> The MuPDF editor will replace the Enter key on Windows as (This is a text...\nThis is a newline); in other words, they use a literal \n

Yeah, this makes sense as the PDF spec defines several escape sequences that can be used in a literal string object, including \r and \n.

> and adapt the text appearance from one line to 2, so it may not be as simple as skipping the \n.

Note that by the time it reaches this code, the string object has already been parsed, so any \r escape sequence has already been turned into a carriage return (byte 0x0D) and any \n escape sequence into a line feed (byte 0x0A).
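As a rough illustration of that parsing step, here is a minimal sketch in C that decodes a subset of the literal-string escapes into raw bytes. The function name decode_escapes is hypothetical and this is not MuPDF's parser; it handles only \n, \r, \t, \\, \( and \), and deliberately ignores octal escapes, \b, \f and line continuations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode a subset of PDF literal-string escapes
   from in into out (at most out_size - 1 bytes plus a terminating NUL).
   Returns the number of bytes written, not counting the NUL. */
size_t decode_escapes(const char *in, char *out, size_t out_size)
{
	assert(in != NULL && out != NULL && out_size > 0);
	size_t n = 0;
	while (*in && n + 1 < out_size)
	{
		char c = *in++;
		if (c == '\\' && *in)
		{
			char e = *in++;
			switch (e)
			{
			case 'n': c = '\n'; break; /* line feed, 0x0A */
			case 'r': c = '\r'; break; /* carriage return, 0x0D */
			case 't': c = '\t'; break;
			default: c = e; break; /* \\, \( and \): drop the backslash */
			}
		}
		out[n++] = c;
	}
	out[n] = '\0';
	return n;
}
```

With this sketch, the six characters A\r\nB in the PDF source become the four bytes 'A', 0x0D, 0x0A, 'B', which is the form the line-breaking code then sees.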

When break_string encounters a \r or \n, it breaks out, so each one causes a new line. Skipping a \n that comes immediately after a \r should work perfectly so that \r\n only causes one newline.

> I also seem to remember that, of the 3 types, it is oddly \n (an illogical vertical shift without a return) that represents the LF line break, or 0x0A in hexadecimal notation. Adobe mandates the Linux \n, whilst traditionally Mac used \r in many places, but macOS, starting with Mac OS X 10.0, only uses LF.

Yeah, DOS/Windows have always used \r\n and older macOS used \r, then switched to \n with OS X since it's now UNIX-based, and of course UNIX and Linux have always used only \n.

The good news is, the way I implemented the fix it will handle all 3 types well; any \r or \n will be treated as a newline, but if the \r is followed by a \n, it is treated as a single newline not two.
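The rule described above can be sketched on its own in C, outside MuPDF. The names next_line and count_lines are hypothetical and this is not the code from the patch; it only demonstrates that treating \r, \n and \r\n each as a single terminator gives the same line count for all three conventions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: store the length of the line starting at s
   (terminator excluded) in *len, and return a pointer to the start of
   the next line, or NULL if s was the last line. */
const char *next_line(const char *s, size_t *len)
{
	assert(s != NULL && len != NULL);
	const char *p = s;
	while (*p && *p != '\r' && *p != '\n')
		p++;
	*len = (size_t)(p - s);
	if (*p == '\0')
		return NULL;
	/* \r immediately followed by \n is one terminator, not two. */
	if (p[0] == '\r' && p[1] == '\n')
		p++;
	return p + 1;
}

/* Count lines, so that "a\r\nb", "a\rb" and "a\nb" all give 2. */
size_t count_lines(const char *s)
{
	size_t n = 0, len;
	while (s)
	{
		s = next_line(s, &len);
		n++;
	}
	return n;
}
```

Note that in this sketch a trailing terminator is followed by one final empty line, and \n followed by \r still counts as two terminators.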

> Secondly, the PDF/A overview about appearances is, I think, that they should have been supplied before render!

Agree, it seems like a crippled PDF to not supply its own appearances, but for whatever reason we can see the spec purposely allowed that, as the whole purpose of /NeedAppearances true is to request the viewer to construct the appearance streams.

> Thus perhaps no longer the reader's responsibility? (I would need to check that out.) The writer of the fields should have done its task correctly?

But that's exactly what /NeedAppearances true means. It means it is the reader's responsibility "to construct appearance streams and appearance dictionaries for all widget annotations in the document".

> So that may also be a potential MuPDF "wontfix" response.

Hopefully they will accept it considering:

  • The spec seems pretty clear
  • Older versions of MuPDF did support it, so it's a regression
  • The fix is easy and short and I am providing the fix in the PR

I think the only argument for "wontfix" will be that NeedAppearances has been deprecated in PDF 2.0. But I think there are still a lot of PDF 1.x documents out there, so why would we want them to be broken when there is an easy fix ready to go?

sshock avatar Mar 04 '25 04:03 sshock

My comments are based only on the observation that MuPDF already works, so there is no need to fix MuPDF. Another fork was proposed to adapt the MuPDF code in a different way for similar reasons. AFAIK it is the Windows TextBox editor that does the damage, as MS insist it be their keyboard TextBox method in VS. I suspect the Visual Studio interface is the issue: I tried different string-replacement approaches at the SumatraPDF entry point and failed, but I don't know C++ well enough to switch that to a rich "binary" text box entry. Ideally there should be no need to change the master MuPDF code (causing other failures) unless it is proven to be the cause of the issue.

As a former Standards Compliance Officer, I understand there are many conflicts between writer and reader interpretations of the woolly Adobe guidance (the PDF specification), now an ISO standard. About a week ago I saw a change in a PDF/A ruling on the status of signature and other field visibility, but can't recollect where :-). Most likely it was a correction to the interpretation of the currently active ISO standard.

I am the messenger, not a ruling authority. However, look at the response to a similar complaint about fields raised with Artifex:

The result of drawing nothing to the page stream is that there is nothing drawn on the page. Your AcroForm has an Appearance stream which specifically draws nothing on the page. Some PDF consumers (specifically Adobe Acrobat) seem to generally ignore the appearance stream, and generate a new one from the other information in the Widget. We feel that if an application embeds an Appearance stream then we should honour that; the application knows how it wanted the appearance to look and we should not decide we know better and create a different one.

GitHubRulesOK avatar Mar 04 '25 14:03 GitHubRulesOK