ColumnText + Arabic Reshaping causes arabic characters to no longer appear. Removing column text makes characters appear.

Open bfryer-snap opened this issue 1 year ago • 12 comments

Describe the bug

Arabic reshaping causes some characters not to be rendered in the PDF with certain fonts. If I do not use ColumnText, the characters appear.

To Reproduce

We can use a modified version of the RightToLeft.java example to show the issue:

Here it is working:

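// Imports for the modified RightToLeft.java example (the enclosing class is omitted here):
import java.awt.Color;
import java.io.FileOutputStream;

import com.lowagie.text.Chunk;
import com.lowagie.text.Document;
import com.lowagie.text.Element;
import com.lowagie.text.Font;
import com.lowagie.text.PageSize;
import com.lowagie.text.pdf.BaseFont;
import com.lowagie.text.pdf.ColumnText;
import com.lowagie.text.pdf.PdfContentByte;
import com.lowagie.text.pdf.PdfWriter;

    public static void main(String[] args) {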
        try {
            // step 1
            Document document = new Document(PageSize.A4, 50, 50, 50, 50);
            // step 2
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("sample.pdf"));
            // step 3
            document.open();
            // step 4
            PdfContentByte cb = writer.getDirectContent();

            // Font can be found here:
            // https://fonts.google.com/noto/specimen/Noto+Sans+Arabic?sort=popularity&subset=arabic
            BaseFont bf = BaseFont.createFont("NotoSansArabic-regular.ttf", BaseFont.IDENTITY_H, true);

            ColumnText ct = new ColumnText(cb);
            ct.setSimpleColumn(100, 100, 500, 800, 24, Element.ALIGN_LEFT);
            ct.setSpaceCharRatio(PdfWriter.NO_SPACE_CHAR_RATIO);
            ct.setLeading(0, 1);
            ct.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
            ct.setAlignment(Element.ALIGN_CENTER);
            ct.addText(new Chunk(ar1, new Font(bf, 16)));
            ct.addText(new Chunk(ar2, new Font(bf, 16, Font.NORMAL, Color.red)));
            ct.go();
            ct.setAlignment(Element.ALIGN_JUSTIFIED);
            ct.addText(new Chunk(ar3, new Font(bf, 12)));
            ct.go();
            ct.setAlignment(Element.ALIGN_CENTER);
            ct.addText(new Chunk(ar4, new Font(bf, 14)));
            ct.go();

            // step 5
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * arabic text
     */
    public static String ar1 = "\u0623\u0648\u0631\u0648\u0628\u0627, \u0628\u0631\u0645\u062c\u064a\u0627\u062a "
            + "\u0627\u0644\u062d\u0627\u0633\u0648\u0628 + \u0627\u0646\u062a\u0631\u0646\u064a\u062a :\n\n";
    /**
     * arabic text
     */
    public static String ar2 = "\u062a\u0635\u0628\u062d \u0639\u0627\u0644\u0645\u064a\u0627 \u0645\u0639 "
            + "\u064a\u0648\u0646\u064a\u0643\u0648\u062f\n\n";
    /**
     * arabic text
     */
    public static String ar3 = "\u062a\u0633\u062c\u0651\u0644 \u0627\u0644\u0622\u0646 \u0644\u062d\u0636\u0648\u0631 "
            + "\u0627\u0644\u0645\u0624\u062a\u0645\u0631 \u0627\u0644\u062f\u0648\u0644\u064a "
            + "\u0627\u0644\u0639\u0627\u0634\u0631 \u0644\u064a\u0648\u0646\u064a\u0643\u0648\u062f, "
            + "\u0627\u0644\u0630\u064a \u0633\u064a\u0639\u0642\u062f \u0641\u064a 10-12 \u0622\u0630\u0627\u0631 "
            + "1997 \u0628\u0645\u062f\u064a\u0646\u0629 \u0645\u0627\u064a\u0646\u062a\u0633, "
            + "\u0623\u0644\u0645\u0627\u0646\u064a\u0627. \u0648\u0633\u064a\u062c\u0645\u0639 "
            + "\u0627\u0644\u0645\u0624\u062a\u0645\u0631 \u0628\u064a\u0646 \u062e\u0628\u0631\u0627\u0621 "
            + "\u0645\u0646  \u0643\u0627\u0641\u0629 \u0642\u0637\u0627\u0639\u0627\u062a "
            + "\u0627\u0644\u0635\u0646\u0627\u0639\u0629 \u0639\u0644\u0649 \u0627\u0644\u0634\u0628\u0643\u0629 "
            + "\u0627\u0644\u0639\u0627\u0644\u0645\u064a\u0629 \u0627\u0646\u062a\u0631\u0646\u064a\u062a "
            + "\u0648\u064a\u0648\u0646\u064a\u0643\u0648\u062f, \u062d\u064a\u062b \u0633\u062a\u062a\u0645, "
            + "\u0639\u0644\u0649 \u0627\u0644\u0635\u0639\u064a\u062f\u064a\u0646 "
            + "\u0627\u0644\u062f\u0648\u0644\u064a \u0648\u0627\u0644\u0645\u062d\u0644\u064a \u0639\u0644\u0649 "
            + "\u062d\u062f \u0633\u0648\u0627\u0621 \u0645\u0646\u0627\u0642\u0634\u0629 \u0633\u0628\u0644 "
            + "\u0627\u0633\u062a\u062e\u062f\u0627\u0645 \u064a\u0648\u0646\u0643\u0648\u062f  \u0641\u064a "
            + "\u0627\u0644\u0646\u0638\u0645 \u0627\u0644\u0642\u0627\u0626\u0645\u0629 "
            + "\u0648\u0641\u064a\u0645\u0627 \u064a\u062e\u0635 "
            + "\u0627\u0644\u062a\u0637\u0628\u064a\u0642\u0627\u062a "
            + "\u0627\u0644\u062d\u0627\u0633\u0648\u0628\u064a\u0629, \u0627\u0644\u062e\u0637\u0648\u0637, "
            + "\u062a\u0635\u0645\u064a\u0645 \u0627\u0644\u0646\u0635\u0648\u0635  "
            + "\u0648\u0627\u0644\u062d\u0648\u0633\u0628\u0629 \u0645\u062a\u0639\u062f\u062f\u0629 "
            + "\u0627\u0644\u0644\u063a\u0627\u062a.\n\n";
    /**
     * arabic text
     */
    public static String ar4 = "\u0639\u0646\u062f\u0645\u0627 \u064a\u0631\u064a\u062f "
            + "\u0627\u0644\u0639\u0627\u0644\u0645 \u0623\u0646 \u064a\u062a\u0643\u0644\u0651\u0645, "
            + "\u0641\u0647\u0648 \u064a\u062a\u062d\u062f\u0651\u062b \u0628\u0644\u063a\u0629 "
            + "\u064a\u0648\u0646\u064a\u0643\u0648\u062f\n\n";

Output: Screenshot 2023-08-10 at 11 41 24 AM

If we leave everything exactly the same as above but change "NotoSansArabic-regular.ttf" to a different font, such as "GraphikArabic-Regular.ttf", we get the following output: Screenshot 2023-08-10 at 11 43 03 AM

The problem is easiest to see in the lower-left section of the main paragraph. With the NotoSansArabic font, there is a word that spans multiple characters. With the GraphikArabic font, the right half of that word is missing and only the last two characters seem to remain.

A specific character that seems to be rendered by NotoSansArabic and not GraphikArabic is \u0627.

I thought that GraphikArabic was missing the \u0627 character altogether, but with the following code I can generate it just fine:

    public static void main(String[] args) {
        try {
            // step 1
            Document document = new Document(PageSize.A4, 50, 50, 50, 50);
            // step 2
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("sample.pdf"));
            // step 3
            document.open();
            // step 4
            PdfContentByte cb = writer.getDirectContent();
            BaseFont bf = BaseFont.createFont("GraphikArabic-regular.ttf", BaseFont.IDENTITY_H, true);
            Font font = new Font(bf);
            document.add(new Paragraph("\u0627   \u0627   \u0627", font));

            // step 5
            document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Output, as expected: Screenshot 2023-08-10 at 11 47 13 AM

System.out.println(bf.charExists('\u0627')); also outputs true when using the GraphikArabic font. I assume that BaseFont::charExists(char) is the way to determine if the given char should show on the PDF.

I believe the issue is that characters like \u0627 are being reshaped into quite different characters in a different Unicode block, and that the font does not support the reshaped characters. I believe this because, when debugging, I can see that some characters such as 0x0627 become 0x0FE8E. This transformation happens here: https://github.com/LibrePDF/OpenPDF/blob/master/openpdf/src/main/java/com/lowagie/text/pdf/BidiLine.java#L197
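The block difference described above can be checked with the JDK alone. The following is a minimal sketch, not OpenPDF code: `ReshapeInspector` and its methods are hypothetical names introduced here for illustration. It only reports whether a code point lies in one of the Arabic Presentation Forms blocks, which is where a reshaped value such as 0xFE8E lives, while 0x0627 lies in the basic Arabic block.

```java
// Hypothetical helper, JDK only; it does not call any OpenPDF API.
final class ReshapeInspector {

    /** True if the code point is in Arabic Presentation Forms-A or Forms-B. */
    static boolean isPresentationForm(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    /** Counts how many code points of the text are presentation forms. */
    static long countPresentationForms(String text) {
        return text.codePoints().filter(ReshapeInspector::isPresentationForm).count();
    }
}
```

Running `countPresentationForms` on the string captured in the debugger after reshaping would show how many characters left the 0x0600 to 0x06FF range that the font is known to cover.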

Expected behavior

I expect ColumnText and adding elements to a document directly to have the same output OR I expect to be able to skip the "reshaping" process so that I can continue to use a font which supports the 0x0600 to 0x06FF character range.

Screenshots

Screenshots added above.

bfryer-snap avatar Aug 10 '23 18:08 bfryer-snap

Thank you for reporting this bug. Please submit a pull request with a solution to this problem if you can.

andreasrosdal avatar Feb 14 '24 18:02 andreasrosdal

Neither 0x0627 nor 0x0FE8E can be found in BidiLine.java

Seems to be a problem with the commercial font GraphikArabic.

vk-github18 avatar Mar 14 '24 19:03 vk-github18

May be related to #938

asturio avatar Mar 16 '24 19:03 asturio

See also https://github.com/LibrePDF/OpenPDF/wiki/Accents,-DIN-91379,-non-Latin-scripts

vk-github18 avatar May 01 '24 11:05 vk-github18

Hi @vk-github18, I was able to print 0x0627 with GraphikArabic, but I could not print 0x0FE8E. It seems that the library converts characters from form A to form B based on the surrounding characters, in ArabicLigaturizer.java. We can see that 0x0627 is defined here: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L100

We can also see that 0x0627 is included in a row of charTable: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L132

And we can see that charTable is used in two methods that seem to perform some sort of transformation: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L212-L253

So the question is:

  • When a font does not support 100% of the Arabic character set but does include the "base" character set (I am not familiar enough with Arabic character sets to use the proper terms), should the library still try to convert a character into something that the font does not support?

Or even more generally speaking:

  • Should the OpenPDF library transform the given characters into another set of characters even if the font does not support the resulting characters?

An ArabicLigaturizer transformation bypass flag/option would leave it up to the client to know the limitations of their font. At the moment, the only option is to switch fonts entirely.
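One way such a bypass could behave is sketched below, using only the JDK. This is an assumption-laden illustration, not an existing OpenPDF option: `ReshapeFallback` and `fallBackToBase` are hypothetical names, and the `IntPredicate` stands in for a font coverage check such as `BaseFont::charExists`. For each reshaped code point the font lacks, it folds the character back to its base letter(s) with NFKC normalization, but only if the font covers those base letters.

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

// Hypothetical sketch; fontHasGlyph is a stand-in for a real font lookup.
final class ReshapeFallback {

    /** Replaces unsupported shaped code points with their NFKC base form when the font covers it. */
    static String fallBackToBase(String shaped, IntPredicate fontHasGlyph) {
        StringBuilder out = new StringBuilder(shaped.length());
        shaped.codePoints().forEach(cp -> {
            if (fontHasGlyph.test(cp)) {
                out.appendCodePoint(cp); // the font can draw the shaped form as-is
                return;
            }
            String original = new String(Character.toChars(cp));
            // NFKC maps a presentation form such as U+FE8E back to U+0627.
            String base = Normalizer.normalize(original, Normalizer.Form.NFKC);
            boolean baseSupported = base.codePoints().allMatch(fontHasGlyph);
            out.append(baseSupported ? base : original);
        });
        return out.toString();
    }
}
```

The trade-off is that base letters will most likely be drawn in their isolated shapes, so the text is visible but not correctly joined. That is a degraded result rather than a correct layout.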

bfryer-snap avatar May 01 '24 16:05 bfryer-snap

What is the result if you use

    import com.lowagie.text.pdf.LayoutProcessor;
    ...
    LayoutProcessor.enableKernLiga();

as explained in https://github.com/LibrePDF/OpenPDF/wiki/Accents,-DIN-91379,-non-Latin-scripts ?

vk-github18 avatar May 01 '24 19:05 vk-github18

To your question: if the commercial font you used does not support Arabic properly, you should open an issue with the vendor of the font. There is not much a library can do if the font is not correct. Introducing special handling for incomplete or incorrect fonts is not the way to go. The transformations for Arabic script are mandatory.

vk-github18 avatar May 01 '24 19:05 vk-github18

For a third option see https://github.com/LibrePDF/OpenPDF/wiki/Multi-byte-character-language-support-with-TTF-fonts

vk-github18 avatar May 04 '24 11:05 vk-github18

In my naive view, if you have a glyph substitution, you must replace "existing" glyph(s) with other "existing" glyph(s). So the solution could be checking if the target glyph(s) exists in the font, and throwing an Exception, saying that the font doesn't support the actually correct glyph(s). Would this behavior break any standard, @vk-github18 ?
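The check proposed above might look roughly like the following sketch. It uses only the JDK and is not part of OpenPDF: `StrictGlyphCheck` and its methods are hypothetical names, and the `IntPredicate` is an assumed stand-in for whatever glyph lookup the font class offers. The idea is to test the substituted text against the font before rendering and fail loudly instead of silently dropping characters.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of a "strict" mode; fontHasGlyph stands in for a real font lookup.
final class StrictGlyphCheck {

    /** Returns the first code point the font cannot render, or -1 if all are covered. */
    static int firstMissing(String shaped, IntPredicate fontHasGlyph) {
        return shaped.codePoints()
                .filter(cp -> !fontHasGlyph.test(cp))
                .findFirst()
                .orElse(-1);
    }

    /** Throws if any code point of the substituted text has no glyph in the font. */
    static void requireGlyphs(String shaped, IntPredicate fontHasGlyph) {
        int missing = firstMissing(shaped, fontHasGlyph);
        if (missing >= 0) {
            throw new IllegalArgumentException(
                    String.format("Font has no glyph for U+%04X", missing));
        }
    }
}
```

Reporting the offending code point in the message would have pointed directly at the reshaped character in this issue, rather than leaving a gap in the output.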

asturio avatar May 08 '24 08:05 asturio

@asturio, sorry, I am not an expert, but I do have some opinions. There are three ways to lay out Arabic script in OpenPDF: 1. the old iText substitution method, 2. using FOP, 3. using HarfBuzz via AWT and LayoutProcessor. Methods 2 and 3 use the transformation tables inside the OpenType font. Only the first (old) variant was tested by @bfryer-snap. An optional "strict" layout mode that throws an exception instead of producing a wrong layout seems a valid option to me. However, I would argue for deprecating the old iText substitution instead of investing more effort in it, and for using the OpenType glyph substitution, ordering and positioning methods.

vk-github18 avatar May 08 '24 18:05 vk-github18

So only supporting FOP and HarfBuzz in AWT, @vk-github18 . Understood. Deprecation should be of ArabicLigaturizer, right?

@bfryer-snap, can you try one of the alternative methods and give some feedback on whether it works better?

  • Use LayoutProcessor
  • Or use FOP

asturio avatar May 27 '24 16:05 asturio

@asturio I think that ArabicLigaturizer should be deprecated, because no one is maintaining it. Or maybe not deprecated, but no more effort should be invested in maintaining it. With FOP, and HarfBuzz via AWT in Java, OpenPDF would use existing solutions instead of trying to maintain its own; only the adapter has to be maintained.

vk-github18 avatar May 27 '24 21:05 vk-github18