OpenPDF
ColumnText + Arabic reshaping causes Arabic characters to no longer appear. Removing ColumnText makes the characters appear.
Describe the bug
Arabic reshaping causes some characters not to be rendered in the PDF with certain fonts. If I do not use ColumnText, the characters appear.
To Reproduce
We can use a modified version of the RightToLeft.java example to show the issue.
Here it is working:
public static void main(String[] args) {
try {
// step 1
Document document = new Document(PageSize.A4, 50, 50, 50, 50);
// step 2
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("sample.pdf"));
// step 3
document.open();
// step 4
PdfContentByte cb = writer.getDirectContent();
// Font can be found here:
// https://fonts.google.com/noto/specimen/Noto+Sans+Arabic?sort=popularity&subset=arabic
BaseFont bf = BaseFont.createFont("NotoSansArabic-regular.ttf", BaseFont.IDENTITY_H, true);
ColumnText ct = new ColumnText(cb);
ct.setSimpleColumn(100, 100, 500, 800, 24, Element.ALIGN_LEFT);
ct.setSpaceCharRatio(PdfWriter.NO_SPACE_CHAR_RATIO);
ct.setLeading(0, 1);
ct.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
ct.setAlignment(Element.ALIGN_CENTER);
ct.addText(new Chunk(ar1, new Font(bf, 16)));
ct.addText(new Chunk(ar2, new Font(bf, 16, Font.NORMAL, Color.red)));
ct.go();
ct.setAlignment(Element.ALIGN_JUSTIFIED);
ct.addText(new Chunk(ar3, new Font(bf, 12)));
ct.go();
ct.setAlignment(Element.ALIGN_CENTER);
ct.addText(new Chunk(ar4, new Font(bf, 14)));
ct.go();
// step 5
document.close();
} catch (Exception e) {
e.printStackTrace();
}
}
/**
* arabic text
*/
public static String ar1 = "\u0623\u0648\u0631\u0648\u0628\u0627, \u0628\u0631\u0645\u062c\u064a\u0627\u062a "
+ "\u0627\u0644\u062d\u0627\u0633\u0648\u0628 + \u0627\u0646\u062a\u0631\u0646\u064a\u062a :\n\n";
/**
* arabic text
*/
public static String ar2 = "\u062a\u0635\u0628\u062d \u0639\u0627\u0644\u0645\u064a\u0627 \u0645\u0639 "
+ "\u064a\u0648\u0646\u064a\u0643\u0648\u062f\n\n";
/**
* arabic text
*/
public static String ar3 = "\u062a\u0633\u062c\u0651\u0644 \u0627\u0644\u0622\u0646 \u0644\u062d\u0636\u0648\u0631 "
+ "\u0627\u0644\u0645\u0624\u062a\u0645\u0631 \u0627\u0644\u062f\u0648\u0644\u064a "
+ "\u0627\u0644\u0639\u0627\u0634\u0631 \u0644\u064a\u0648\u0646\u064a\u0643\u0648\u062f, "
+ "\u0627\u0644\u0630\u064a \u0633\u064a\u0639\u0642\u062f \u0641\u064a 10-12 \u0622\u0630\u0627\u0631 "
+ "1997 \u0628\u0645\u062f\u064a\u0646\u0629 \u0645\u0627\u064a\u0646\u062a\u0633, "
+ "\u0623\u0644\u0645\u0627\u0646\u064a\u0627. \u0648\u0633\u064a\u062c\u0645\u0639 "
+ "\u0627\u0644\u0645\u0624\u062a\u0645\u0631 \u0628\u064a\u0646 \u062e\u0628\u0631\u0627\u0621 "
+ "\u0645\u0646 \u0643\u0627\u0641\u0629 \u0642\u0637\u0627\u0639\u0627\u062a "
+ "\u0627\u0644\u0635\u0646\u0627\u0639\u0629 \u0639\u0644\u0649 \u0627\u0644\u0634\u0628\u0643\u0629 "
+ "\u0627\u0644\u0639\u0627\u0644\u0645\u064a\u0629 \u0627\u0646\u062a\u0631\u0646\u064a\u062a "
+ "\u0648\u064a\u0648\u0646\u064a\u0643\u0648\u062f, \u062d\u064a\u062b \u0633\u062a\u062a\u0645, "
+ "\u0639\u0644\u0649 \u0627\u0644\u0635\u0639\u064a\u062f\u064a\u0646 "
+ "\u0627\u0644\u062f\u0648\u0644\u064a \u0648\u0627\u0644\u0645\u062d\u0644\u064a \u0639\u0644\u0649 "
+ "\u062d\u062f \u0633\u0648\u0627\u0621 \u0645\u0646\u0627\u0642\u0634\u0629 \u0633\u0628\u0644 "
+ "\u0627\u0633\u062a\u062e\u062f\u0627\u0645 \u064a\u0648\u0646\u0643\u0648\u062f \u0641\u064a "
+ "\u0627\u0644\u0646\u0638\u0645 \u0627\u0644\u0642\u0627\u0626\u0645\u0629 "
+ "\u0648\u0641\u064a\u0645\u0627 \u064a\u062e\u0635 "
+ "\u0627\u0644\u062a\u0637\u0628\u064a\u0642\u0627\u062a "
+ "\u0627\u0644\u062d\u0627\u0633\u0648\u0628\u064a\u0629, \u0627\u0644\u062e\u0637\u0648\u0637, "
+ "\u062a\u0635\u0645\u064a\u0645 \u0627\u0644\u0646\u0635\u0648\u0635 "
+ "\u0648\u0627\u0644\u062d\u0648\u0633\u0628\u0629 \u0645\u062a\u0639\u062f\u062f\u0629 "
+ "\u0627\u0644\u0644\u063a\u0627\u062a.\n\n";
/**
* arabic text
*/
public static String ar4 = "\u0639\u0646\u062f\u0645\u0627 \u064a\u0631\u064a\u062f "
+ "\u0627\u0644\u0639\u0627\u0644\u0645 \u0623\u0646 \u064a\u062a\u0643\u0644\u0651\u0645, "
+ "\u0641\u0647\u0648 \u064a\u062a\u062d\u062f\u0651\u062b \u0628\u0644\u063a\u0629 "
+ "\u064a\u0648\u0646\u064a\u0643\u0648\u062f\n\n";
Output:
If we leave everything exactly the same as above, except that we change "NotoSansArabic-regular.ttf" to a different font, such as "GraphikArabic-Regular.ttf", then we get the following output:
The problem can be seen most easily by looking at the lower left section of the main paragraph. In the NotoSansArabic font, we can see a word that spans multiple characters. In the GraphikArabic font, the right half of that word is missing and only the last two characters seem to be rendered.
A specific character that seems to be rendered by NotoSansArabic and not by GraphikArabic is \u0627.
I thought that GraphikArabic was missing the \u0627 character altogether, but if I use the following code, I can generate it just fine:
public static void main(String[] args) {
try {
// step 1
Document document = new Document(PageSize.A4, 50, 50, 50, 50);
// step 2
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("sample.pdf"));
// step 3
document.open();
// step 4
PdfContentByte cb = writer.getDirectContent();
BaseFont bf = BaseFont.createFont("GraphikArabic-regular.ttf", BaseFont.IDENTITY_H, true);
Font font = new Font(bf);
document.add(new Paragraph("\u0627 \u0627 \u0627", font));
// step 5
document.close();
} catch (Exception e) {
e.printStackTrace();
}
}
screenshot of the output being as expected
System.out.println(bf.charExists('\u0627')); also outputs true when using the GraphikArabic font. I assume that BaseFont::charExists(char) is the way to determine whether the given char should show in the PDF.
I believe the issue is that characters like \u0627 are being reshaped into quite different characters in another Unicode block, and that the font does not support the reshaped characters. I believe this because, when debugging, I can see that some characters such as 0x0627 become 0x0FE8E. This transformation happens here: https://github.com/LibrePDF/OpenPDF/blob/master/openpdf/src/main/java/com/lowagie/text/pdf/BidiLine.java#L197
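To illustrate the hypothesis, here is a small sketch that uses only the Java standard library, not OpenPDF. It shows that 0x0627 and the reshaped 0xFE8E sit in different Unicode blocks, and that NFKC compatibility normalization maps the reshaped form back to the base letter. The class and method names are made up for this example.

```java
import java.text.Normalizer;

public class PresentationFormDemo {

    /** Returns the Unicode block a code point belongs to. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    /** Maps presentation forms back to base letters via NFKC compatibility normalization. */
    static String toBaseLetters(String shaped) {
        return Normalizer.normalize(shaped, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        int base = 0x0627;   // ARABIC LETTER ALEF
        int shaped = 0xFE8E; // ARABIC LETTER ALEF FINAL FORM

        System.out.printf("U+%04X is in block %s%n", base, blockOf(base));
        System.out.printf("U+%04X is in block %s%n", shaped, blockOf(shaped));

        String restored = toBaseLetters(new String(Character.toChars(shaped)));
        System.out.printf("NFKC(U+%04X) = U+%04X%n", shaped, restored.codePointAt(0));
    }
}
```

A font that only covers the 0x0600 to 0x06FF range would therefore have a glyph for the first code point but not for the second.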
Expected behavior
I expect ColumnText and adding elements to a document directly to produce the same output, OR I expect to be able to skip the "reshaping" process so that I can continue to use a font which supports the 0x0600 to 0x06FF character range.
Screenshots
Screenshots added above.
System (please complete the following information):
- OS: MacOS Ventura 13.4.1
- Used Font: NotoSansArabic-Regular.ttf, GraphikArabic-Regular.ttf
Additional context
Thank you for reporting this bug. Please submit a pull request with a solution to this problem if you can.
Neither 0x0627 nor 0x0FE8E can be found in BidiLine.java
Seems to be a problem with the commercial font GraphikArabic.
May be related to #938
See also https://github.com/LibrePDF/OpenPDF/wiki/Accents,-DIN-91379,-non-Latin-scripts
Hi @vk-github18, I was able to print 0x0627 with GraphikArabic. I could not print 0x0FE8E. It seems that the library converts characters from form A to form B based on the surrounding characters, in ArabicLigaturizer.java. We can see 0x0627 is defined here: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L100
We can also see that 0x0627 is included in a row of charTable: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L132
and we can see that charTable is used in two methods that seem to be doing some sort of transformation: https://github.com/LibrePDF/OpenPDF/blob/1.3-java8/openpdf/src/main/java/com/lowagie/text/pdf/ArabicLigaturizer.java#L212-L253
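For readers unfamiliar with this kind of table-driven shaping, the following is a heavily simplified sketch of the idea, not the actual ArabicLigaturizer code. It assumes each row holds a base letter followed by its isolated, final, initial and medial presentation forms, and that rows with only three entries are letters that never join to the following letter. The class name, the three-letter table and the joining rule are assumptions for illustration. Ligatures such as lam-alef and diacritics are ignored.

```java
import java.util.HashMap;
import java.util.Map;

public class ShapingSketch {

    // Rows: base letter, isolated, final[, initial, medial].
    // Code points are the standard Unicode presentation forms.
    private static final int[][] FORMS = {
        {0x0627, 0xFE8D, 0xFE8E},                 // ALEF (joins only to the previous letter)
        {0x0628, 0xFE8F, 0xFE90, 0xFE91, 0xFE92}, // BEH
        {0x0644, 0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0}, // LAM
    };

    private static final Map<Character, int[]> BY_BASE = new HashMap<>();

    static {
        for (int[] row : FORMS) {
            BY_BASE.put((char) row[0], row);
        }
    }

    /** Replaces each known base letter with a presentation form chosen from its neighbours. */
    static String shape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int[] row = BY_BASE.get(c);
            if (row == null) {
                out.append(c); // unknown characters pass through unchanged
                continue;
            }
            int[] prev = i > 0 ? BY_BASE.get(text.charAt(i - 1)) : null;
            int[] next = i + 1 < text.length() ? BY_BASE.get(text.charAt(i + 1)) : null;

            boolean joinsPrev = prev != null && prev.length == 5; // previous letter can connect forward
            boolean joinsNext = row.length == 5 && next != null;  // this letter can connect forward

            int index;
            if (joinsPrev && joinsNext) {
                index = 4; // medial
            } else if (joinsNext) {
                index = 3; // initial
            } else if (joinsPrev) {
                index = 2; // final
            } else {
                index = 1; // isolated
            }
            out.append((char) row[index]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // BEH followed by ALEF: the ALEF (0x0627) becomes its final form.
        for (char c : shape("\u0628\u0627").toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

In this sketch, an alef that follows a joining letter is replaced by 0xFE8E, which matches the substitution observed while debugging.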
So the question is:
- When a font does not support 100% of the Arabic character set but does include the "base" character set (I am not familiar enough with Arabic character sets to use the proper terms), should the library still try to convert a character into something that the font does not support?
Or even more generally speaking:
- Should the OpenPDF library transform the given characters into another set of characters even if the resulting characters are not supported by the font?
An ArabicLigaturizer transformation bypass flag/option would make it the client's responsibility to know the limitations of their font. At the moment, the only option is to switch fonts entirely.
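One way such an option could behave is sketched below, using only the Java standard library. This is not an existing OpenPDF feature. The `hasGlyph` predicate is a stand-in for a font coverage check (something like `bf::charExists` might serve that role), and `ReshapeFallback` and `keepOnlySupportedForms` are hypothetical names. Any reshaped code point the font cannot display is mapped back to its base letter via NFKC normalization, so the text stays visible but is not contextually shaped.

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

public class ReshapeFallback {

    /**
     * Keeps reshaped code points the font can display and replaces the others
     * with their NFKC compatibility decomposition (the base letters).
     */
    static String keepOnlySupportedForms(String shaped, IntPredicate hasGlyph) {
        StringBuilder out = new StringBuilder(shaped.length());
        shaped.codePoints().forEach(cp -> {
            if (hasGlyph.test(cp)) {
                out.appendCodePoint(cp);
            } else {
                String single = new String(Character.toChars(cp));
                out.append(Normalizer.normalize(single, Normalizer.Form.NFKC));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical font that only covers the basic Arabic block 0x0600 to 0x06FF.
        IntPredicate basicArabicOnly = cp -> cp >= 0x0600 && cp <= 0x06FF;

        String shaped = "\uFE91\uFE8E"; // initial BEH + final ALEF
        String result = keepOnlySupportedForms(shaped, basicArabicOnly);
        result.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The trade-off is that the fallback text is unshaped, so letters would only appear joined if the PDF viewer or the font does its own shaping.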
What is the result if you use
import com.lowagie.text.pdf.LayoutProcessor; ... LayoutProcessor.enableKernLiga();
as explained in https://github.com/LibrePDF/OpenPDF/wiki/Accents,-DIN-91379,-non-Latin-scripts?
To your question: if the commercial font you used does not support Arabic properly, you should open an issue with the producer of the font. There is not much a library can do if the font is not correct. Introducing special handling for incomplete or incorrect fonts is not the way to go. The transformations for Arabic scripts are mandatory.
For a third option see https://github.com/LibrePDF/OpenPDF/wiki/Multi-byte-character-language-support-with-TTF-fonts
In my naive view, if you have a glyph substitution, you must replace "existing" glyph(s) with other "existing" glyph(s). So the solution could be checking if the target glyph(s) exists in the font, and throwing an Exception, saying that the font doesn't support the actually correct glyph(s). Would this behavior break any standard, @vk-github18 ?
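A minimal sketch of that check, again with only the Java standard library and not actual OpenPDF code: `hasGlyph` stands in for a font coverage lookup, and `StrictGlyphCheck` and `requireGlyphs` are hypothetical names.

```java
import java.util.function.IntPredicate;

public class StrictGlyphCheck {

    /**
     * Throws if the shaped text contains a code point the font cannot display.
     * The exception names the first offending code point.
     */
    static void requireGlyphs(String shaped, IntPredicate hasGlyph) {
        shaped.codePoints()
              .filter(cp -> !hasGlyph.test(cp))
              .findFirst()
              .ifPresent(cp -> {
                  throw new IllegalStateException(
                          String.format("Font has no glyph for U+%04X", cp));
              });
    }

    public static void main(String[] args) {
        // Hypothetical font that only covers the basic Arabic block 0x0600 to 0x06FF.
        IntPredicate basicArabicOnly = cp -> cp >= 0x0600 && cp <= 0x06FF;

        requireGlyphs("\u0627", basicArabicOnly); // base ALEF: passes silently
        try {
            requireGlyphs("\uFE8E", basicArabicOnly); // final-form ALEF: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing loudly like this would at least turn silently missing characters into an error the caller can act on.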
@asturio, sorry, I am not an expert, but I do have some opinions. There are three ways to lay out Arabic scripts in OpenPDF: 1. the old iText substitution method, 2. using FOP, 3. using HarfBuzz via AWT and LayoutProcessor. Methods 2 and 3 use the transformation tables inside the OpenType font. Only the first (old) variant was tested by @bfryer-snap. Having an optional "strict" layout mode that throws an exception instead of producing a wrong layout seems a valid option to me. However, I would plead for deprecating the old iText substitution instead of investing more effort in it, and for using the OpenType glyph substitution, ordering and positioning methods.
So only supporting FOP and HarfBuzz in AWT, @vk-github18 . Understood. Deprecation should be of ArabicLigaturizer, right?
@bfryer-snap, can you try one of the alternative methods and give some feedback on whether it works better?
- Use LayoutProcessor
- Or using FOP.
@asturio I think that ArabicLigaturizer should be deprecated, because no one is maintaining it. Or maybe not deprecated, but we should not invest effort in maintaining it. With FOP, and HarfBuzz via AWT in Java, OpenPDF would use existing solutions instead of trying to maintain its own; only the adapter would have to be maintained.