A lot of additional blanks will be generated
I am trying to convert an HTML file to docx using the library. When I do, every blank in the template is converted into a blank in the document. I used a template like this:
String html=" <html><body><b>Type:</b> <span style='font-size: 10.0pt; font-family: \"Arial\", \"sans-serif\"'>TEXT</span>\n" +
" <br/>\n" +
" <span style='font-size: 10.0pt; font-family: \"Arial\", \"sans-serif\"'>\n" +
" <b> another text: </b><span>10.0</span>\n" +
" </span></body></html>";
// Create an empty package and import the HTML into its main document part.
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
wordMLPackage.getMainDocumentPart().getContent().addAll(
        XHTMLImporter.convert(html, null));
// Dump the generated document XML for inspection.
String docx = XmlUtils.marshaltoString(wordMLPackage
        .getMainDocumentPart().getJaxbElement(), true, true);
// Write the package to disk.
FileOutputStream outputStream = new FileOutputStream("C:/jmu/tmp/generated.docx");
Save saver = new Save(wordMLPackage);
saver.save(outputStream);
And the result looks like:
Type: TEXT
another text: 10.0
Expected:
Type: TEXT another text: 10.0
I had the same issue, also when using an img tag.
Looking into the generated docx, I found that the attribute xml:space="preserve" on the text elements seems to be the reason. This attribute is added in XHTMLImporterImpl.java.
I would argue for removing this hardcoded "preserve" or making it configurable, because whitespace in XML and HTML is ignored in most cases. If one really wants a space in an unusual place, one could use a non-breaking space.
Anyway, my workaround is to remove the "preserve" attribute from the generated content:
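import org.docx4j.XmlUtils;
import org.docx4j.wml.ContentAccessor;
import org.docx4j.wml.Text;

private static void removeSpacePreserveRecursive(Object obj)
{
    // Content lists may hold JAXBElement wrappers, so unwrap before checking the type.
    obj = XmlUtils.unwrap(obj);
    if (obj instanceof Text)
    {
        var text = (Text) obj;
        if ("preserve".equals(text.getSpace()))
        {
            // Drop the attribute so Word falls back to default whitespace handling.
            text.setSpace(null);
        }
    }
    else if (obj instanceof ContentAccessor)
    {
        // Paragraphs, runs, tables etc. expose their children via getContent().
        ContentAccessor contentAccessor = (ContentAccessor) obj;
        for (Object child : contentAccessor.getContent())
        {
            removeSpacePreserveRecursive(child);
        }
    }
}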
You can call this method, for example, on wordMLPackage.getMainDocumentPart().getJaxbElement().getBody().
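A different angle, assuming the extra blanks come from the indentation and line breaks inside the HTML string itself, is to collapse whitespace runs before the HTML reaches the importer. The following is only a sketch: the class HtmlWhitespace and the method collapseWhitespace are hypothetical names, not part of docx4j, and it uses nothing but the JDK. It does not remove the "preserve" attribute; it only reduces each run of blanks to a single space, roughly the way a browser would render it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j: normalizes whitespace in an HTML string.
final class HtmlWhitespace {

    // Matches runs of spaces, tabs and line breaks. A non-breaking space (U+00A0)
    // is deliberately not included, so intentional spacing survives.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("[ \\t\\r\\n]+");

    // Replaces every whitespace run with a single space and trims both ends.
    public static String collapseWhitespace(String html) {
        return WHITESPACE_RUN.matcher(html).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = " <html><body><b>Type:</b> <span>TEXT</span>\n"
                + "    <br/>\n"
                + "    <span>\n"
                + "        <b> another text: </b><span>10.0</span>\n"
                + "    </span></body></html>";
        System.out.println(collapseWhitespace(html));
    }
}
```

The result of collapseWhitespace(html) would then be passed to XHTMLImporter.convert in place of html. Note that this simple approach also touches whitespace inside attribute values and inside elements such as pre, so it only fits templates where that does not matter.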