A lot of additional blanks will be generated

Open mueller-jens opened this issue 5 years ago • 1 comments

I am trying to convert an HTML file to docx using the library. When I do, every blank in the template is converted into a blank in the document. I used a template like this:

       String html="    <html><body><b>Type:</b> <span style='font-size: 10.0pt; font-family: \"Arial\", \"sans-serif\"'>TEXT</span>\n" + 
            "            <br/>\n" + 
            "            <span style='font-size: 10.0pt; font-family: \"Arial\", \"sans-serif\"'>\n" + 
            "               <b> another text: </b><span>10.0</span>\n" + 
            "            </span></body></html>";

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
        
        XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
         
        wordMLPackage.getMainDocumentPart().getContent().addAll( 
            XHTMLImporter.convert( html, null) );
        String docx = XmlUtils.marshaltoString(wordMLPackage
                .getMainDocumentPart().getJaxbElement(), true, true);

        FileOutputStream outputStream = new FileOutputStream("C:/jmu/tmp/generated.docx");
        Save saver = new Save(wordMLPackage); 
        saver.save(outputStream);

And the result looks like:

Type: TEXT             
                              another text: 10.0

Expected:

Type: TEXT
another text: 10.0

Apr 16 '20 07:04 mueller-jens

I had the same issue, including when using an img tag.

Looking into the generated docx, I found that the attribute xml:space="preserve" on the text elements seems to be the reason. This attribute is added in XHTMLImporterImpl.java.
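If you want to check whether your own output is affected, here is a minimal sketch that counts the text elements carrying that attribute. It uses only the JDK's XML parser; the class and method names (PreserveCounter, countPreservedTextElements) are made up for illustration and are not part of docx4j. I assume the input is the main document XML, for example the string returned by XmlUtils.marshaltoString in the original snippet.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
final class PreserveCounter {
    // Namespace of WordprocessingML elements such as w:t.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Namespace implicitly bound to the "xml" prefix (xml:space).
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    /** Counts w:t elements whose xml:space attribute is "preserve". */
    static int countPreservedTextElements(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            int count = 0;
            for (int i = 0; i < texts.getLength(); i++) {
                Element t = (Element) texts.item(i);
                if ("preserve".equals(t.getAttributeNS(XML_NS, "space"))) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }
}
```

A non-zero count for text that consists mostly of template indentation would be consistent with the explanation above.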

I would argue for removing this hardcoded "preserve", or making it configurable, because whitespace in XML and HTML is ignored in most cases. If one really wants a space in an unusual place, one could use a non-breaking space.
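Since the extra blanks come from template indentation, a different option (not what I ended up doing) would be to strip that indentation from the HTML string before handing it to the importer. The following is only a rough sketch under that assumption. The class and method names are invented for illustration, and the regex is deliberately naive: it would also touch a pre block, or a '>' followed by a line break inside text or an attribute value, and it drops line breaks between inline elements that a browser would render as a single space.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper, not part of docx4j.
final class HtmlIndentationSketch {
    // Whitespace between '>' and '<' that contains at least one line break,
    // i.e. what is typically produced by indenting a template.
    private static final Pattern INDENT_BETWEEN_TAGS =
            Pattern.compile(">[ \\t]*(?:\\r?\\n[ \\t]*)+<");

    /** Removes leading/trailing whitespace and line-break indentation between tags. */
    static String stripIndentation(String html) {
        return INDENT_BETWEEN_TAGS.matcher(html.trim()).replaceAll("><");
    }
}
```

Whitespace without a line break, such as the single space between `</b>` and `<span>` in the original template, is left untouched, so intentional spacing on one line survives.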

Anyway, my workaround is to remove the "preserve" attribute from the generated content:

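    // Text and ContentAccessor are the classes from org.docx4j.wml.
    // Walks the content tree and clears space="preserve" on every Text it finds.
    private static void removeSpacePreserveRecursive(Object obj)
    {
        if (obj instanceof Text)
        {
            var text = (Text) obj;
            if ("preserve".equals(text.getSpace()))
            {
                // null means the attribute is omitted when the document is saved
                text.setSpace(null);
            }
        }
        else if (obj instanceof ContentAccessor)
        {
            // Containers such as the body, paragraphs and runs: descend into children
            ContentAccessor contentAccessor = (ContentAccessor) obj;
            for (Object child : contentAccessor.getContent())
            {
                removeSpacePreserveRecursive(child);
            }
        }
    }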

You can call this method, for example, on wordMLPackage.getMainDocumentPart().getJaxbElement().getBody().

Jan 06 '21 08:01 achimmihca