odftoolkit icon indicating copy to clipboard operation
odftoolkit copied to clipboard

Incorrect behavior of OdfElement.getTextContent()

Open ttaomae opened this issue 2 years ago • 14 comments

While investigating #220, I came across OdfTableCell.getOdfElement().getTextContent(), which is defined by org.w3c.dom.Node.getTextContent() and implemented in OdfElement. My interpretation of the Node.getTextContent() documentation is that it should recursively retrieve the text contents of all child nodes and concatenate them together. However, the OdfElement implementation only looks at the children, rather than all descendants.

I think it is also worth noting that OdfWhitespaceParser.getText(Node) appears to do what OdfElement.getTextContent() is supposed to do. Or at least something very similar.

ttaomae avatar May 27 '23 00:05 ttaomae

i don't know the ODFDOM code well, but as a general rule about ODF: inside of a paragraph (text:p/text:h) and its child elements, there is character content which is part of the paragraph text (most elements), and there is character content which is part of some nested object and not part of the paragraph text.

there are a bunch of elements for drawing shapes, draw:frame etc. and also office:annotation, that should comprise most of the latter case - these should probably not be included in any API that promises some sort of text content of "the paragraph" or "the table cell".

the criteria for what is considered paragraph text are in section "6.1.2 White Space Characters" of ODF 1.3 part 3, and i think the text of a table cell should simply be the concatenation of its paragraphs.

mistmist avatar May 30 '23 09:05 mistmist

You are absolutely right, Todd.

OdfElement has the base functionality to concatenate the text content: https://github.com/tdf/odftoolkit/blob/master/odfdom/src/main/java/org/odftoolkit/odfdom/pkg/OdfElement.java#L2633

but the every text node containing element like OdfTextSpan should override this method and define its specific behavior. By this method implementation, the specific behavior - Michael has pointed out it his previous comment - would be reflected.

I am uncertain atm, if we have a way in the ODF Specification to differentiate this behavior, if not it would be worth to add it.

Taking this over issue over! Will aim to generate this as part of the upcoming release candidate.

Thanks for pointing out!

svanteschubert avatar May 30 '23 10:05 svanteschubert

The ODF specification explains in the bullet list of 6.1.1 all elements containing text that should not be considered, see https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#a_6_1_Basic_Text_Content

The funcationality is currently spread and needs to harmonized, e.g. this whitespace collapsing function is implemented in the characters(..) method of ChangesFileSaxHandler https://github.com/tdf/odftoolkit/blob/master/odfdom/src/main/java/org/odftoolkit/odfdom/changes/ChangesFileSaxHandler.java#L2252

#220 and this #229 are quite related and will likely be fixed in one patch.

svanteschubert avatar May 30 '23 12:05 svanteschubert

I want to note that the following elements should also be considered:

  • text:page-number
  • text:page-count

(and likely all the text field elements in Writer).

I wonder whether a blacklisting approach would be possible i.e. don't extract text only for a subset classes of elements.

brlin-tw avatar Apr 18 '25 05:04 brlin-tw

@ttaomae

I think it is also worth noting that OdfWhitespaceParser.getText(Node) appears to do what OdfElement.getTextContent() is supposed to do.

Can you tell me where this method is defined? Searching "OdfWhitespaceParser" on Google only brought up this GH issue.

brlin-tw avatar Apr 18 '25 06:04 brlin-tw

@brlin-tw "OdfWhitespaceParser" might be a relic from the "Simple API", which was abandoned due to design issues. I was not able to refactor the API due to the heavily copied code of ODFDOM, which was duplicated, and the original parts were refactored at ODFDOM earlier.

Using a black or whitelist of what ODF elements the text content should be considered as text rendered to the end-user depends on what type exists more often. :-)

@mistmist Can we extract such a list from the ODF specification, or do we need additional metadata in the specification, and should I write an issue to add such data to our ODF specification?

svanteschubert avatar Apr 18 '25 08:04 svanteschubert

BTW, sometimes it is not that easy: When you ask for the text content of a paragraph, there might be an image part of that paragraph - its image data (XML) is within the XML of the paragraph, but visually the image is floating aside the paragraph. In this case, the title of that image is within the XML of the paragraph, but likely should not be part of the return of a getText function, right? This might require test documents and testing to find a good decision, which I am not able to spend time on atm, but I just wanted to point out this more complex edge case.

svanteschubert avatar Apr 18 '25 08:04 svanteschubert

@brlin-tw, sorry I think I meant write OdfWhitespaceProcessor.

ttaomae avatar Apr 18 '25 08:04 ttaomae

@svanteschubert @ttaomae

Thanks for the information! I ended up implementing a recursive function with the following logic to (roughly) skip the elements that should be excluded:

// Exclude the elements specified in the section 6.1.1 of the ODF schema specification
// https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#__RefHeading__1415198_253892949
if (child.getNodeName().equals("note:body")
    || child.getNodeName().equals("text:ruby-text")
    || child.getNodeName().equals("office:annotation")
    || child.getNodeName().startsWith("draw:")
    || child.getNodeName().startsWith("svg:")
    || child.getNodeName().equals("office:event-listeners")
    || child.getNodeName().equals("office:binary-data")
    || child.getNodeName().startsWith("dr3d:")) {
    continue;
}

brlin-tw avatar Apr 18 '25 08:04 brlin-tw

PS: Michael just pointed out the following chapter of the spec, which is of interest for our task at hand:

6.1.1 General The paragraph elements text:p and text:h and their descendant elements contain the text content of any document. The character content of a paragraph consists of the character data of the paragraph element and the character data of its descendant elements concatenated in document order, with the following exceptions: Character data contained in the following elements or their descendant elements are not included in the character content of a paragraph: • text:note-body. 6.3.4. • text:ruby-text. 7.5.19 • office:annotation. 14.1 • Drawing shape and frame elements defined in sections 10.3, 10.4, 10.5 and 10.6.

https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#character-content

svanteschubert avatar Apr 28 '25 12:04 svanteschubert

@brlin-tw Would it be possible if you could share your solution as a patch? I just ran into the same problem, where the line breaks of a pretty-printing of a table:table-cell was taking into account of the cell content, which is not the case, as all whitespace should be ignored if not part of an element listed in 6.1.1 (or some descendant element of them) - there should be a state being provided in the recursion!

svanteschubert avatar May 13 '25 11:05 svanteschubert

PS: If there are two places where whitespace handling takes place, we should identify them and consider a merge, but those implementations might be working on different models like SAX and DOM...

svanteschubert avatar May 13 '25 11:05 svanteschubert

Tests/Assertions have been disabled due to this issue and should be enabled again with the fixing patch:

Search for "229" in the following files to find these spots:

  • https://github.com/svanteschubert/odftoolkit/blob/FormatCodeTests/odfdom/src/test/java/org/odftoolkit/odfdom/doc/table/TableCellTest.java#L212
  • https://github.com/svanteschubert/odftoolkit/blob/FormatCodeTests/odfdom/src/test/java/org/odftoolkit/odfdom/doc/table/TableTest.java#L916

svanteschubert avatar May 13 '25 12:05 svanteschubert

@svanteschubert

Could you share your solution as a patch?

The implementation is part of my employer(OSS Integral Institute Co., Ltd.)'s product. After inquiring with my employer, they said it is fine to release it using the same legal terms of ODFDOM, so here's the code:

    // Extract the text content from an element
    private static String extractTextContent(Element element) {
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                // Exclude the elements specified in the section 6.1.1 of ODF schema specification
                // https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#__RefHeading__1415198_253892949
                if (child.getNodeName().equals("note:body")
                    || child.getNodeName().equals("text:ruby-text")
                    || child.getNodeName().equals("office:annotation")
                    || child.getNodeName().startsWith("draw:")
                    || child.getNodeName().startsWith("svg:")
                    || child.getNodeName().equals("office:event-listeners")
                    || child.getNodeName().equals("office:binary-data")
                    || child.getNodeName().startsWith("dr3d:")) {
                    continue;
                }
                text.append(extractTextContent((Element) child));
            }
        }
        return text.toString().trim();
    }

Source attribution: Buo-ren Lin (OSSII) <[email protected]>

I don't think I have the expertise or resources to directly contribute to the project at this time, feel free to take it on.

brlin-tw avatar May 14 '25 02:05 brlin-tw