odftoolkit

OdfTableCell.getDisplayText() includes comment timestamp and text.

Open ttaomae opened this issue 2 years ago • 3 comments

If an ODS spreadsheet cell contains a comment, the OdfTableCell.getDisplayText() method returns a string which looks something like: "2023-05-22T00:00:00CommentCell" where I assume the timestamp is the time of the comment, "Comment" is the comment text and "Cell" is the cell text.

Is this intentional? And if so, is there any way to obtain just the cell text?
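
A minimal sketch of why the result may look that way, assuming the cell is stored roughly as an <office:annotation> (holding a <dc:date> and a <text:p>) followed by the cell's own <text:p>. It uses only the JDK's DOM parser, not ODFDOM, and the class name, method names and sample XML are made up for illustration. Concatenating all descendant text of such a cell reproduces the string described above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; plain JDK DOM, not the ODFDOM API.
final class AnnotatedCellDemo {
    // Assumed shape of a commented cell, written without whitespace between elements.
    static final String CELL_XML =
        "<table:table-cell"
        + " xmlns:table=\"urn:oasis:names:tc:opendocument:xmlns:table:1.0\""
        + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
        + " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\""
        + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
        + "<office:annotation>"
        + "<dc:date>2023-05-22T00:00:00</dc:date>"
        + "<text:p>Comment</text:p>"
        + "</office:annotation>"
        + "<text:p>Cell</text:p>"
        + "</table:table-cell>";

    // Parses the XML string and returns its root element.
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    // Concatenates every descendant text node, annotation included.
    static String allText(String xml) {
        return parse(xml).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(allText(CELL_XML)); // 2023-05-22T00:00:00CommentCell
    }
}
```

If the real file has this shape, any text extraction that does not skip the annotation element would be expected to produce the same kind of output.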

ttaomae, May 22 '23 20:05

Hi, the easiest way is to create a simple test document (with a comment and cell text) and adapt an existing test to debug what the code does. I believe you are on the right track. I must admit I don't remember the details, so this is exactly how I would proceed; you could also grep the odfdom test folder for "cell" (or "comment") to see whether any test already covers this.

Happy hunting! Feel free to add your findings or further questions here (or perhaps others have something to add).

svanteschubert, May 24 '23 12:05

I experimented with the OdfTableCell API and I wasn't able to find anything that does what I need. I also tried searching and didn't find any tests related to comments or annotations (since comments are represented as an <office:annotation>).

Based on the implementation of OdfTableCell.getDisplayText(), I wrote the following method which does what I need in the cases I've tested so far.

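import org.odftoolkit.odfdom.doc.table.OdfTableCell;
import org.odftoolkit.odfdom.dom.element.office.OfficeAnnotationElement;
import org.odftoolkit.odfdom.incubator.doc.text.OdfTextParagraph;
import org.odftoolkit.odfdom.incubator.doc.text.OdfWhitespaceProcessor;

// Returns the cell's own text: annotations are skipped and paragraphs are joined with "\n".
static String getCellText(OdfTableCell cell)
{
    var result = new StringBuilder();
    var whitespaceProcessor = new OdfWhitespaceProcessor();
    var nodes = cell.getOdfElement().getChildNodes();

    for (int i = 0; i < nodes.getLength(); i++) {
        var node = nodes.item(i);
        // Ignore comments.
        if (!(node instanceof OfficeAnnotationElement)) {
            // Add a line break before every paragraph except the first.
            if (result.length() != 0 && node instanceof OdfTextParagraph) {
                result.append("\n");
            }
            result.append(whitespaceProcessor.getText(node));
        }
    }

    return result.toString();
}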

It is still not clear to me whether getDisplayText() is behaving as expected. The docs say that it returns "the text displayed in this cell", which I would consider inaccurate or at least misleading for a few reasons.

  • I would argue that the comment is not technically "displayed in [the] cell".
  • I don't think the timestamp is displayed at all, at least in the version of LibreOffice Calc that I am running.
  • Even if there is no comment, the result does not match the cell text when there are multiple paragraphs since it doesn't include line breaks between paragraphs.

ttaomae, May 27 '23 00:05

I wonder why the method is called getDisplayText() and not getTextContent().

The getDisplayText() method https://github.com/tdf/odftoolkit/blob/master/odfdom/src/main/java/org/odftoolkit/odfdom/doc/table/OdfTableCell.java#L697 calls https://github.com/tdf/odftoolkit/blob/master/odfdom/src/main/java/org/odftoolkit/odfdom/incubator/doc/text/OdfWhitespaceProcessor.java#L49 which incorrectly considers only direct children and not further descendants.
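
To illustrate the general difference between looking only at direct children and walking all descendants, here is a small sketch using the JDK's DOM API. It is not the OdfWhitespaceProcessor code, and the sample paragraph with a nested <text:span>, as well as all class and method names, are assumptions chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; plain JDK DOM, not the ODFDOM API.
final class ChildrenVsDescendantsDemo {
    // Assumed sample: a paragraph whose middle word sits inside a nested span.
    static final String PARAGRAPH_XML =
        "<text:p xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
        + "plain <text:span>styled</text:span> tail"
        + "</text:p>";

    // Parses the XML string and returns its root element.
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    // Collects only text nodes that are direct children of the element.
    static String childTextOnly(Node element) {
        StringBuilder result = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                result.append(child.getNodeValue());
            }
        }
        return result.toString();
    }

    // Recurses into child elements so that nested text is included.
    static String descendantText(Node element) {
        StringBuilder result = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                result.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.append(descendantText(child));
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Element paragraph = parse(PARAGRAPH_XML);
        System.out.println(childTextOnly(paragraph));  // "plain  tail" (span text is lost)
        System.out.println(descendantText(paragraph)); // "plain styled tail"
    }
}
```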

As stated in #229: OdfElement has the base functionality to concatenate the text content: https://github.com/tdf/odftoolkit/blob/master/odfdom/src/main/java/org/odftoolkit/odfdom/pkg/OdfElement.java#L2633 but every element that contains text nodes, such as OdfTextSpan, should override this method so that its implementation defines the element's specific behavior.
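
As a rough sketch of that idea (a base class concatenates child text, and specific elements override the method with their own behavior), the following uses entirely hypothetical stand-in classes. None of them are ODFDOM types; Space and LineBreak only loosely mirror the ODF elements <text:s> and <text:line-break>.

```java
import java.util.List;

// Hypothetical demo; these nested classes are stand-ins, not ODFDOM classes.
final class TextContentOverrideDemo {
    // Base behavior: concatenate the text of all children in document order.
    static class Element {
        private final List<Object> children; // each child is a String or a nested Element

        Element(Object... children) {
            this.children = List.of(children);
        }

        String getTextContent() {
            StringBuilder result = new StringBuilder();
            for (Object child : children) {
                result.append(child instanceof Element
                    ? ((Element) child).getTextContent()
                    : child.toString());
            }
            return result.toString();
        }
    }

    // Stand-in for a <text:s>-like element: contributes a run of spaces.
    static class Space extends Element {
        private final int count;

        Space(int count) {
            this.count = count;
        }

        @Override
        String getTextContent() {
            return " ".repeat(count);
        }
    }

    // Stand-in for a <text:line-break>-like element: contributes a newline.
    static class LineBreak extends Element {
        @Override
        String getTextContent() {
            return "\n";
        }
    }

    // Builds a small paragraph-like tree and returns its text.
    static String sample() {
        Element paragraph = new Element("A", new Space(2), "B", new LineBreak(), "C");
        return paragraph.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(sample()); // prints "A  B", then "C" on the next line
    }
}
```

With this shape, callers only ever invoke getTextContent() on the container, and each element type decides what it contributes.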

Finally, there is a third piece of functionality in https://github.com/tdf/odftoolkit/blob/master/odfdom/src/main/java/org/odftoolkit/odfdom/incubator/doc/text/OdfTextExtractor.java

These approaches should (and will) be harmonized to avoid duplicated implementations.

svanteschubert, May 30 '23 12:05