sparql.anything

How to get comments from DOCX?

Open luigi-asprino opened this issue 1 year ago • 1 comments

Discussed in https://github.com/SPARQL-Anything/sparql.anything/discussions/430

Originally posted by kvistgaard November 18, 2023 From what I tried so far, it seems they are not accessible. Yet, since they are what I mostly need to get from MS Word documents, I'm hoping that there is a way (I saw such an option for spreadsheets) or that it can be implemented.

luigi-asprino avatar Dec 15 '23 08:12 luigi-asprino

@luigi-asprino , any updates on that?

kvistgaard avatar Apr 17 '24 13:04 kvistgaard

7912bb9 implements the extension to extract comments from documents. Comments are interpreted as containers with three slots containing the id, the author and the text of the comment. Comment containers are attached to the paragraph the comment refers to.

See this docx and its RDF counterpart

luigi-asprino avatar Aug 05 '24 14:08 luigi-asprino

@luigi-asprino excellent, I'll give it a try very soon. At first glance it's not obvious how a comment is linked to what it comments on, or how threads are represented (e.g. commentY isResponseTo commentX).

kvistgaard avatar Aug 05 '24 14:08 kvistgaard

7912bb9 implements the extension to extract comments from documents.

Now I see that it's for 1.0. I've been trying with the latest release 0.9.0. When will it be released?

kvistgaard avatar Aug 05 '24 15:08 kvistgaard

You can try it out with the pre-release v1.0-DEV.4 that has just been created.

https://github.com/SPARQL-Anything/sparql.anything/releases/tag/v1.0-DEV.4

luigi-asprino avatar Aug 06 '24 08:08 luigi-asprino

Thanks. Just tested it. Works great. Excellent work. Do you have any thoughts on the threads?

kvistgaard avatar Aug 06 '24 09:08 kvistgaard

I am reopening this issue to try to make comment threads clearer.

luigi-asprino avatar Aug 10 '24 11:08 luigi-asprino

At the moment, comments in the same thread are attached as subsequent slots of the container for the paragraph.

Suppose you have a paragraph "Paragraph1" with two comments ("This is a comment" and "This is a reply").

This results in two slots, rdf:_2 and rdf:_3, referencing the comments:

<http://www.example.org/document/paragraph/2>
        rdf:type  xyz:Paragraph;
        rdf:_1    "Paragraph1";
        rdf:_2    <http://www.example.org/document/Comment_0>;
        rdf:_3    <http://www.example.org/document/Comment_1> .

<http://www.example.org/document/Comment_1>
        rdf:type  xyz:Comment;
        rdf:_1    <http://www.example.org/document/Comment_1/Author>;
        rdf:_2    <http://www.example.org/document/Comment_1/CommentText>;
        rdf:_3    <http://www.example.org/document/Comment_1/CommentId>.

<http://www.example.org/document/Comment_1/CommentId>
        rdf:type  xyz:CommentId;
        rdf:_1    "1" .

<http://www.example.org/document/Comment_1/CommentText>
        rdf:type  xyz:CommentText;
        rdf:_1    "This is a reply" .

<http://www.example.org/document/Comment_1/Author>
        rdf:type  xyz:CommentAuthor;
        rdf:_1    "Luigi Asprino" .


<http://www.example.org/document/Comment_0>
        rdf:type  xyz:Comment;
        rdf:_1    <http://www.example.org/document/Comment_0/Author>;
        rdf:_2    <http://www.example.org/document/Comment_0/CommentText>;
        rdf:_3    <http://www.example.org/document/Comment_0/CommentId>.


<http://www.example.org/document/Comment_0/CommentId>
        rdf:type  xyz:CommentId;
        rdf:_1    "0" .

<http://www.example.org/document/Comment_0/CommentText>
        rdf:type  xyz:CommentText;
        rdf:_1    "This is a comment" .

<http://www.example.org/document/Comment_0/Author>
        rdf:type  xyz:CommentAuthor;
        rdf:_1    "Luigi Asprino" .
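
To read this structure back in order, the slots of a container have to be sorted by the integer in the rdf:_n predicate rather than by string comparison (otherwise rdf:_10 would sort before rdf:_2). A minimal Python sketch, assuming the triples have already been loaded as plain (subject, predicate, object) tuples with prefixed names; the `ordered_slots` helper and the shortened IRIs are hypothetical and not part of SPARQL Anything:

```python
import re

# Matches container membership predicates such as "rdf:_1" or "rdf:_12".
SLOT = re.compile(r"^rdf:_(\d+)$")


def ordered_slots(triples, subject):
    """Return the slot values of `subject`, ordered by slot number."""
    slots = []
    for s, p, o in triples:
        match = SLOT.match(p)
        if s == subject and match:
            slots.append((int(match.group(1)), o))
    # Sort numerically so that rdf:_10 comes after rdf:_2.
    return [o for _, o in sorted(slots)]


# Shortened version of the example above (IRIs abbreviated for readability).
triples = [
    ("paragraph/2", "rdf:type", "xyz:Paragraph"),
    ("paragraph/2", "rdf:_3", "Comment_1"),
    ("paragraph/2", "rdf:_1", "Paragraph1"),
    ("paragraph/2", "rdf:_2", "Comment_0"),
]

print(ordered_slots(triples, "paragraph/2"))
```

The rdf:type triple is skipped because it is not a membership predicate, so only the paragraph text and the two comment references are returned, in slot order.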

A possible solution would be adding the thread comment number as a slot of the comment.

<http://www.example.org/document/Comment_1>
        rdf:type  xyz:Comment;
        rdf:_1    <http://www.example.org/document/Comment_1/Author>;
        rdf:_2    <http://www.example.org/document/Comment_1/CommentText>;
        rdf:_3    <http://www.example.org/document/Comment_1/CommentId>;
        rdf:_4    <http://www.example.org/document/Comment_1/ThreadCommentNumber> .

<http://www.example.org/document/Comment_1/ThreadCommentNumber>
        rdf:type  xyz:ThreadCommentNumber;
        rdf:_1    "2"^^xsd:int .

<http://www.example.org/document/Comment_0>
        rdf:type  xyz:Comment;
        rdf:_1    <http://www.example.org/document/Comment_0/Author>;
        rdf:_2    <http://www.example.org/document/Comment_0/CommentText>;
        rdf:_3    <http://www.example.org/document/Comment_0/CommentId>;
        rdf:_4    <http://www.example.org/document/Comment_0/ThreadCommentNumber> .

<http://www.example.org/document/Comment_0/ThreadCommentNumber>
        rdf:type  xyz:ThreadCommentNumber;
        rdf:_1    "1"^^xsd:int .


luigi-asprino avatar Aug 10 '24 11:08 luigi-asprino

I was imagining something more in the style of sioc:has_reply + sioc:Thread but I guess what you suggest would work equally well.

kvistgaard avatar Aug 10 '24 12:08 kvistgaard

The relationship between comments and their replies is implicit in the order of the comments. Therefore, sioc:has_reply + sioc:Thread can be materialised with a SPARQL CONSTRUCT query if necessary. This is in line with the SPARQL Anything philosophy of using the minimum number of operations to transform data into RDF and leaving any further transformation to the user.
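
As an illustration only, the derivation could look like the following Python sketch. It assumes the proposed ThreadCommentNumber slot is available, that a comment numbered 1 starts a new thread, and that every later comment replies to the comment directly before it in the paragraph; the `reply_pairs` helper is hypothetical, and in practice the same logic would be expressed as a SPARQL CONSTRUCT producing sioc:has_reply triples.

```python
def reply_pairs(comments):
    """Derive (parent, reply) pairs from comments listed in slot order.

    `comments` is a list of (comment_id, thread_comment_number) tuples,
    in the order in which the comments appear as slots of the paragraph.
    """
    pairs = []
    previous = None
    for comment_id, number in comments:
        # Assumption: number 1 opens a thread; higher numbers are replies
        # to the comment that immediately precedes them.
        if number > 1 and previous is not None:
            pairs.append((previous, comment_id))
        previous = comment_id
    return pairs


# The two comments from the example above.
comments = [("Comment_0", 1), ("Comment_1", 2)]
for parent, reply in reply_pairs(comments):
    print(f"{parent} sioc:has_reply {reply}")
```

If replies should instead point at the first comment of the thread, only the value stored in `previous` needs to change.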

luigi-asprino avatar Aug 12 '24 14:08 luigi-asprino

@luigi-asprino Currently, the document part (paragraph, heading) on which a comment is made is nicely linked with the comment. Is there a way to also extract the highlighted part of the text within that item that the comment refers to?

kvistgaard avatar Sep 03 '24 15:09 kvistgaard