docxToJats Caption title parsing

Hi, thanks for the excellent work so far! I tested the current version of the converter and it seems to work very well.

A minor issue, visible in the example files: Figure

1 In vel pellentesque est, eu placerat felis sed porttitor felis in aliquet imperdiet.

I would expect the label to also contain the number (which is now in the title). It would also be good to detect number+dot as part of the label (you can find something like "Figure 1. Caption text", but you can drop the dot in the label).
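
To make the expected behaviour concrete, here is a minimal Python sketch (not the converter's PHP code) of splitting a caption string into label and title. The pattern, the accepted caption words and the function name `split_caption` are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: caption word, number, optional dot or colon, then the title.
CAPTION_RE = re.compile(r"^\s*(Figure|Fig\.?|Table)\s+(\d+)\s*[.:]?\s*(.*)$", re.DOTALL)


def split_caption(text):
    """Return (label, title); label is None when no 'Figure N' style prefix is found."""
    match = CAPTION_RE.match(text)
    if match is None:
        return None, text.strip()
    kind, number, title = match.groups()
    # The separator (dot or colon) is dropped from the label, as suggested above.
    return f"{kind} {number}", title.strip()


print(split_caption("Figure 1. Caption text"))
```

A string-based split like this only helps when the number is plain text; it does not replace reading the sequence field from the DOCX itself.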

Additionally, I did not go deeper into the code which handles captions, but would it be possible to also detect references (Zotero/Mendeley) there? In some cases people also put references in captions, and if the reference only appears in the caption, it will not be detected at all (i.e. it will not appear in the references list at the end).

Thanks and continue the great work.

Dec 21 '21 15:12 markoban

Hi @markoban,

All three major applications that allow export in DOCX format don't handle captions as per OOXML standard documentation. Another problem is that captions in OOXML are unstructured.

The only possible way I'm considering to handle such cases (including parsing unstructured references) is machine learning. More or less accurate results for this type of task (sequence labelling) can be achieved with long short-term memory (LSTM) networks or something like conditional random fields, neither of which is implemented in PHP, although there are attempts, like: https://github.com/RubixML/ML/issues/129. I really don't want to introduce dependencies in other languages for this project, so my options here are quite limited. Thus, I'm considering the following possibilities:

Continue to wait for an LSTM network implementation in PHP
Integrate TensorFlow.js, which will significantly increase the traffic between the server and the client browser. Models can be trained in and saved with Python, and then loaded in the browser with JS (without external dependencies on a production OJS instance)
Implement the algorithm in PHP myself, which requires a lot of effort for a guy like me.

Additionally, I did not go deeper into the code which handles captions, but would it be possible to also detect references (Zotero/Mendeley) there? In some cases people put references also in captions, and if the reference only appears in the caption, it will not be detected at all (i.e. appear in the references list at the end).

Captions are detected only if they are explicitly added as captions, e.g., with MS Word or LibreOffice Writer. Does either of them support putting Zotero or Mendeley references there?

Dec 24 '21 14:12 Vitaliy-1

Just checked: at least LibreOffice + Mendeley Plugin allows adding citations in a caption if the document is exported as MS Word compatible using the option that the Mendeley plugin adds.

Dec 24 '21 14:12 Vitaliy-1

Hi @Vitaliy-1,

Thanks for the explanation. The current implementation of handling captions generally seems to work fine, except that the number within the caption is transferred from label to title field in the final XML. I went to check in the source MS Word file I was using for testing and saw that this is coming from the way the caption (bookmark) is stored there. It's not a big nuance, though. If the authors correctly use captions and cross-reference option in the document, it should work fine.

Just checked, at least LibreOffice + Mendeley Plugin allows adding citations in a caption if the document is exported as MS Word compatible with the option, which the Mendeley plugin adds.

Does it also convert to xref in the final XML?

I was using MS Word with Zotero, and the reference added within the caption (it only appears there) was not detected during the conversion to XML and was simply converted to text.

Will try a bit more with a few other documents.

Dec 24 '21 14:12 markoban

Does it also convert to xref in the final XML?

I don't think I've implemented that but should be easy to add. I'll take a look.

I was using MS Word with Zotero and the reference added within the caption (it only appears there), was not detected during the conversion to XML and was simply converted to text.

I don't have MS Office installed. In LibreOffice it's:

<w:p>
    <w:pPr>
        <w:pStyle w:val="Table"/>
        <w:keepNext w:val="true"/>
        <w:bidi w:val="0"/>
        <w:spacing w:before="120" w:after="120"/>
        <w:jc w:val="left"/>
        <w:rPr></w:rPr>
    </w:pPr>
    <w:r>
        <w:rPr></w:rPr>
        <w:t xml:space="preserve">Table </w:t>
    </w:r>
    <w:r>
        <w:rPr></w:rPr>
        <w:fldChar w:fldCharType="begin"></w:fldChar>
    </w:r>
    <w:r>
        <w:rPr></w:rPr>
        <w:instrText>SEQ Table \* ARABIC</w:instrText>
    </w:r>
    <w:r>
        <w:rPr></w:rPr>
        <w:fldChar w:fldCharType="separate"/>
    </w:r>
    <w:r>
        <w:rPr></w:rPr>
        <w:t>1</w:t>
    </w:r>
    <w:r>
        <w:rPr></w:rPr>
        <w:fldChar w:fldCharType="end"/>
    </w:r>
    <w:r>
        <w:rPr></w:rPr>
        <w:t xml:space="preserve">: This is a table caption. </w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="Mendeley_Bookmark_zDXKpo7g2s"/>
    <w:bookmarkStart w:id="1" w:name="Mendeley_Bookmark_edlCspvzhW"/>
    <w:r>
        <w:rPr>
            <w:position w:val="0"/>
            <w:sz w:val="24"/>
            <w:vertAlign w:val="baseline"/>
        </w:rPr>
        <w:t>(1)</w:t>
    </w:r>
    <w:bookmarkEnd w:id="0"/>
    <w:bookmarkEnd w:id="1"/>
</w:p>
<w:tbl>
....

As can be seen, the caption here is just a paragraph styled as Table (which, in turn, is based on the Caption style); see the w:pStyle tag inside w:pPr. It appears just before or after the table. The reference is marked here with a bookmark tag with a specific id. Can you show how it looks in the case of MS Word + Zotero in document.xml inside the DOCX archive?

Dec 24 '21 15:12 Vitaliy-1

Hopefully this is the correct part of the MS Word doc :) It is not easy to find your way around the tags there.

<w:p w14:paraId="341C60CB" w14:textId="470EBF40" w:rsidR="00B90656" w:rsidRDefault="00B90656" w:rsidP="00B90656"><w:pPr><w:pStyle w:val="Caption"/><w:keepNext/></w:pPr><w:bookmarkStart w:id="0" w:name="_Ref90117101"/><w:r>
<w:t xml:space="preserve">Table </w:t></w:r><w:fldSimple w:instr=" SEQ Table \* ARABIC "><w:r><w:rPr><w:noProof/></w:rPr><w:t>1</w:t></w:r></w:fldSimple><w:bookmarkEnd w:id="0"/><w:r><w:t>. Some table caption</w:t></w:r><w:r w:rsidR="004C4814"><w:t xml:space="preserve"> </w:t></w:r><w:r w:rsidR="004C4814"><w:fldChar w:fldCharType="begin"/></w:r><w:r w:rsidR="004C4814"><w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"Ew4pyAhV","properties":{"formattedCitation":"[2]","plainCitation":"[2]","noteIndex":0},"citationItems":[{"id":2,"uris":["http://zotero.org/users/7015192/items/itemID"],"uri":["http://zotero.org/users/7015192/items/ITEMID"],"itemData":{"id":2,"type":"article-journal","abstract":"Article by authors published","container-title":"Some journal","issue":"1","page":"1-19","source":"source.site","title":"Paper Title","volume":"9","author":[{"family":"Family 1","given":"Name 1"},{"family":"Family2","given":"Name 2"}],"issued":{"date-parts":[["2021",3,30]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"} </w:instrText></w:r><w:r w:rsidR="004C4814"><w:fldChar w:fldCharType="separate"/></w:r><w:r w:rsidR="004C4814" w:rsidRPr="004C4814"><w:rPr><w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Calibri"/></w:rPr><w:t>[2]</w:t></w:r><w:r w:rsidR="004C4814"><w:fldChar w:fldCharType="end"/></w:r></w:p>

Here it is prettified (hopefully no information lost):

<w:p w14:paraId="341C60CB" w14:textId="470EBF40" w:rsidR="00B90656" w:rsidRDefault="00B90656" w:rsidP="00B90656">
  <w:pPr>
    <w:pStyle w:val="Caption"/>
    <w:keepNext/>
  </w:pPr>
  <w:bookmarkStart w:id="0" w:name="_Ref90117101"/>
  <w:r>
    <w:t xml:space="preserve">Table </w:t>
  </w:r>
  <w:fldSimple w:instr=" SEQ Table \* ARABIC ">
    <w:r>
      <w:rPr>
        <w:noProof/>
      </w:rPr>
      <w:t>1</w:t>
    </w:r>
  </w:fldSimple>
  <w:bookmarkEnd w:id="0"/>
  <w:r>
    <w:t>. Some table caption</w:t>
  </w:r>
  <w:r w:rsidR="004C4814">
    <w:t xml:space="preserve"> </w:t>
  </w:r>
  <w:r w:rsidR="004C4814">
    <w:fldChar w:fldCharType="begin"/>
  </w:r>
  <w:r w:rsidR="004C4814">
    <w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"Ew4pyAhV","properties":{"formattedCitation":"[2]","plainCitation":"[2]","noteIndex":0},"citationItems":[{"id":2,"uris":["http://zotero.org/users/7015192/items/itemID"],"uri":["http://zotero.org/users/7015192/items/ITEMID"],"itemData":{"id":2,"type":"article-journal","abstract":"Article by authors published","container-title":"Some journal","issue":"1","page":"1-19","source":"source.site","title":"Paper Title","volume":"9","author":[{"family":"Family 1","given":"Name 1"},{"family":"Family2","given":"Name 2"}],"issued":{"date-parts":[["2021",3,30]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"} </w:instrText>
  </w:r>
  <w:r w:rsidR="004C4814">
    <w:fldChar w:fldCharType="separate"/>
  </w:r>
  <w:r w:rsidR="004C4814" w:rsidRPr="004C4814">
    <w:rPr>
      <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Calibri"/>
    </w:rPr>
    <w:t>[2]</w:t>
  </w:r>
  <w:r w:rsidR="004C4814">
    <w:fldChar w:fldCharType="end"/>
  </w:r>
</w:p>

The reference is there as a Zotero item, which usually converts well to a reference, but in this case it is skipped. Note that this reference does not appear anywhere else in the document (so basically a case where an author displays a table with data obtained elsewhere).

It then converts to:

<label>Table 1</label>
<caption>
  <title>. Some table caption [2]</title>
</caption>
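
For reference, the citation data in the w:instrText run of the Word XML above is plain JSON following the `ADDIN ZOTERO_ITEM CSL_CITATION` prefix, so it can be decoded without any Zotero-specific tooling. A minimal Python sketch follows (not the converter's PHP code). The function name `parse_zotero_instruction` is hypothetical, the JSON is an abbreviated copy of the one above, and the sketch assumes the whole instruction sits in a single w:instrText element, which is not guaranteed in real documents.

```python
import json

ZOTERO_PREFIX = "ADDIN ZOTERO_ITEM CSL_CITATION"


def parse_zotero_instruction(instr_text):
    """Return the decoded CSL citation dict, or None for any other field code."""
    stripped = instr_text.strip()
    if not stripped.startswith(ZOTERO_PREFIX):
        return None
    # Everything after the prefix is a JSON object describing the citation.
    return json.loads(stripped[len(ZOTERO_PREFIX):])


# Abbreviated version of the instruction text from the caption above.
instr = (
    ' ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"Ew4pyAhV",'
    '"properties":{"plainCitation":"[2]"},'
    '"citationItems":[{"id":2,"itemData":{"type":"article-journal","title":"Paper Title"}}]} '
)

citation = parse_zotero_instruction(instr)
titles = [item["itemData"]["title"] for item in citation["citationItems"]]
print(citation["properties"]["plainCitation"], titles)
```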

Dec 24 '21 15:12 markoban

After inspecting the code, the best option so far is:

Identify and save the reference from a caption when the caption is identified and set: https://github.com/Vitaliy-1/docxToJats/blob/30bb7c165d11832a21ea87769b9b368940fe58f1/src/docx2jats/objectModel/Document.php#L146, using the same approach as for references in paragraphs here: https://github.com/Vitaliy-1/docxToJats/blob/30bb7c165d11832a21ea87769b9b368940fe58f1/src/docx2jats/objectModel/Document.php#L154
Treat the caption title as a Text object: https://github.com/Vitaliy-1/docxToJats/blob/30bb7c165d11832a21ea87769b9b368940fe58f1/src/docx2jats/objectModel/body/Table.php#L107 but don't allow formatted text (the in-text citation is a property of Text)
Don't include text runs that belong to a Complex Field Character (see the w:fldChar tag) in the caption title since, in the examples presented, they point to the table/figure sequence number. This will fix the issue:

1 In vel pellentesque est, eu placerat felis sed porttitor felis in aliquet imperdiet.
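
A minimal Python sketch of the third point (not the project's PHP code): walk the runs of a caption paragraph and keep only the text outside w:fldChar begin/end pairs. The function name `text_outside_fields` and the sample paragraph are assumptions for illustration. Taken literally, this also drops the visible result of any other complex field, such as a Zotero citation, which would need separate handling.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def text_outside_fields(paragraph):
    """Concatenate w:t text from direct child runs that are not inside a complex field."""
    depth = 0
    parts = []
    for run in paragraph.findall("w:r", NS):
        fld_char = run.find("w:fldChar", NS)
        if fld_char is not None:
            kind = fld_char.get(f"{{{W}}}fldCharType")
            if kind == "begin":
                depth += 1
            elif kind == "end":
                depth -= 1
            continue  # field marker runs carry no caption text
        if depth == 0:
            parts.extend(t.text or "" for t in run.findall("w:t", NS))
    return "".join(parts)


# Reduced copy of the LibreOffice caption paragraph shown earlier in the thread.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Table </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText>SEQ Table \\* ARABIC</w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>1</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t>: This is a table caption.</w:t></w:r>
</w:p>
"""

print(text_outside_fields(ET.fromstring(SAMPLE)))
```

Because only direct child runs are visited, runs wrapped in w:fldSimple (the MS Word variant) are skipped as well.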

Dec 29 '21 17:12 Vitaliy-1

I am slowly fighting my way through the code, and looking at it I wonder if it would be possible to treat the caption as a paragraph object. As a simple test, I added a line to create a dummy paragraph when the caption is detected in the Document.php file. It wasn't used afterwards, but it did pick up the Zotero reference item and added it to the bibliography list. I wanted to pass the paragraph object further to the table object (as it is with Label and Title elements) but ran into some recursion issues and memory problems when getting to the part where the information should be parsed into XML.

As it is now, the caption title is treated as plain text, so it does not pick up any additional formatting either (we have quite a few subscripts there, as in CO2, H2O, etc.).

As I see it, treating the Caption title as a paragraph (which it basically is in Word) might be the easiest way to handle all possible situations, and in the XML creation part to simply strip the paragraph tag, but to leave all the other elements found in it. This would also be in accordance with the JATS standard.

Jan 05 '22 14:01 markoban

I am slowly fighting my way through the code, and looking at it I wonder if it would be possible to treat the caption as a paragraph object.

As I see it, treating the Caption title as a paragraph (which it basically is in Word) might be the easiest way to handle all possible situations

Yes, MS Word and LibreOffice Writer treat captions as a paragraph with a Caption style. But according to the OOXML documentation, it should be a caption element inside the table properties (see http://officeopenxml.com/WPtableCaption.php). It's also allowed to have several paragraphs with caption styling:

        <w:p>
            <w:pPr>
                <w:pStyle w:val="Caption"/>
                <w:keepNext w:val="true"/>
                <w:bidi w:val="0"/>
                <w:spacing w:before="120" w:after="120"/>
                <w:jc w:val="left"/>
                <w:rPr></w:rPr>
            </w:pPr>
            <w:r>
                <w:rPr></w:rPr>
                <w:t xml:space="preserve">Table </w:t>
            </w:r>
            <w:r>
                <w:rPr></w:rPr>
                <w:fldChar w:fldCharType="begin"></w:fldChar>
            </w:r>
            <w:r>
                <w:rPr></w:rPr>
                <w:instrText>SEQ Table \* ARABIC</w:instrText>
            </w:r>
            <w:r>
                <w:rPr></w:rPr>
                <w:fldChar w:fldCharType="separate"/>
            </w:r>
            <w:r>
                <w:rPr></w:rPr>
                <w:t>1</w:t>
            </w:r>
            <w:r>
                <w:rPr></w:rPr>
                <w:fldChar w:fldCharType="end"/>
            </w:r>
            <w:r>
                <w:rPr></w:rPr>
                <w:t>: This is a table caption</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Caption"/>
                <w:bidi w:val="0"/>
                <w:jc w:val="left"/>
                <w:rPr></w:rPr>
            </w:pPr>
            <w:r>
                <w:rPr></w:rPr>
                <w:t>This is another table caption that occupies another paragraph</w:t>
            </w:r>
        </w:p>

So, if we stick with this widespread practice, we can assume a caption element consisting of several paragraphs. The first problem here is identifying the link between a table/figure and its caption. For now, it's done by finding the nearest table or figure: the analogue of a lookbehind on one element and then a lookahead until a table or figure is found, which isn't exactly the right algorithm.
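
A minimal Python sketch of that lookup as described here (my reading of the description, not the actual PHP code). The body is modelled as a list of element kinds, and the function name `find_caption_target` is hypothetical.

```python
def find_caption_target(kinds, caption_index):
    """Return the index of the table/figure a caption is attached to, or None.

    kinds is a list of element kinds in document order, e.g.
    ["paragraph", "caption", "table"].
    """
    targets = {"table", "figure"}
    # Lookbehind on exactly one element.
    previous = caption_index - 1
    if previous >= 0 and kinds[previous] in targets:
        return previous
    # Lookahead until the first table or figure.
    for index in range(caption_index + 1, len(kinds)):
        if kinds[index] in targets:
            return index
    return None


print(find_caption_target(["paragraph", "caption", "table"], 1))
print(find_caption_target(["figure", "caption", "paragraph", "table"], 1))
```

The second call shows the weakness: a caption placed above its table is attached to an unrelated figure that happens to precede it.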

The second problem here is extracting unstructured caption data from these paragraphs with the caption style. JATS supports label, title and description, presented as simple paragraph elements (p). In the current state, it's impossible to do this just by parsing OOXML, and I hope to delegate this task to a model trained with specific ML algorithms.

As an intermediate solution, I decided to treat parsed data from captions as a table title. Although, according to the JATS specification, a title can include formatted text, links, etc., as a p element does, this isn't supported by Texture (JATS editor) and it overloads the document representation with redundant styling information. The link (e.g., to a reference) is a specific case and I think it should be allowed here. It's part of a broader problem of simplifying the JATS format for reusability; see, e.g., the DAR subset of JATS XML.

So, answering the initial question, it's better not to allow in a table/figure title the same elements that are allowed in a paragraph. Let's allow only links to references there, as an exception, until a valid approach to discriminating between caption title and description is introduced.

Jan 05 '22 16:01 Vitaliy-1

I wanted to pass the paragraph object further to the table object (as it is with Label and Title elements) but ran into some recursion issues and memory problems when getting to the part where the information should be parsed into XML

Maybe it's because the text inside paragraphs is parsed with a tricky strategy. I was forced to implement it because of the different nature of text formatting in the OOXML and JATS XML/HTML worlds. The styling in OOXML is defined as a property of a text run element, but in JATS XML it's defined by a tag (<italic>, <bold>).
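
To illustrate the mismatch, here is a minimal Python sketch (not the converter's actual strategy) that turns runs carrying style properties into nested inline tags. The run representation as (text, frozenset of styles), the tag order and the name `runs_to_jats` are assumptions for illustration.

```python
from itertools import groupby
from xml.sax.saxutils import escape

# Fixed nesting order, outermost first, so equal style sets always nest the same way.
TAG_ORDER = ("bold", "italic", "sub", "sup")


def runs_to_jats(runs):
    """Convert [(text, frozenset_of_styles), ...] to a string with JATS inline tags."""
    pieces = []
    # Adjacent runs with identical styling are merged into one tagged span.
    for styles, group in groupby(runs, key=lambda run: run[1]):
        text = escape("".join(chunk for chunk, _ in group))
        for tag in reversed([t for t in TAG_ORDER if t in styles]):
            text = f"<{tag}>{text}</{tag}>"
        pieces.append(text)
    return "".join(pieces)


runs = [("CO", frozenset()), ("2", frozenset({"sub"})), (" emissions", frozenset())]
print(runs_to_jats(runs))
```

This simple grouping closes and reopens every tag whenever the style set changes, so it avoids overlapping tags at the cost of some redundant markup.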

Jan 05 '22 16:01 Vitaliy-1

So, if we stick with this widespread practice, we can assume a caption element consisting of several paragraphs. The first problem here is identifying the link between a table/figure and its caption. For now, it's done by finding the nearest table or figure: the analogue of a lookbehind on one element and then a lookahead until a table or figure is found, which isn't exactly the right algorithm.

Generally, I think it is a good approach and would stick to it. I can see the problem arising only in a case when there are multiple figures under one caption (Fig 1a, b, c...) which have "sub-captions" but which are not defined with a captioning tool nor style. But I believe the caption would in this case be linked to the last of the images so no great loss there.

To keep it simple, and yet cover the vast majority of cases, I would not process as captions any paragraphs that do not contain a w:instrText tag with a SEQ Table/Figure part (and possibly a bookmark reference code). I must admit, I have not yet come across a multiple-paragraph caption in our papers.
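
A minimal Python sketch of such a filter (an illustration, not the converter's code). It checks both field syntaxes seen in this thread: w:instrText (LibreOffice) and the w:instr attribute of w:fldSimple (MS Word). The name `is_numbered_caption` is hypothetical, and the sketch assumes English sequence names and an instruction that is not split across several runs.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SEQ_RE = re.compile(r"\s*SEQ\s+(Table|Figure)\b")


def is_numbered_caption(paragraph):
    """True if the paragraph holds a SEQ Table/Figure field in either field syntax."""
    instructions = [el.text or "" for el in paragraph.iter(f"{{{W}}}instrText")]
    instructions += [el.get(f"{{{W}}}instr", "") for el in paragraph.iter(f"{{{W}}}fldSimple")]
    return any(SEQ_RE.match(instr) for instr in instructions)


WORD_STYLE = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:fldSimple w:instr=" SEQ Table \\* ARABIC ">'
    "<w:r><w:t>1</w:t></w:r></w:fldSimple></w:p>"
)
PLAIN = ET.fromstring(f'<w:p xmlns:w="{W}"><w:r><w:t>Just text</w:t></w:r></w:p>')

print(is_numbered_caption(WORD_STYLE), is_numbered_caption(PLAIN))
```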

For general usability, simply getting bibliographic references caught in the caption part and converted to an xref ref-type="bibr" tag within the caption title in the final XML would be a huge step forward.

Jan 05 '22 18:01 markoban

@markoban, in theory, it should work but I haven't tested it yet. The changes for parsing references in table or figure caption are on this branch: https://github.com/Vitaliy-1/docxToJats/tree/i28_caption_refs. It's implemented only for MS Word for now.

Feb 20 '22 18:02 Vitaliy-1

Hi @Vitaliy-1

Thanks, I'll give it a go :)

Feb 21 '22 10:02 markoban

This is implemented.

Jan 09 '23 11:01 Vitaliy-1
