docxToJats
Caption title parsing
Hi, thanks for the excellent work so far! I tested the current version of the converter and it seems to work very well.
A minor issue, visible in the example files:
I would expect the label to also contain the number (which is now in the title). It would also be good to detect number+dot as part of the label (you can find something like "Figure 1. Caption text", but the dot can be dropped from the label).
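For a caption that is already available as a flat string, the expected split could look like the sketch below. It is only an illustration in Python (the converter itself is written in PHP); `split_caption` and the list of label words are assumptions, not part of docxToJats.

```python
import re

# Hypothetical helper: split a flat caption string into (label, title).
CAPTION_RE = re.compile(
    r"^\s*(?P<label>(?:Figure|Fig\.?|Table)\s+\d+)"  # e.g. "Figure 1"
    r"\s*[.:]?\s*"                                     # optional dot or colon, dropped
    r"(?P<title>.*)$",
    re.IGNORECASE | re.DOTALL,
)


def split_caption(text):
    """Return (label, title); label is None when no numbered label is found."""
    match = CAPTION_RE.match(text)
    if match is None:
        return None, text.strip()
    return match.group("label"), match.group("title").strip()


print(split_caption("Figure 1. Caption text"))
```

The dot or colon after the number is consumed by the pattern, so it ends up in neither the label nor the title.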
Additionally, I did not go deeper into the code which handles captions, but would it be possible to also detect references (Zotero/Mendeley) there? In some cases people also put references in captions, and if the reference only appears in the caption, it will not be detected at all (i.e. it will not appear in the references list at the end).
Thanks and continue the great work.
Hi @markoban,
None of the three major applications that export to DOCX handle captions as the OOXML standard documentation describes. Another problem is that captions in OOXML are unstructured.
The only feasible way I see to handle such cases (including parsing unstructured references) is machine learning. Reasonably accurate results for this type of task (sequence labelling) can be achieved with long short-term memory (LSTM) networks or something like conditional random fields, neither of which is implemented in PHP, although there are attempts, like: https://github.com/RubixML/ML/issues/129. I really don't want to introduce dependencies in other languages for this project, so my options here are quite limited. Thus, I'm considering the following possibilities:
- Continue to wait for an LSTM network implementation in PHP
- Integrate TensorFlow JS, which will significantly increase the traffic between the server and the client browser. Models can be trained in and saved with Python, and then loaded in the browser with JS (without external dependencies on a production OJS instance)
- Implement the algorithm in PHP myself, which requires a lot of effort for a guy like me.
Additionally, I did not go deeper into the code which handles captions, but would it be possible to also detect references (Zotero/Mendeley) there? In some cases people put references also in captions, and if the reference only appears in the caption, it will not be detected at all (i.e. appear in the references list at the end).
Captions are detected only if, e.g., they are explicitly added as captions with MS Word or LibreOffice Writer. Does either of them support putting Zotero or Mendeley references there?
Just checked: at least LibreOffice + the Mendeley plugin allows adding citations in a caption if the document is exported as MS Word compatible with the option which the Mendeley plugin adds.
Hi @Vitaliy-1,
Thanks for the explanation. The current implementation of caption handling generally seems to work fine, except that the number within the caption is moved from the label to the title field in the final XML. I checked the source MS Word file I was using for testing and saw that this comes from the way the caption (bookmark) is stored there. It's not a big issue, though. If the authors correctly use captions and the cross-reference option in the document, it should work fine.
Just checked, at least LibreOffice + Mendeley Plugin allows adding citations in a caption if the document is exported as MS Word compatible with the option, which the Mendeley plugin adds.
Does it also convert to xref in the final XML?
I was using MS Word with Zotero and the reference added within the caption (it only appears there), was not detected during the conversion to XML and was simply converted to text.
I will try a bit more with a few other documents.
Does it also convert to xref in the final XML?
I don't think I've implemented that but should be easy to add. I'll take a look.
I was using MS Word with Zotero and the reference added within the caption (it only appears there), was not detected during the conversion to XML and was simply converted to text.
I don't have MS Office installed. In LibreOffice it's:
<w:p>
<w:pPr>
<w:pStyle w:val="Table"/>
<w:keepNext w:val="true"/>
<w:bidi w:val="0"/>
<w:spacing w:before="120" w:after="120"/>
<w:jc w:val="left"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r>
<w:rPr></w:rPr>
<w:t xml:space="preserve">Table </w:t>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:fldChar w:fldCharType="begin"></w:fldChar>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:instrText>SEQ Table \* ARABIC</w:instrText>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:t>1</w:t>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:t xml:space="preserve">: This is a table caption. </w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="Mendeley_Bookmark_zDXKpo7g2s"/>
<w:bookmarkStart w:id="1" w:name="Mendeley_Bookmark_edlCspvzhW"/>
<w:r>
<w:rPr>
<w:position w:val="0"/>
<w:sz w:val="24"/>
<w:vertAlign w:val="baseline"/>
</w:rPr>
<w:t>(1)</w:t>
</w:r>
<w:bookmarkEnd w:id="0"/>
<w:bookmarkEnd w:id="1"/>
</w:p>
<w:tbl>
....
As can be seen, the caption here is just a paragraph styled as Table (which, in turn, is based on the Caption style); see the pStyle tag inside w:pPr. It appears just before or after the table. The reference is marked here with a bookmark tag with a specific id.
Can you show how it looks in the case of MS Word + Zotero in document.xml inside the DOCX archive?
Hopefully this is the correct part of the MS Word doc :) It is not easy to find your way around the tags there.
<w:p w14:paraId="341C60CB" w14:textId="470EBF40" w:rsidR="00B90656" w:rsidRDefault="00B90656" w:rsidP="00B90656"><w:pPr><w:pStyle w:val="Caption"/><w:keepNext/></w:pPr><w:bookmarkStart w:id="0" w:name="_Ref90117101"/><w:r>
<w:t xml:space="preserve">Table </w:t></w:r><w:fldSimple w:instr=" SEQ Table \* ARABIC "><w:r><w:rPr><w:noProof/></w:rPr><w:t>1</w:t></w:r></w:fldSimple><w:bookmarkEnd w:id="0"/><w:r><w:t>. Some table caption</w:t></w:r><w:r w:rsidR="004C4814"><w:t xml:space="preserve"> </w:t></w:r><w:r w:rsidR="004C4814"><w:fldChar w:fldCharType="begin"/></w:r><w:r w:rsidR="004C4814"><w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"Ew4pyAhV","properties":{"formattedCitation":"[2]","plainCitation":"[2]","noteIndex":0},"citationItems":[{"id":2,"uris":["http://zotero.org/users/7015192/items/itemID"],"uri":["http://zotero.org/users/7015192/items/ITEMID"],"itemData":{"id":2,"type":"article-journal","abstract":"Article by authors published","container-title":"Some journal","issue":"1","page":"1-19","source":"source.site","title":"Paper Title","volume":"9","author":[{"family":"Family 1","given":"Name 1"},{"family":"Family2","given":"Name 2"}],"issued":{"date-parts":[["2021",3,30]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"} </w:instrText></w:r><w:r w:rsidR="004C4814"><w:fldChar w:fldCharType="separate"/></w:r><w:r w:rsidR="004C4814" w:rsidRPr="004C4814"><w:rPr><w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Calibri"/></w:rPr><w:t>[2]</w:t></w:r><w:r w:rsidR="004C4814"><w:fldChar w:fldCharType="end"/></w:r></w:p>
Here it is prettified (hopefully no information lost):
<w:p w14:paraId="341C60CB" w14:textId="470EBF40" w:rsidR="00B90656" w:rsidRDefault="00B90656" w:rsidP="00B90656">
<w:pPr>
<w:pStyle w:val="Caption"/>
<w:keepNext/>
</w:pPr>
<w:bookmarkStart w:id="0" w:name="_Ref90117101"/>
<w:r>
<w:t xml:space="preserve">Table </w:t>
</w:r>
<w:fldSimple w:instr=" SEQ Table \* ARABIC ">
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:t>1</w:t>
</w:r>
</w:fldSimple>
<w:bookmarkEnd w:id="0"/>
<w:r>
<w:t>. Some table caption</w:t>
</w:r>
<w:r w:rsidR="004C4814">
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:r w:rsidR="004C4814">
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="004C4814">
<w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"Ew4pyAhV","properties":{"formattedCitation":"[2]","plainCitation":"[2]","noteIndex":0},"citationItems":[{"id":2,"uris":["http://zotero.org/users/7015192/items/itemID"],"uri":["http://zotero.org/users/7015192/items/ITEMID"],"itemData":{"id":2,"type":"article-journal","abstract":"Article by authors published","container-title":"Some journal","issue":"1","page":"1-19","source":"source.site","title":"Paper Title","volume":"9","author":[{"family":"Family 1","given":"Name 1"},{"family":"Family2","given":"Name 2"}],"issued":{"date-parts":[["2021",3,30]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"} </w:instrText>
</w:r>
<w:r w:rsidR="004C4814">
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="004C4814" w:rsidRPr="004C4814">
<w:rPr>
<w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Calibri"/>
</w:rPr>
<w:t>[2]</w:t>
</w:r>
<w:r w:rsidR="004C4814">
<w:fldChar w:fldCharType="end"/>
</w:r>
</w:p>
The reference is there as a Zotero item, which usually converts well to a reference, but in this case it is skipped. Note that this reference does not appear anywhere else in the document (so basically a case where an author displays a table with data obtained elsewhere).
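To make the structure of that field concrete: the instruction text is a fixed prefix followed by a CSL citation serialized as JSON. The Python sketch below shows one way to read it. `parse_zotero_field` is a hypothetical helper, and it assumes the whole instruction is already joined into one string (Word may split it across several w:instrText runs).

```python
import json

ZOTERO_PREFIX = "ADDIN ZOTERO_ITEM CSL_CITATION"


def parse_zotero_field(instr_text):
    """Return the CSL citation payload of a Zotero field instruction, or None."""
    instr = instr_text.strip()
    if not instr.startswith(ZOTERO_PREFIX):
        return None
    payload = instr[len(ZOTERO_PREFIX):].strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


# Shortened version of the instruction text from the example above.
instr = (
    ' ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"Ew4pyAhV",'
    '"properties":{"plainCitation":"[2]"},'
    '"citationItems":[{"id":2,"itemData":{"id":2,'
    '"type":"article-journal","title":"Paper Title"}}]} '
)
citation = parse_zotero_field(instr)
print(citation["properties"]["plainCitation"])
for item in citation["citationItems"]:
    print(item["id"], item["itemData"]["title"])
```

The itemData entries carry enough metadata (title, authors, issue, pages) to build a reference list entry even when the citation appears only in a caption.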
It then converts to:
<label>Table 1</label>
<caption>
<title>. Some table caption [2]</title>
</caption>
After inspecting the code, the best option so far is:
- Identify and save the reference from a caption in the document when the caption is identified and set: https://github.com/Vitaliy-1/docxToJats/blob/30bb7c165d11832a21ea87769b9b368940fe58f1/src/docx2jats/objectModel/Document.php#L146, using the same approach as for references in paragraphs here: https://github.com/Vitaliy-1/docxToJats/blob/30bb7c165d11832a21ea87769b9b368940fe58f1/src/docx2jats/objectModel/Document.php#L154
- Treat the caption title as a Text object: https://github.com/Vitaliy-1/docxToJats/blob/30bb7c165d11832a21ea87769b9b368940fe58f1/src/docx2jats/objectModel/body/Table.php#L107 but don't allow formatted text (in-text citation is a property of Text)
- Don't include text runs enclosed in a Complex Field Character (see the w:fldChar tag) in the caption title as, in the examples presented, they point to the table/figure sequence number. This will fix the issue:
1 In vel pellentesque est, eu placerat felis sed porttitor felis in aliquet imperdiet.
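The third point can be sketched as a small state machine over the runs of a caption paragraph. This is an illustration in Python, not the converter's PHP code; `split_caption_runs` is a hypothetical name, and the sketch only covers complex fields (w:fldChar), not the w:fldSimple form that MS Word produced in the example above. Only the result of a SEQ field is routed to the label, so the result of a later citation field stays in the title.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Qualified name of a WordprocessingML element or attribute."""
    return "{%s}%s" % (W, tag)


def split_caption_runs(paragraph):
    """Split a caption <w:p> into (label, title) using complex field markers."""
    label, title = [], []
    instr = ""         # instruction text of the field currently being read
    in_instr = False   # between fldChar "begin" and "separate"
    in_result = False  # between fldChar "separate" and "end"
    seq_seen = False   # True once a SEQ field has been closed

    for node in paragraph.iter():
        if node.tag == w("fldChar"):
            kind = node.get(w("fldCharType"))
            if kind == "begin":
                instr, in_instr, in_result = "", True, False
            elif kind == "separate":
                in_instr, in_result = False, True
            elif kind == "end":
                if instr.strip().startswith("SEQ"):
                    seq_seen = True
                instr, in_instr, in_result = "", False, False
        elif node.tag == w("instrText") and in_instr:
            instr += node.text or ""
        elif node.tag == w("t"):
            is_seq_result = in_result and instr.strip().startswith("SEQ")
            # Text up to and including the sequence number belongs to the label.
            if is_seq_result or not seq_seen:
                label.append(node.text or "")
            else:
                title.append(node.text or "")
    # Drop the separator (dot, colon, spaces) left at the start of the title.
    return "".join(label).strip(), "".join(title).lstrip(" .:").strip()


SAMPLE = r"""<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:r><w:t xml:space="preserve">Table </w:t></w:r>
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText>SEQ Table \* ARABIC</w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:t>1</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
<w:r><w:t xml:space="preserve">: This is a table caption. </w:t></w:r>
</w:p>"""

print(split_caption_runs(ET.fromstring(SAMPLE)))
```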
I am slowly fighting my way through the code, and looking at it I wonder if it would be possible to treat the caption as a paragraph object. As a simple test, I added a line to create a dummy paragraph when the caption is detected in the Document.php file. It wasn't used afterwards, but it did pick up the Zotero reference item and added it to the bibliography list. I wanted to pass the paragraph object further to the table object (as it is with the Label and Title elements) but ran into some recursion issues and memory problems when getting to the part where the information should be parsed into XML.
As it is now, the caption title is treated as plain text, so it does not pick up any additional formatting either (we have quite a few subscripts there, as in CO2, H2O, etc.).
As I see it, treating the Caption title as a paragraph (which it basically is in Word) might be the easiest way to handle all possible situations, and in the XML creation part to simply strip the paragraph tag, but to leave all the other elements found in it. This would also be in accordance with the JATS standard.
I am slowly fighting my way through the code, and looking at it I wonder if it would be possible to treat the caption as a paragraph object.
As I see it, treating the Caption title as a paragraph (which it basically is in Word) might be the easiest way to handle all possible situations
Yes, MS Word and LibreOffice Writer treat captions as a paragraph with a Caption style. But according to the OOXML documentation, it should be a caption element inside the table properties: http://officeopenxml.com/WPtableCaption.php
It's allowed to have several paragraphs with caption styling:
<w:p>
<w:pPr>
<w:pStyle w:val="Caption"/>
<w:keepNext w:val="true"/>
<w:bidi w:val="0"/>
<w:spacing w:before="120" w:after="120"/>
<w:jc w:val="left"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r>
<w:rPr></w:rPr>
<w:t xml:space="preserve">Table </w:t>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:fldChar w:fldCharType="begin"></w:fldChar>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:instrText>SEQ Table \* ARABIC</w:instrText>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:t>1</w:t>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
<w:r>
<w:rPr></w:rPr>
<w:t>: This is a table caption</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="Caption"/>
<w:bidi w:val="0"/>
<w:jc w:val="left"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r>
<w:rPr></w:rPr>
<w:t>This is another table caption that occupies another paragraph</w:t>
</w:r>
</w:p>
So, if we stick with this widespread practice, we can assume a caption element consisting of several paragraphs. The first problem here is identifying the link between a table/figure and its caption. For now, it's done by finding the nearest table or figure: the analogue of a lookbehind on one element and then a lookahead until a table or figure is found, which isn't exactly the right algorithm.
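A minimal sketch of that nearest-element search, in Python for illustration only: the body is reduced to a list of element kinds, and both the kind names and `find_caption_target` are hypothetical, not the converter's actual data model.

```python
def find_caption_target(kinds, caption_index):
    """Return the index of the table/figure a caption belongs to, or None.

    kinds is the ordered list of body element kinds, for example
    ["paragraph", "caption", "table"].
    """
    targets = {"table", "figure"}
    # Look behind exactly one element.
    previous = caption_index - 1
    if previous >= 0 and kinds[previous] in targets:
        return previous
    # Otherwise look ahead until a table or figure is found.
    for index in range(caption_index + 1, len(kinds)):
        if kinds[index] in targets:
            return index
    return None


print(find_caption_target(["paragraph", "caption", "table"], 1))
print(find_caption_target(["figure", "caption", "paragraph"], 1))
```

With a two-paragraph caption placed above a table, the second caption paragraph is simply skipped by the lookahead, so both paragraphs resolve to the same table.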
The second problem is extracting unstructured caption data from these paragraphs with caption style. JATS supports label, title and a description, presented as simple paragraph elements (p). In the current state, this is impossible to do just by parsing OOXML, and I hope to delegate this task to a trained ML model.
As an intermediate solution, I decided to treat parsed data from captions as a table title. Although title, according to the JATS specification, can include formatted text, links, etc., as a p element does, this isn't supported by Texture (JATS editor) and it overloads the document representation with redundant styling information. A link (e.g., to a reference) is a specific case and I think it should be allowed here. It's part of a broader problem of simplifying the JATS format for reusability; see, e.g., the DAR subset of JATS XML.
So, answering the initial question: it's better not to allow in a table/figure title the same elements that are allowed in a paragraph. Let's allow only links to references there, as an exception, until a valid approach for discriminating between caption title and description is introduced.
I wanted to pass the paragraph object further to the table object (as it is with Label and Title elements) but ran into some recursion issues and memory problems when getting to the part where the information should be parsed into XML
Maybe it's because the text inside paragraphs is parsed with a tricky strategy. I was forced to implement it because of the different nature of text formatting in the OOXML and JATS XML/HTML worlds. Styling in OOXML is defined as a property of a text run element, but in JATS XML it's defined by a tag (<italic>, <bold>).
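The difference between the two models can be illustrated with a much simplified Python sketch. It is not the strategy used in the converter; `runs_to_jats` and the `(text, styles)` run representation are assumptions made for the example. Each run carries its styling as a property, and the output wraps the text in the corresponding JATS tags.

```python
from xml.sax.saxutils import escape

# Order in which JATS formatting tags are nested (an arbitrary choice).
TAG_ORDER = ["bold", "italic", "sub", "sup"]


def runs_to_jats(runs):
    """Convert (text, styles) runs into a JATS inline string.

    Adjacent runs with identical styling are merged first, so that one
    formatted phrase split over several runs produces a single tag pair.
    """
    merged = []
    for text, styles in runs:
        styles = frozenset(styles)
        if merged and merged[-1][1] == styles:
            merged[-1] = (merged[-1][0] + text, styles)
        else:
            merged.append((text, styles))

    parts = []
    for text, styles in merged:
        tags = [tag for tag in TAG_ORDER if tag in styles]
        opening = "".join("<%s>" % tag for tag in tags)
        closing = "".join("</%s>" % tag for tag in reversed(tags))
        parts.append(opening + escape(text) + closing)
    return "".join(parts)


runs = [("CO", set()), ("2", {"sub"}), (" and ", set()),
        ("H", set()), ("2", {"sub"}), ("O", set())]
print(runs_to_jats(runs))  # CO<sub>2</sub> and H<sub>2</sub>O
```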
So, if we stick with this widespread practice, we can assume a caption element consisting of several paragraphs. The first problem here is identifying the link between a table/figure and its caption. For now, it's done by finding the nearest table or figure: the analogue of a lookbehind on one element and then a lookahead until a table or figure is found, which isn't exactly the right algorithm.
Generally, I think it is a good approach and would stick to it. I can see the problem arising only in a case when there are multiple figures under one caption (Fig 1a, b, c...) which have "sub-captions" but which are not defined with a captioning tool nor style. But I believe the caption would in this case be linked to the last of the images so no great loss there.
To keep it simple, yet cover the vast majority of cases, I would not process as captions those paragraphs which do not contain a w:instrText tag with a SEQ Table/Figure part (and possibly a bookmark reference code). I must admit, I have not yet come across a multiple-paragraph caption in our papers.
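Such a filter could look like the Python sketch below. It is an assumption-laden illustration: `is_numbered_caption` is a hypothetical name, only the sequence names Table and Figure are recognized, and the w:fldSimple form is checked as well because that is how MS Word stored the SEQ field in the example earlier in this thread.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SEQ_RE = re.compile(r"^\s*SEQ\s+(Table|Figure)\b", re.IGNORECASE)


def is_numbered_caption(paragraph):
    """True if the paragraph holds a SEQ Table/Figure field (complex or simple)."""
    in_complex = any(
        SEQ_RE.match(node.text or "")
        for node in paragraph.iter("{%s}instrText" % W)
    )
    in_simple = any(
        SEQ_RE.match(node.get("{%s}instr" % W, ""))
        for node in paragraph.iter("{%s}fldSimple" % W)
    )
    return in_complex or in_simple


NS = 'xmlns:w="%s"' % W
complex_form = ET.fromstring(
    r'<w:p %s><w:r><w:instrText>SEQ Table \* ARABIC</w:instrText></w:r></w:p>' % NS
)
simple_form = ET.fromstring(
    r'<w:p %s><w:fldSimple w:instr=" SEQ Table \* ARABIC ">'
    r'<w:r><w:t>1</w:t></w:r></w:fldSimple></w:p>' % NS
)
plain = ET.fromstring('<w:p %s><w:r><w:t>Just text</w:t></w:r></w:p>' % NS)

print(is_numbered_caption(complex_form),
      is_numbered_caption(simple_form),
      is_numbered_caption(plain))
```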
For general usability, simply getting bibliographic references caught in the caption part and converted to an xref ref-type="bibr" tag in the final XML within the caption title would be a huge step forward.
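The desired output for the MS Word + Zotero example could be built as in the Python sketch below. `build_caption` and the rid value `bib2` are made up for illustration; how reference ids are actually assigned in the converter is not shown in this thread.

```python
import xml.etree.ElementTree as ET


def build_caption(label, title_parts):
    """Build <label> and <caption><title> elements.

    title_parts mixes plain strings with (rid, text) tuples; each tuple
    becomes an <xref ref-type="bibr"> element inside the title.
    """
    label_el = ET.Element("label")
    label_el.text = label
    caption_el = ET.Element("caption")
    title_el = ET.SubElement(caption_el, "title")
    last_xref = None
    for part in title_parts:
        if isinstance(part, tuple):
            rid, text = part
            last_xref = ET.SubElement(
                title_el, "xref", {"ref-type": "bibr", "rid": rid}
            )
            last_xref.text = text
        elif last_xref is None:
            # Text before the first citation is the title's own text.
            title_el.text = (title_el.text or "") + part
        else:
            # Text after a citation is stored as that element's tail.
            last_xref.tail = (last_xref.tail or "") + part
    return label_el, caption_el


label_el, caption_el = build_caption(
    "Table 1", ["Some table caption ", ("bib2", "[2]")]
)
print(ET.tostring(label_el, encoding="unicode"))
print(ET.tostring(caption_el, encoding="unicode"))
```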
@markoban, in theory, it should work but I haven't tested it yet. The changes for parsing references in table or figure caption are on this branch: https://github.com/Vitaliy-1/docxToJats/tree/i28_caption_refs. It's implemented only for MS Word for now.
Hi @Vitaliy-1
Thanks, I'll give it a go :)
This is now implemented.