PHPWord
Issue with reading Tabs in Heading 1 from docx file
Firstly, this is an excellent library and has enabled us to import data from old documents into our web database with relative ease. I can't upload the original source document as it's unpublished work, but I've been able to reproduce the issue with a clean document.
In very simple terms, if a line is formatted as "Heading 1" and contains text delimited by tabs, it loads as a single "Title" object containing just the text before the first tab. If that same text is formatted as "Normal", it loads as a TextRun consisting of Text objects that together make up the original document text.
I've tested this on release 18.2
Document:
Test (tab) test (tab) test
test2 (tab) test2 (tab) test2
Saved as a .docx (I'm using LibreOffice on Linux, but my original document was a Word document from a Windows environment)
Simple test code to extract the text
<?php
require_once 'vendor/autoload.php';
$phpWord = \PhpOffice\PhpWord\IOFactory::load('test.docx');
foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $ele1) {
        echo "Element Class: " . get_class($ele1) . "\n";
        if ($ele1 instanceof \PhpOffice\PhpWord\Element\Text) {
            echo $ele1->getText() . "\n";
        }
        if ($ele1 instanceof \PhpOffice\PhpWord\Element\Title) {
            echo $ele1->getText() . "\n";
        }
        if ($ele1 instanceof \PhpOffice\PhpWord\Element\TextRun) {
            foreach ($ele1->getElements() as $ele2) {
                echo "Class: " . get_class($ele2) . "\n";
                if ($ele2 instanceof \PhpOffice\PhpWord\Element\Text) {
                    echo $ele2->getText() . "\n";
                }
            }
        }
    }
}
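The script above echoes `Title::getText()` directly, but that method can return either a plain string or a TextRun. A slightly more defensive variant might flatten whatever it gets back into a string. This is only a sketch: `elementToText` is a hypothetical helper name, not part of PhpWord, and it cannot recover text that the reader has already dropped while loading the file.

```php
<?php
require_once 'vendor/autoload.php';

use PhpOffice\PhpWord\Element\AbstractContainer;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\Element\Title;

/**
 * Hypothetical helper: flatten a PhpWord element into plain text.
 * Handles Text, Title (string or TextRun payload) and any container
 * such as TextRun; every other element type yields an empty string.
 */
function elementToText($element): string
{
    if ($element instanceof Title) {
        $text = $element->getText();
        // getText() is either a string or a TextRun, so recurse if needed.
        return is_string($text) ? $text : elementToText($text);
    }
    if ($element instanceof Text) {
        return (string) $element->getText();
    }
    if ($element instanceof AbstractContainer) {
        $joined = '';
        foreach ($element->getElements() as $child) {
            $joined .= elementToText($child);
        }
        return $joined;
    }
    return '';
}
```

Inside the inner loop of the test script, `echo elementToText($ele1) . "\n";` would then replace the three `instanceof` branches.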
With the entire document formatted as "Default Style":
$ php wordtest.php
Element Class: PhpOffice\PhpWord\Element\TextRun
Class: PhpOffice\PhpWord\Element\Text
Test
Class: PhpOffice\PhpWord\Element\Text
Class: PhpOffice\PhpWord\Element\Text
test
Class: PhpOffice\PhpWord\Element\Text
Class: PhpOffice\PhpWord\Element\Text
test
Element Class: PhpOffice\PhpWord\Element\TextRun
Class: PhpOffice\PhpWord\Element\Text
test2
Class: PhpOffice\PhpWord\Element\Text
Class: PhpOffice\PhpWord\Element\Text
test2
Class: PhpOffice\PhpWord\Element\Text
Class: PhpOffice\PhpWord\Element\Text
test2
Format test2 as "Heading 1":
$ php wordtest.php
Element Class: PhpOffice\PhpWord\Element\TextRun
Class: PhpOffice\PhpWord\Element\Text
Test
Class: PhpOffice\PhpWord\Element\Text
Class: PhpOffice\PhpWord\Element\Text
test
Class: PhpOffice\PhpWord\Element\Text
Class: PhpOffice\PhpWord\Element\Text
test
Element Class: PhpOffice\PhpWord\Element\Title
test2
If I dig into the Title object, it contains only a single text entry, the first "test2" before the tab, and there are no sub-elements. If you alter the test2 line so that it reads differently (two tabs then text, or other combinations), it is returned as a Title object containing a TextRun, which can be unpacked to give back the original text as expected. The issue only seems to happen when there is a tab within the "Heading 1".
Digging into the XML of the original Word document on which I found this issue showed no difference in structure other than <w:pStyle w:val="Heading1"/> rather than <w:pStyle w:val="Normal"/>, so I'm fairly confident that it's something happening inside PhpWord.
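To compare what the file actually contains with what PhpWord returns, the heading paragraphs can be read straight from word/document.xml. The following is a minimal sketch using only PHP's bundled ZipArchive and DOM extensions. The function names are hypothetical, and it assumes the style id is "Heading1" as in the XML quoted above and that tabs are stored as <w:tab/> elements inside runs.

```php
<?php
/**
 * Hypothetical helper: return the text of every paragraph that uses the
 * given paragraph style, with each <w:tab/> inside a run rendered as "\t".
 */
function headingTextsFromDocumentXml(string $documentXml, string $styleId = 'Heading1'): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $headings = [];
    $paragraphs = $xpath->query(sprintf('//w:p[w:pPr/w:pStyle/@w:val="%s"]', $styleId));
    foreach ($paragraphs as $paragraph) {
        $text = '';
        // Only direct children of runs: a w:tab under w:pPr/w:tabs is a
        // tab stop definition, not a tab character in the text.
        foreach ($xpath->query('.//w:r/*[self::w:t or self::w:tab]', $paragraph) as $node) {
            $text .= $node->localName === 'tab' ? "\t" : $node->textContent;
        }
        $headings[] = $text;
    }
    return $headings;
}

/** Hypothetical helper: read the main document part out of a .docx archive. */
function readDocumentXml(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("word/document.xml not found in $path");
    }
    return $xml;
}
```

Calling `print_r(headingTextsFromDocumentXml(readDocumentXml('test.docx')));` should list each heading with its tabs intact if the file stores them as assumed, which helps separate a problem in the document from one in the reader.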
I acknowledge that the Word document is most probably not crafted very well given the tabs in the heading (in fact the rest of the table and tables in other documents are not marked up with a "Heading 1" style), but the source documents are authored by others so I don't have full control of what I work with.
I'm not sufficiently experienced with the internals of the library to resolve this issue myself, but hopefully my report will allow someone to investigate and resolve it.
@AlexMonaghanHop Hi, could you please send me a file that reproduces the error so that I can analyze it?