Issue with reading Tabs in Heading 1 from docx file

Open: AlexMonaghanHop opened this issue 4 years ago • 1 comment

Firstly, this is an excellent library and has enabled us to import data from old documents into our web database with relative ease. I can't upload the original source document as it's unpublished work. However, I've been able to reproduce the issue with a clean document.

In very simple terms, if a line is formatted as "Heading 1" and contains text delimited by tabs, it loads as a single "Title" object containing just the text before the first tab. If that same text is formatted as "Normal", it loads as a TextRun consisting of Text objects that together make up the original document text.

I've tested this on release 18.2

Document:

Test (tab) test (tab) test
test2 (tab) test2 (tab) test2

Saved as a .docx (I'm using LibreOffice on Linux, but my original document was a Word document from a Windows environment).

Simple test code to extract the text:

<?php
require_once 'vendor/autoload.php';

// Load the test document and walk every top-level element of every section
$phpWord = \PhpOffice\PhpWord\IOFactory::load('test.docx');
foreach ($phpWord->getSections() as $section) {
    foreach ($section->getElements() as $ele1) {
        echo "Element Class: " . get_class($ele1) . "\n";
        if ($ele1 instanceof \PhpOffice\PhpWord\Element\Text) {
            echo $ele1->getText() . "\n";
        }
        // Lines styled as "Heading 1" load as Title elements
        if ($ele1 instanceof \PhpOffice\PhpWord\Element\Title) {
            echo $ele1->getText() . "\n";
        }
        // Lines styled as "Normal" load as a TextRun of Text children
        if ($ele1 instanceof \PhpOffice\PhpWord\Element\TextRun) {
            foreach ($ele1->getElements() as $ele2) {
                echo "Class: " . get_class($ele2) . "\n";
                if ($ele2 instanceof \PhpOffice\PhpWord\Element\Text) {
                    echo $ele2->getText() . "\n";
                }
            }
        }
    }
}
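
The Title branch of the loop can be made robust to both shapes that Title::getText() may return: a plain string, or a TextRun whose child Text elements carry the pieces. The sketch below is a hypothetical helper (flattenTitleText is not part of PhpWord). It is duck-typed with method_exists so that it runs without the library loaded; with PhpWord available, instanceof checks against \PhpOffice\PhpWord\Element\TextRun would be more idiomatic. It cannot recover text that was already dropped while loading, so it would not change the "Heading 1" result reported in this issue.

```php
<?php
/**
 * Flatten the value returned by Title::getText() into one string.
 * Hypothetical helper for illustration; not part of PhpWord.
 *
 * @param mixed $text A string, or an object exposing getElements() (e.g. a TextRun)
 */
function flattenTitleText($text): string
{
    // Simple case: the title was loaded as a plain string
    if (is_string($text)) {
        return $text;
    }
    // Anything that is not a container of child elements yields no text
    if (!is_object($text) || !method_exists($text, 'getElements')) {
        return '';
    }
    $out = '';
    foreach ($text->getElements() as $child) {
        // Only children whose getText() returns a string contribute
        if (method_exists($child, 'getText') && is_string($child->getText())) {
            $out .= $child->getText();
        }
    }
    return $out;
}

// Possible use inside the Title branch of the test script:
// echo flattenTitleText($ele1->getText()) . "\n";
```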

With the entire document formatted as "Default Style":

$ php wordtest.php
Element Class: PhpOffice\PhpWord\Element\TextRun
Class: PhpOffice\PhpWord\Element\Text
Test
Class: PhpOffice\PhpWord\Element\Text

Class: PhpOffice\PhpWord\Element\Text
test
Class: PhpOffice\PhpWord\Element\Text

Class: PhpOffice\PhpWord\Element\Text
test
Element Class: PhpOffice\PhpWord\Element\TextRun
Class: PhpOffice\PhpWord\Element\Text
test2
Class: PhpOffice\PhpWord\Element\Text

Class: PhpOffice\PhpWord\Element\Text
test2
Class: PhpOffice\PhpWord\Element\Text

Class: PhpOffice\PhpWord\Element\Text
test2

With the test2 line formatted as "Heading 1":

$ php wordtest.php
Element Class: PhpOffice\PhpWord\Element\TextRun
Class: PhpOffice\PhpWord\Element\Text
Test
Class: PhpOffice\PhpWord\Element\Text

Class: PhpOffice\PhpWord\Element\Text
test
Class: PhpOffice\PhpWord\Element\Text

Class: PhpOffice\PhpWord\Element\Text
test
Element Class: PhpOffice\PhpWord\Element\Title
test2

If I dig down into the guts of the Title object, it consists of a single text entry holding only the first "test2" (the text before the tab), and there are no sub-elements. If you alter the test2 line so that it reads differently (two tabs then text, or other combinations), it is returned as a Title object containing a TextRun, which can then be unpacked and gives back the original text as expected. The issue only seems to happen when there is a tab within the "Heading 1".

Digging into the XML of the original Word document in which I found this issue showed no difference in the structure of the document other than <w:pStyle w:val="Heading1"/> rather than <w:pStyle w:val="Normal"/>, so I'm fairly confident that it's something happening inside PhpWord.
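
As a cross-check that the tabs are present in the file itself, the heading text can be rebuilt straight from the document XML. This is a minimal sketch under assumptions: the sample fragment is hand-written for illustration (a single run holding text and <w:tab/> elements, not taken from the affected document), and headingTextsWithTabs is a hypothetical helper rather than anything PhpWord provides. It relies only on PHP's DOM extension.

```php
<?php
// WordprocessingML main namespace, bound to the "w" prefix in document.xml
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/**
 * Collect the text of every paragraph styled "Heading1",
 * turning each <w:tab/> into a literal tab character.
 * Hypothetical helper for illustration only.
 *
 * @return string[]
 */
function headingTextsWithTabs(string $documentXml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $headings = [];
    foreach ($xpath->query('//w:p[w:pPr/w:pStyle/@w:val="Heading1"]') as $p) {
        $text = '';
        // Visit the text and tab children of each run, in document order
        foreach ($xpath->query('w:r/w:t | w:r/w:tab', $p) as $node) {
            $text .= $node->localName === 'tab' ? "\t" : $node->textContent;
        }
        $headings[] = $text;
    }
    return $headings;
}

// Assumed minimal fragment: one heading whose single run mixes text and tabs
$sample = <<<XML
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>test2</w:t><w:tab/><w:t>test2</w:t><w:tab/><w:t>test2</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
XML;

// json_encode makes the tab characters visible as \t
echo json_encode(headingTextsWithTabs($sample)), "\n";
```

For a real file, word/document.xml could be read out of the .docx with ZipArchive and passed to the same function.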

I acknowledge that the Word document is most probably not crafted very well given the tabs in the heading (in fact the rest of the table and tables in other documents are not marked up with a "Heading 1" style), but the source documents are authored by others so I don't have full control of what I work with.

I'm not sufficiently experienced with the internals of the library to resolve this issue, but hopefully my report will allow someone to investigate and resolve.

AlexMonaghanHop, Oct 14 '21 14:10

@AlexMonaghanHop Hi, could you please send me a file that reproduces the error so I can analyze it?

Progi1984, Aug 22 '24 12:08