
Accessibility issue: tables are not outputting <thead> and <th> tags

Open Dan503 opened this issue 7 years ago • 9 comments

In Microsoft Word, you can specify whether a table should have a header row (<thead>) and a header first column (<th> cells) through this toolbar:

Image of Word document toolbar

However, Mammoth ignores these settings when it processes a Word document. Instead, it strips out all table headings and outputs a table made up only of basic table cells.

For the following table, I used the settings shown in the image above.

Example table

This is the output I was expecting from Mammoth:

<table>
    <thead>
        <tr>
            <th>
                <p>Name</p>
            </th>
            <th>
                <p>Number</p>
            </th>
            <th>
                <p>Year</p>
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>
                <p>Thing</p>
            </th>
            <td>
                <p>123</p>
            </td>
            <td>
                <p>2017</p>
            </td>
        </tr>
        <tr>
            <th>
                <p>Other thing</p>
            </th>
            <td>
                <p>458</p>
            </td>
            <td>
                <p>2016</p>
            </td>
        </tr>
    </tbody>
</table>

This is the markup I got though:

<table>
    <tbody>
        <tr>
            <td>
                <p>Name</p>
            </td>
            <td>
                <p>Number</p>
            </td>
            <td>
                <p>Year</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Thing</p>
            </td>
            <td>
                <p>123</p>
            </td>
            <td>
                <p>2017</p>
            </td>
        </tr>
        <tr>
            <td>
                <p>Other thing</p>
            </td>
            <td>
                <p>458</p>
            </td>
            <td>
                <p>2016</p>
            </td>
        </tr>
    </tbody>
</table>

Mammoth version: v1.4.2
OS: Windows 10
node.js version: 6.9.4

Dan503 avatar Jul 25 '17 00:07 Dan503

Could you provide a minimal example document?

mwilliamson avatar Jul 25 '17 08:07 mwilliamson

Here is a minimal example word document: mammoth-table-issue.docx

Dan503 avatar Jul 25 '17 08:07 Dan503

To support this, it looks like w:tbl/w:tblPr/w:tblLook/@w:firstRow and w:tbl/w:tblPr/w:tblLook/@w:firstColumn need to be read.

It's also worth noting that thead and th tags should be created if you mark rows as being repeated header rows.
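As a rough illustration of reading those two attributes, here is a minimal sketch. It is not Mammoth's actual code: `readTableLook` and `isOn` are hypothetical helpers, and the sketch assumes the attributes of the w:tblLook element are already available as a plain object of strings from whatever XML reader is in use. It also assumes the usual on/off spellings ("1", "true", "on") for these attributes.

```javascript
// Hypothetical helper, not part of Mammoth's API.
// Returns true for the on/off attribute spellings assumed here.
function isOn(value) {
    return value === "1" || value === "true" || value === "on";
}

// `attributes` is assumed to be a plain object mapping attribute names
// to string values for a w:tblLook element, e.g.
// {"w:firstRow": "1", "w:firstColumn": "0"}.
// A missing element or attribute is treated as "off".
function readTableLook(attributes) {
    var attrs = attributes || {};
    return {
        firstRowIsHeader: isOn(attrs["w:firstRow"]),
        firstColumnIsHeader: isOn(attrs["w:firstColumn"])
    };
}
```

Treating a missing attribute as "off" keeps the existing output unchanged for documents that never set these options.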

mwilliamson avatar Jul 25 '17 18:07 mwilliamson

I'm just wondering, is this bug likely to be fixed by the 1st of September?

My company has a site going live in a few months and it depends on this bug being fixed for it to pass accessibility.

Dan503 avatar Aug 08 '17 05:08 Dan503

Adding support should be reasonably straightforward, but I'm not sure when I'll get time to work on this (since it's just a side-project). In other words, I wouldn't rely on it.

mwilliamson avatar Aug 08 '17 20:08 mwilliamson

I'm planning on doing the fix myself as a pull request.

Can you help point me in the right direction so I know where to apply the fix?

Dan503 avatar Aug 14 '17 07:08 Dan503

There are two main places you'd need to look at. One is the code that parses the document in lib/docx/body-reader.js. The existing code that handles table headers is probably a good feature to look at for a rough idea of how to implement this. For header rows, you probably want to reuse the same property, i.e. isHeader on table rows, and add a property to handle header columns. You then need to update the conversion to HTML in lib/document-to-html.js. Header rows will already be handled by the existing code, but you'd need to add support for header columns.

Each module should be covered by tests. The test directory structure should mirror the directory structure of the code under test, so hopefully they're reasonably straightforward to navigate around. Again, looking for the existing support for table headers is probably a good place to start.
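To make the HTML side of this concrete, here is a standalone sketch of the intended output logic. It does not use Mammoth's internal document model or the real code in lib/document-to-html.js: `renderTable` is a hypothetical function, the table shape (`rows`, `cells`, `html`) is invented for illustration, and the per-cell `isHeader` flag stands in for whatever header-column property ends up being added. Only `isHeader` on rows comes from the description above.

```javascript
// Hypothetical sketch: renders a simplified table model to HTML.
// table = {rows: [{isHeader: boolean, cells: [{isHeader: boolean, html: string}]}]}
function renderTable(table) {
    // Leading rows flagged isHeader go into <thead>; the rest go into <tbody>.
    var headCount = 0;
    while (headCount < table.rows.length && table.rows[headCount].isHeader) {
        headCount++;
    }
    var headRows = table.rows.slice(0, headCount);
    var bodyRows = table.rows.slice(headCount);

    function renderRow(row, inHead) {
        var cells = row.cells.map(function(cell) {
            // Every cell in a header row is a <th>; in body rows only
            // cells flagged as header-column cells are.
            var tag = (inHead || cell.isHeader) ? "th" : "td";
            return "<" + tag + ">" + cell.html + "</" + tag + ">";
        });
        return "<tr>" + cells.join("") + "</tr>";
    }

    var html = "<table>";
    if (headRows.length > 0) {
        html += "<thead>" + headRows.map(function(row) {
            return renderRow(row, true);
        }).join("") + "</thead>";
    }
    if (bodyRows.length > 0) {
        html += "<tbody>" + bodyRows.map(function(row) {
            return renderRow(row, false);
        }).join("") + "</tbody>";
    }
    return html + "</table>";
}

// Usage with the first two rows of the example table from this issue.
var exampleHtml = renderTable({rows: [
    {isHeader: true, cells: [
        {isHeader: false, html: "<p>Name</p>"},
        {isHeader: false, html: "<p>Number</p>"},
        {isHeader: false, html: "<p>Year</p>"}
    ]},
    {isHeader: false, cells: [
        {isHeader: true, html: "<p>Thing</p>"},
        {isHeader: false, html: "<p>123</p>"},
        {isHeader: false, html: "<p>2017</p>"}
    ]}
]});
```

Only leading header rows are moved into the thead here; a header row that appears after a body row stays in the tbody, which is a design choice of this sketch rather than something the thread specifies.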

mwilliamson avatar Aug 14 '17 18:08 mwilliamson

@mwilliamson, I'm facing exactly the same issue in the Python implementation. How can I go about getting a fix for it there? Is the JS code similar enough that I could port it, or would the approach be different?

Thanks! Grant

P.S.: Great library by the way! Thanks for implementing it.

grantstead avatar Apr 19 '19 07:04 grantstead

The Python implementation is fairly similar to the JavaScript implementation, but it's worth noting that they (should!) have the same level of support for tables.

mwilliamson avatar Apr 20 '19 13:04 mwilliamson