Long table rows cause incorrect table conversions

Open ziima opened this issue 8 years ago • 2 comments

  • Version by html2text.__version__: (2016, 9, 19)
  • Test script
import html2text
conv = html2text.HTML2Text()
conv.pad_tables = True

print conv.handle('<table><tr><td>1</td><td>2</td></tr><tr><td>juju</td><td><a href="http://example.com/there/is/some/very/long/path">huhu</a></td></tr></table>')
# 1    | 2                                                       
# -----|---------------------------------------------------------
# juju | [huhu](http://example.com/there/is/some/very/long/path) 
# Narrow the body width: the same table is no longer rendered as two columns.
conv.body_width = 40
print conv.handle('<table><tr><td>1</td><td>2</td></tr><tr><td>juju</td><td><a href="http://example.com/there/is/some/very/long/path">huhu</a></td></tr></table>')
# 1                                                       
# --------------------------------------------------------
# juju                                                    
# [huhu](http://example.com/there/is/some/very/long/path) 

  • Python version python --version: 2.7.13

ziima • May 11 '17 09:05

I have this problem too. It looks like html2text hard-wraps lines at 78 or 80 characters, which breaks the Markdown table layout. It also has problems when a row spans multiple columns. E.g.:

Input:

<table>
 <tr class="lightgrey">
   <td class="td_row"><b>Row#</td>
   <td class="td_verification"><b>Web Service</td>
   <td class="td_endpoint"><b>Description</td>
   <td class="td_call_status"><b>Call Status</td>
   <td class="td_verification"><b>Passed Verification</td>
   <td class="td_verification"><b>Failed Verification</td></tr>
<tr class="comment">
<td colspan="6"><H1>Q0_WS01 RedemptionServices</H1></td></tr>
<tr class="comment">
<td colspan="6"><H2>0. SetUp</H2></td></tr>
 <tr class="lightgreen">
<td class="td_row_num">5</td>
<td class="td_nowrap">createAPMember1.0 <a href="RedemptionServices.xls-Z5H-20171207-123235-Row-5-Request.xml">Request</a> <a href="RedemptionServices.xls-Z5H-20171207-123235-Row-5-Response.xml">Response</a></td>
<td>Setup step: Create test AirPoints member</td>
<td>PASS</td>
<td>6</td>
<td>0</td></tr>

Result: the table layout is broken up, and long rows have been wrapped with hard line breaks at column 78.

**Row# | **Web Service | **Description | **Call Status | **Passed Verification
| **Failed Verification
---|---|---|---|---|---

# Q0_WS01 RedemptionServices

## 0\. SetUp

5 | createAPMember1.0
[Request](RedemptionServices.xls-Z5H-20171207-123235-Row-5-Request.xml)
[Response](RedemptionServices.xls-Z5H-20171207-123235-Row-5-Response.xml) |
Setup step: Create test AirPoints member | PASS | 6 | 0
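
To see why hard wrapping alone is enough to break a row like the one above, here is a minimal sketch using Python's standard `textwrap` module. It is only an illustration and not html2text's actual code path; `ROW` and `render_row` are names made up for this example, and 78 is the width reported above.

```python
import textwrap

# The data row from the report above, as it should appear on a single line.
ROW = (
    "5 | createAPMember1.0 "
    "[Request](RedemptionServices.xls-Z5H-20171207-123235-Row-5-Request.xml) "
    "[Response](RedemptionServices.xls-Z5H-20171207-123235-Row-5-Response.xml) | "
    "Setup step: Create test AirPoints member | PASS | 6 | 0"
)


def render_row(row, body_width):
    """Return the output lines for one table row (hypothetical helper).

    A body_width of 0 means "do not wrap", mirroring the -b0 idea.
    """
    if body_width <= 0:
        return [row]
    # Wrap only at whitespace; never split a long link in the middle.
    return textwrap.wrap(
        row, body_width, break_long_words=False, break_on_hyphens=False
    )


wrapped = render_row(ROW, 78)   # several lines: no longer a valid table row
unwrapped = render_row(ROW, 0)  # one line: the row stays intact

for line in wrapped:
    print(line)
```

A Markdown table row has to stay on one line, so any wrap that lands inside the row splits it into fragments that a renderer treats as separate paragraphs.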

roblogic • Dec 07 '17 03:12

Fix:

  • don't pad tables
  • do use the following options (or their equivalents in the Python API)
$ html2text -b0 --no-wrap-links AnnoyingTabulatedReport.html > NicerReport.md
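
For the Python side of this fix, a rough sketch follows. It assumes the installed release exposes `body_width` and `wrap_links` as attributes on `html2text.HTML2Text` (matching `-b0` and `--no-wrap-links`), next to the `pad_tables` attribute from the original test script. `FIX_OPTIONS` and `convert_report` are hypothetical names used only here.

```python
# Assumed mapping from the CLI flags above to HTML2Text attributes.
FIX_OPTIONS = {
    "body_width": 0,      # -b0: turn off hard wrapping
    "wrap_links": False,  # --no-wrap-links: keep each link on one line
    "pad_tables": False,  # "don't pad tables"
}


def convert_report(html, converter=None):
    """Convert HTML to Markdown with the options from the fix applied.

    `converter` is optional so that a pre-configured (or stand-in) object
    can be passed; by default a fresh html2text.HTML2Text() is created.
    """
    if converter is None:
        import html2text  # third-party package: pip install html2text
        converter = html2text.HTML2Text()
    for name, value in FIX_OPTIONS.items():
        setattr(converter, name, value)
    return converter.handle(html)
```

Usage would be along the lines of `convert_report(open("AnnoyingTabulatedReport.html").read())`. With `body_width` at 0 nothing is wrapped at all, so `wrap_links` is probably redundant here; it is kept only to mirror the command line.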

(Rows that span multiple columns are still not handled, but I suppose that's a separate issue.)

roblogic • Dec 07 '17 04:12