python-markdownify
Add table 'rowspan' support
I had a quick look at the code, and it seems there's support for the 'colspan' attribute but not for 'rowspan'. Are there any plans to add support?
HTML example
<!DOCTYPE html>
<html>
<head>
  <style>
    table, th, td {
      border: 1px solid black;
    }
  </style>
</head>
<body>
  <h1>The td rowspan attribute</h1>
  <table>
    <tr>
      <th>Month</th>
      <th>Savings</th>
      <th>Savings for holiday!</th>
    </tr>
    <tr>
      <td>January</td>
      <td>$100</td>
      <td rowspan="2">$50</td>
    </tr>
    <tr>
      <td>February</td>
      <td>$80</td>
    </tr>
  </table>
</body>
</html>
Parsed MD table
The td rowspan attribute
========================
| Month | Savings | Savings for holiday! |
| --- | --- | --- |
| January | $100 | $50 |
| February | $80 |
Desired MD output
The td rowspan attribute
========================
| Month | Savings | Savings for holiday! |
| --- | --- | --- |
| January | $100 | $50 |
| February | $80 | |
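For illustration, here is a minimal, dependency-free sketch of the bookkeeping that would produce this output: remember which columns are still covered by a rowspan from an earlier row, and emit an empty cell for them. It uses Python's standard html.parser rather than markdownify's internals. TableGrid and table_to_markdown are hypothetical names, not part of markdownify, and the sketch assumes a single, well-formed, non-nested table whose first row is the header.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect table cells into rows, leaving rowspan-covered cells empty."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self.pending = {}   # column index -> later rows still covered by a rowspan
        self.row = None     # row currently being built
        self.cell = None    # text chunks of the cell currently being read
        self.span = (1, 1)  # (rowspan, colspan) of the current cell

    def _skip_covered(self):
        # Emit empty cells for columns still covered by a rowspan from above.
        while self.pending.get(len(self.row), 0) > 0:
            self.pending[len(self.row)] -= 1
            self.row.append("")

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            a = dict(attrs)
            self.span = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self._skip_covered()
            rowspan, colspan = self.span
            text = " ".join("".join(self.cell).split())
            for i in range(colspan):
                if rowspan > 1:
                    # Mark this column as covered in the following rows.
                    self.pending[len(self.row)] = rowspan - 1
                self.row.append(text if i == 0 else "")
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self._skip_covered()  # trailing columns covered from above
            self.rows.append(self.row)
            self.row = None


def _line(cells):
    # Empty cells are rendered as "| |". Pipes inside cell text are not escaped.
    return "|" + "".join(f" {c} |" if c else " |" for c in cells)


def table_to_markdown(html):
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = [_line(header), _line(["---"] * len(header))]
    lines += [_line(row) for row in body]
    return "\n".join(lines)


SAMPLE = (
    "<table>"
    "<tr><th>Month</th><th>Savings</th><th>Savings for holiday!</th></tr>"
    '<tr><td>January</td><td>$100</td><td rowspan="2">$50</td></tr>'
    "<tr><td>February</td><td>$80</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    print(table_to_markdown(SAMPLE))
```

Leaving the covered cell empty is a choice made here to match the desired output above. Repeating the spanned value in each covered row would be an equally reasonable convention, since Markdown tables have no way to merge cells.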
I had this issue as well, and I was able to get the desired behavior with a customization.
Requires:
- pandas
- tabulate
- html5lib
from io import StringIO

import pandas as pd
from bs4 import Tag
from markdownify import MarkdownConverter


class MyMarkdownConverter(MarkdownConverter):
    """A custom MarkdownConverter.

    This class is a subclass of the MarkdownConverter class from the markdownify library.
    It overrides convert_th, convert_tr, convert_td, convert_thead, and convert_tbody
    so that the <th>, <tr>, <td>, <thead>, and <tbody> tags are passed through as raw HTML.
    For <table> tags, convert_table converts the table to a DataFrame and then converts
    the DataFrame to Markdown.
    This gives us the desired behavior of handling rowspan, which markdownify does not handle.
    """

    def convert_table(self, el: Tag, text, convert_as_inline):
        try:
            # Parse the original <table> element, not the converted child text.
            df = pd.read_html(StringIO(str(el)))[0]
        except Exception as e:
            print(f"Error converting table to DataFrame: {str(el)}")
            print(e)
            # Fall back to the raw HTML; otherwise df would be undefined below.
            return str(el)
        # Replace NaN with an empty string.
        df = df.fillna("")
        # Convert the DataFrame to Markdown (requires tabulate).
        return df.to_markdown(index=False)

    def convert_th(self, el: Tag, text, convert_as_inline):
        """No-op for the <th> tag: return the HTML as is."""
        return str(el)

    def convert_tr(self, el: Tag, text, convert_as_inline):
        """No-op for the <tr> tag: return the HTML as is."""
        return str(el)

    def convert_td(self, el: Tag, text, convert_as_inline):
        """No-op for the <td> tag: return the HTML as is."""
        return str(el)

    def convert_thead(self, el: Tag, text, convert_as_inline):
        """No-op for the <thead> tag: return the HTML as is."""
        return str(el)

    def convert_tbody(self, el: Tag, text, convert_as_inline):
        """No-op for the <tbody> tag: return the HTML as is."""
        return str(el)