python-markdownify

Add table 'rowspan' support

Open · ffolkes1911 opened this issue 10 months ago • 1 comment

I had a quick look at the code, and it seems there is support for the 'colspan' attribute but not for 'rowspan'. Are there any plans to add support?

HTML example
<!DOCTYPE html>
<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
</style>
</head>
<body>

<h1>The td rowspan attribute</h1>

<table>
  <tr>
    <th>Month</th>
    <th>Savings</th>
    <th>Savings for holiday!</th>
  </tr>
  <tr>
    <td>January</td>
    <td>$100</td>
    <td rowspan="2">$50</td>
  </tr>
  <tr>
    <td>February</td>
    <td>$80</td>
  </tr>
</table>

</body>
</html>

Parsed MD table
The td rowspan attribute
========================


| Month | Savings | Savings for holiday! |
| --- | --- | --- |
| January | $100 | $50 |
| February | $80 |

Desired MD output
The td rowspan attribute
========================


| Month | Savings | Savings for holiday! |
| --- | --- | --- |
| January | $100 | $50 |
| February | $80 | |
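
The desired output amounts to remembering which columns are still covered by a rowspan from an earlier row and emitting an empty cell for each of them. Below is a minimal standalone sketch of that idea using only the standard library. It is not markdownify's implementation: `_TableGrid` and `table_to_markdown` are hypothetical names, and the sketch assumes a single, non-nested table with plain-text cells and ignores colspan.

```python
from html.parser import HTMLParser


class _TableGrid(HTMLParser):
    """Hypothetical helper: collect table cells into rows, padding rowspans."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._pending = {}    # column index -> rows still covered by a rowspan
        self._row = None      # row currently being built
        self._cell = None     # text fragments of the cell currently open
        self._span = 1        # rowspan of the cell currently open

    def _skip_covered(self):
        # Emit an empty cell for every column covered by an earlier rowspan.
        while self._pending.get(len(self._row), 0) > 0:
            self._pending[len(self._row)] -= 1
            self._row.append("")

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._skip_covered()
            self._cell = []
            self._span = int(dict(attrs).get("rowspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._span > 1:
                # This column stays occupied for the next (span - 1) rows.
                self._pending[len(self._row)] = self._span - 1
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._skip_covered()  # covered columns at the end of the row
            self.rows.append(self._row)
            self._row = None


def table_to_markdown(html):
    """Render the table as Markdown, treating the first row as the header."""
    parser = _TableGrid()
    parser.feed(html)
    if not parser.rows:
        return ""

    def render(cells):
        # Empty cells are rendered as "| |", as in the desired output.
        return "|" + "".join(f" {c} |" if c else " |" for c in cells)

    header, *body = parser.rows
    lines = [render(header), render(["---"] * len(header))]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)
```

Calling `table_to_markdown` on the `<table>` element from the HTML example is intended to give the desired table, with the third column of the February row left empty.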

ffolkes1911 · Apr 11 '24 09:04

I had this issue as well, and I was able to get the desired behavior with a customization.

Requires:

  • pandas
  • tabulate
  • html5lib

from io import StringIO

import pandas as pd
from bs4 import Tag
from markdownify import MarkdownConverter


class MyMarkdownConverter(MarkdownConverter):
    """A custom MarkdownConverter.

    This class is a subclass of the MarkdownConverter class from the markdownify library.
    It overrides the convert_th, convert_tr, convert_td, convert_thead, and convert_tbody methods
    so that the <th>, <tr>, <td>, <thead>, and <tbody> tags are passed through as unchanged HTML.

    For <table> tags, it converts the table to a DataFrame and then converts the DataFrame to Markdown.
    This gives us the desired behavior of handling rowspan, which markdownify does not handle.
    Note that pandas copies a spanned cell's contents into every cell it covers.
    """

    def convert_table(self, el, text, convert_as_inline):
        try:
            # Parse the original <table> element; the converted child text is ignored.
            df = pd.read_html(StringIO(str(el)))[0]
            # Replace NaN (empty cells) with an empty string.
            df = df.fillna("")
        except Exception as e:
            print(f"Error converting table to DataFrame: {str(el)}")
            print(e)
            # Fall back to the raw HTML table instead of using an undefined df.
            return str(el)

        # Convert the DataFrame to Markdown (this is what requires tabulate).
        return df.to_markdown(index=False)

    def convert_th(self, el: Tag, text, convert_as_inline):
        """Return the <th> HTML as is; convert_table handles the whole table."""
        return str(el)

    def convert_tr(self, el: Tag, text, convert_as_inline):
        """Return the <tr> HTML as is; convert_table handles the whole table."""
        return str(el)

    def convert_td(self, el: Tag, text, convert_as_inline):
        """Return the <td> HTML as is; convert_table handles the whole table."""
        return str(el)

    def convert_thead(self, el: Tag, text, convert_as_inline):
        """Return the <thead> HTML as is; convert_table handles the whole table."""
        return str(el)

    def convert_tbody(self, el: Tag, text, convert_as_inline):
        """Return the <tbody> HTML as is; convert_table handles the whole table."""
        return str(el)

andrewDoing · Aug 01 '24 20:08