
[html rowspan] Rowspan is not handled.

Open frosencrantz opened this issue 2 years ago • 11 comments

Small description: HTML table loading doesn't handle rowspan properly.

Expected result: The data in the rowspan column is duplicated on the spanned rows.

Actual result with screenshot

https://asciinema.org/a/qotdqplkmpKJxPUQUC5kaGHEd

The animation shows the HTML files, how w3m renders the data, and how VisiData shows the data. The columns with errors raise this exception:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/visidata/wrappers.py", line 108, in wrapply
    return func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/visidata/column.py", line 273, in getValue
    return self.calcValue(row)
  File "/usr/lib/python3.9/site-packages/visidata/column.py", line 235, in calcValue
    return (self.getter)(self, row)
  File "/usr/lib/python3.9/site-packages/visidata/loaders/html.py", line 103, in <lambda>
    self.addColumn(Column(name, getter=lambda c,r,i=colnum: r[i][0]))
IndexError: list index out of range
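
The failure can be reproduced without VisiData. This is a minimal sketch, assuming each parsed row is a list of `(text, href)` pairs; that pair layout and the `fetch` helper are hypothetical, and only the `r[i][0]` getter shape comes from the traceback above. A row that lost a cell to a rowspan is shorter than the column count, so the last getter indexes past its end:

```python
# Hypothetical parsed rows: each cell is a (text, href) pair (an assumption).
rows = [
    [("1.1", None), ("1.2", None), ("1.3", None)],  # first row: 3 cells
    [("2.2", None), ("2.3", None)],                 # spanned row: only 2 cells
]

# Same shape as the getter in the traceback: the column index is bound
# as a default argument, then the row is indexed and the text is taken.
getters = [lambda c, r, i=colnum: r[i][0] for colnum in range(3)]

def fetch(getter, row):
    """Return the cell text, or the IndexError raised while fetching it."""
    try:
        return getter(None, row)
    except IndexError as exc:
        return exc

first = [fetch(g, rows[0]) for g in getters]
second = [fetch(g, rows[1]) for g in getters]
print(first)
print(second)
```

In the second row the values shift left by one column and the last column holds the exception, which matches the behavior described in this report.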

Steps to reproduce with sample data and a .vd

I would expect the data in columns 2 & 3 to be the same for both of these tables.

Regular 3x3.html (works)

<table >
        <tbody>
                <tr>
                        <td>1.1 </td>
                        <td>1.2 </td>
                        <td>1.3 </td>
                </tr>
                <tr>
                        <td> 2.1 </td>
                        <td> 2.2 </td>
                        <td> 2.3 </td>
                </tr>
                <tr>
                        <td> 3.1 </td>
                        <td> 3.2 </td>
                        <td> 3.3 </td>
                </tr>
        </tbody>
</table>

With row span 3x3-rowspan.html (breaks):

<table >
        <tbody>
                <tr>
                        <td rowspan=3>1.1 </td>
                        <td>1.2 </td>
                        <td>1.3 </td>
                </tr>
                <tr>
                        <td> 2.2 </td>
                        <td> 2.3 </td>
                </tr>
                <tr>
                        <td> 3.2 </td>
                        <td> 3.3 </td>
                </tr>
        </tbody>
</table>
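
The expected result for this table can be sketched with the standard library alone. This is an illustration under stated assumptions, not the VisiData loader: `CellCollector` and `expand_rowspans` are hypothetical names, the input is assumed to be well-formed, and only rowspan (not colspan) is handled.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect each <tr> as a list of (text, rowspan) pairs."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._chunks = None  # text pieces of the currently open cell
        self._span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._chunks = []
            self._span = int(dict(attrs).get("rowspan", 1))

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._chunks is not None:
            self.rows[-1].append(("".join(self._chunks).strip(), self._span))
            self._chunks = None

def expand_rowspans(rows):
    """Duplicate every rowspan cell into the rows it spans."""
    pending = {}  # column index -> (text, number of rows still to fill)
    out = []
    for cells in rows:
        cells = iter(cells)
        line, col = [], 0
        while True:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, left = pending.pop(col)
                if left > 1:
                    pending[col] = (text, left - 1)
            else:
                cell = next(cells, None)
                if cell is None:
                    break
                text, span = cell
                if span > 1:
                    pending[col] = (text, span - 1)
            line.append(text)
            col += 1
        out.append(line)
    return out

SAMPLE = """<table><tbody>
<tr><td rowspan=3>1.1 </td><td>1.2 </td><td>1.3 </td></tr>
<tr><td> 2.2 </td><td> 2.3 </td></tr>
<tr><td> 3.2 </td><td> 3.3 </td></tr>
</tbody></table>"""

parser = CellCollector()
parser.feed(SAMPLE)
grid = expand_rowspans(parser.rows)
for line in grid:
    print(line)
```

Every output row has three values, with `1.1` repeated in the first column, so columns 2 & 3 line up with the regular 3x3 table.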

Additional context: Using the latest version from the develop branch.

I have data sources I try to use with Visidata that make use of rowspan to format html tables.

There is code in the html loader to handle rowspan for column headers.

frosencrantz avatar Feb 27 '22 19:02 frosencrantz

I realized later that the animation was missing the w3m output. Maybe next week I can show that. Still, I think the VisiData output shows that in rows with no cell of their own in the rowspan column, the values are shifted left.

frosencrantz avatar Mar 06 '22 23:03 frosencrantz

Here is the w3m-only screenshot. The first one shows the table with rowspan, and the second example without rowspan.

https://asciinema.org/a/SE31VbT4U156d9s1VBFJO7Rxs

That first column that uses rowspan should touch all rows that are spanned, not just the first row.

frosencrantz avatar Mar 12 '22 18:03 frosencrantz

Thank you for providing the two nearly identical sets of sample data, one with rowspan and one without. It helps with seeing the problem much more clearly!

anjakefala avatar Mar 12 '22 19:03 anjakefala

Rowspans used to be at least partially handled! I think they were not handled in the way that is expected, and the logic needs to be adjusted.

But they were not resulting in an Exception. This is the change where the Cell-Exceptions started: 8a663b839c6528197d93a35ee3b09fb29f176226

anjakefala avatar Mar 12 '22 20:03 anjakefala

One thing to note:

VisiData expects that the rowspan attribute is in a <th> tag! Is rowspan being in the first <td> (instead of there being an explicit <th>) a realistic scenario?

I.e. this is what VisiData is expecting for rowspan:

                <tr>
                        <th rowspan=3>1.1 </th>
                        <th>1.2 </th>
                        <th>1.3 </th>
                </tr>

Edit: It seems like rowspan could be expected in <td>. So this issue has two parts:

  • Do we want to handle the rowspan scenario for <td>, and if so, what will that look like?
  • rowspan in <th> does the right structuring in the Sheet, but still ends up with within-cell Exceptions that we will need to handle.

anjakefala avatar Mar 12 '22 20:03 anjakefala

I'm not sure if you are asking me. I do think the rowspan should be handled for <td> cells, since it has implications for how other rows/cols are aligned. Compared to how browsers present the data, Visidata incorrectly displays the data, and misses the intended structure.

frosencrantz avatar Mar 12 '22 23:03 frosencrantz

Hi @anjakefala

I looked a little more at this issue. It reminds me how odd HTML tables are as a data format. And when I look for live examples, I find worse examples.

Here is a simple example file I created that shows the difference between header and data rows. The HTML is basically the same for the header as for the body (except for swapping th/td and thead/tbody).

You can see that w3m formats them the same, but visidata has a different view. For the header rows, it looks like visidata is doing the expected thing by flattening the values into one header row. For the body rows it shows this bug where colspan/rowspan are ignored.

https://asciinema.org/a/7GI0SKWYPecD8hcq1utN6RrPU

<table border>
        <thead>
                <tr>
                        <th rowspan=2 colspan=2>1.1 </th>
                        <th>1.3 </th>
                </tr>
                <tr>
                        <th> 2.3 </th>
                </tr>
                <tr>
                        <th> 3.1 </th>
                        <th colspan=2> 3.2 </th>
                </tr>

        </thead>
        <tbody>
                <tr>
                        <td rowspan=2 colspan=2>1.1 </td>
                        <td>1.3 </td>
                </tr>
                <tr>
                        <td> 2.3 </td>
                </tr>
                <tr>
                        <td> 3.1 </td>
                        <td colspan=2> 3.2 </td>
                </tr>
        </tbody>
</table>

frosencrantz avatar May 08 '22 21:05 frosencrantz

FYI: I found a tool that claims to handle reading tables with a colspan/rowspan: https://github.com/rocheio/wiki-table-scrape

It works if you have only one of the types of spans, but my simple tests suggested it doesn't properly handle both types of spans on the same cell. It looks like it misses the lower right corner of a colspan=2 rowspan=2 cell.

frosencrantz avatar Jun 04 '23 22:06 frosencrantz

VisiData has an alternate way to read html with pandas, so I tried that, but I found a new bug: https://github.com/saulpw/visidata/issues/1986

The pandas read_html function returns a list of DataFrames, while the other read_* functions return a single DataFrame.

frosencrantz avatar Aug 06 '23 17:08 frosencrantz

The pandas reader also seems to have issues with some of the tables I want to be able to read.

Here is a deep dive into how to parse HTML tables, including the algorithms:

https://html.spec.whatwg.org/multipage/tables.html#table-processing-model

frosencrantz avatar Dec 28 '23 22:12 frosencrantz

One thing I noticed in reading this is that a table is modeled as a 2-D grid of slots, very much like VisiData. Some slots can be empty, or they can be occupied by one or more cells (e.g. TD/TH). Cells occupy the slot they first encounter, and may occupy more, but only to the right and down because of colspan/rowspan:

A table consists of cells aligned on a two-dimensional grid of slots with coordinates (x, y). The grid is finite, and is either empty or has one or more slots. If the grid has one or more slots, then the x coordinates are always in the range 0 ≤ x < x_width, and the y coordinates are always in the range 0 ≤ y < y_height. If one or both of x_width and y_height are zero, then the table is empty (has no slots). Tables correspond to table elements.
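
The slot model can be sketched in a few lines. This is a simplified illustration, not the full algorithm from the spec: it ignores row groups, `rowspan=0`, and error handling for overlapping cells (the first cell to claim a slot keeps it). The names `build_slots` and `to_grid` are hypothetical, and cells are assumed to be already parsed into `(text, colspan, rowspan)` tuples. The sample rows are the `<tbody>` from the colspan/rowspan example earlier in this thread.

```python
def build_slots(rows):
    """Map (x, y) slot coordinates to cell text.

    rows: list of table rows, each a list of (text, colspan, rowspan).
    Returns (slots, x_width, y_height).
    """
    slots = {}
    y_height = len(rows)
    for y, cells in enumerate(rows):
        x = 0
        for text, colspan, rowspan in cells:
            # Skip slots already claimed by a cell spanning down from above.
            while (x, y) in slots:
                x += 1
            # A cell grows only right and down; clip it at the last row.
            for dy in range(min(rowspan, y_height - y)):
                for dx in range(colspan):
                    slots.setdefault((x + dx, y + dy), text)
            x += colspan
    x_width = 1 + max((sx for sx, _ in slots), default=-1)
    return slots, x_width, y_height

def to_grid(slots, x_width, y_height, empty=None):
    """Flatten the slot map into a rectangular list of rows."""
    return [[slots.get((x, y), empty) for x in range(x_width)]
            for y in range(y_height)]

body = [
    [("1.1", 2, 2), ("1.3", 1, 1)],
    [("2.3", 1, 1)],
    [("3.1", 1, 1), ("3.2", 2, 1)],
]
slots, x_width, y_height = build_slots(body)
grid = to_grid(slots, x_width, y_height)
for line in grid:
    print(line)
```

Because every slot of the 2x2 span is filled, including the lower right corner at (1, 1), each row comes out with the same number of values and a per-column getter would never index past the end of a row.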

frosencrantz avatar Dec 30 '23 00:12 frosencrantz