
Convert HTML file with table(s) to DataFrame

Open s-celles opened this issue 6 years ago • 12 comments

Hello,

I have an HTML file with a table and would like to convert it to a Julia DataFrame.

I was looking for a function similar to the read_html function in Python's pandas (which directly returns a list of DataFrames).

Unfortunately, I don't see a similar function in the Julia ecosystem.

In the Gumbo docs I was looking for an example that iterates over the rows and columns of each table.

Here is a basic HTML source file with 2 tables:

<!DOCTYPE >
<HTML>
  <head></head>
  <body>

    <h1>First table</h1>
    <table>
      <tbody>
        <tr>
          <th>
            A
          </th>
          <th>
            B
          </th>
        </tr>
        <tr>
          <td>
            1
          </td>
          <td>
            1.1
          </td>
        </tr>
        <tr>
          <td>
            2
          </td>
          <td>
            2.1
          </td>
        </tr>
      </tbody>
    </table>

    <h1>Second table</h1>
    <table>
      <tbody>
        <tr>
          <th>
            AA
          </th>
          <th>
            BB
          </th>
        </tr>
        <tr>
          <td>
            10
          </td>
          <td>
            10.1
          </td>
        </tr>
        <tr>
          <td>
            20
          </td>
          <td>
            20.1
          </td>
        </tr>
      </tbody>
    </table>

  </body>
</HTML>
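For readers looking for the row-and-column iteration asked about here, one possible approach with Gumbo alone is sketched below. It is only a sketch under assumptions: nodetext, descendants and table_rows are hypothetical helper names (not part of Gumbo.jl), the HTML is a compacted version of the file above, and nested tables are not handled (rows of an inner table would also be attributed to the outer one).

```julia
using Gumbo

# Concatenated text of a node and all of its descendants.
nodetext(n::HTMLText) = n.text
nodetext(n::HTMLElement) = join(nodetext(c) for c in n.children)
nodetext(::HTMLNode) = ""  # any other node type contributes no text

# Collect all descendant elements with the given tag, in document order.
function descendants(node::HTMLElement, name::Symbol, found=HTMLElement[])
    for child in node.children
        child isa HTMLElement || continue
        tag(child) == name && push!(found, child)
        descendants(child, name, found)
    end
    return found
end

# Stripped text of every <th>/<td> cell, one vector per <tr>.
function table_rows(table::HTMLElement)
    return [[strip(nodetext(c)) for c in tr.children
             if c isa HTMLElement && tag(c) in (:th, :td)]
            for tr in descendants(table, :tr)]
end

html = """
<html><body>
  <h1>First table</h1>
  <table><tbody>
    <tr><th> A </th><th> B </th></tr>
    <tr><td> 1 </td><td> 1.1 </td></tr>
    <tr><td> 2 </td><td> 2.1 </td></tr>
  </tbody></table>
  <h1>Second table</h1>
  <table><tbody>
    <tr><th> AA </th><th> BB </th></tr>
    <tr><td> 10 </td><td> 10.1 </td></tr>
    <tr><td> 20 </td><td> 20.1 </td></tr>
  </tbody></table>
</body></html>
"""

doc = parsehtml(html)
tables = [table_rows(t) for t in descendants(doc.root, :table)]
for (i, rows) in enumerate(tables)
    println("Table ", i)
    foreach(println, rows)
end
```

Every cell comes back as a string; the first row of each table holds the header cells, so turning the result into typed columns is left to the caller.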

I'm not sure whether such an example should be part of Gumbo, Cascadia, or even EzXML.jl.

Anyway, none of these projects shows an example with HTML tables, so there is probably room for doc improvement.

Kind regards

PS : related SO post https://stackoverflow.com/questions/42915962/extracting-and-constructing-tables-from-html-files-using-julia

s-celles avatar Jun 20 '19 07:06 s-celles

I wrote this code (which may help those looking for a similar feature), but it is just a (very) quick implementation that probably won't work with more complex HTML pages containing tables.

s-celles avatar Jun 20 '19 08:06 s-celles

Hi Sébastien,

Thanks for opening the issue, I agree this would be a good thing to have. I'd rather not have a dependency on DataFrames in this package, since it's a large dependency that's not necessary for Gumbo's core functionality.

My impression is that the best way to do this would be to implement the Tables.jl interface for HTMLElement{:table}, and then we'll be able to construct DataFrames from HTML tables in a very direct, natural way.

I'm not sure when I'll have time to do this, but I don't think it would be very difficult; if someone else wants to take a crack at it I'd happily accept a pull request. I'm happy to add a dependency on Tables.jl, since it's pretty small.

porterjamesj avatar Jun 20 '19 15:06 porterjamesj

I really like the idea of implementing the Tables.jl interface for HTMLElement{:table}. Pinging @quinnj @davidanthoff

s-celles avatar Jun 20 '19 15:06 s-celles

Yeah, it sounds like a great idea. Happy to help support however I can here. Currently, Tables.jl doesn't have a concept of streaming multiple tables at a time, but as long as there's a way to "select" a single table tag and "stream" that, then it should work pretty well. Happy to chat on slack if anyone wants to brainstorm this.

quinnj avatar Jun 20 '19 15:06 quinnj

@quinnj yeah, I think we're on the same page. I'm imagining that it's up to the user to locate a single <table> element in their HTML and pass that into the DataFrame constructor (or whatever else that uses the tables interface).

I'm actually pretty excited about this idea, since this is a feature request that's come up before, and I love the smooth interoperability between the whole ecosystem that packages like Tables can provide! I'll try to find time to work on it soon, and I'll ask on Slack if I get stuck with anything Tables related.

porterjamesj avatar Jun 20 '19 16:06 porterjamesj

Cool, yeah, just let me know if you run into any issues. Just to get the ball rolling, some things to think about include:

  • I'm not sure it makes sense to define the Tables.jl interface on HTMLElement{:table} directly; perhaps you'd want a dedicated HTMLTable type that could wrap the element node
  • Feel free to overload Tables.table(x::HTMLElement{:table}) for this, or just use your own constructor
  • The initial setup is pretty simple, including:
Tables.istable(::Type{<:HTMLTable}) = true
Tables.rowaccess(::Type{<:HTMLTable}) = true
Tables.rows(table::HTMLTable) = table
  • The trickier part will be implementing Tables.schema(x::HTMLTable), since it doesn't seem like you'll necessarily have the notion of a "schema" in an HTML table; for starters, you could just do Tables.schema(x::HTMLTable) = nothing, which introduces a little performance hit for sinks, but in the case of HTML tables, I don't think it should be significant
  • Apart from that, the other meat is to define proper iteration on HTMLTable; it's probably simplest to just iterate NamedTuples. Again, I'm not sure if there will be issues with HTML tables that don't have column names (you might have to auto-generate them if not), but it should be a pretty straightforward implementation of the iteration protocol
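
The pieces listed above could fit together roughly as in the following sketch. It is an illustration under assumptions, not an implementation that exists in Gumbo.jl: HTMLTable, nodetext and descendants are hypothetical names, rows are extracted eagerly rather than streamed, all cells are kept as strings, column names are auto-generated when the first row has no <th> cells, and every row is assumed to have as many cells as there are column names.

```julia
using Gumbo, Tables

# Hypothetical wrapper type; the name follows the suggestion above and is not part of Gumbo.jl.
struct HTMLTable
    names::Vector{Symbol}
    rows::Vector{Vector{String}}
end

# Concatenated text of a node and all of its descendants.
nodetext(n::HTMLText) = n.text
nodetext(n::HTMLElement) = join(nodetext(c) for c in n.children)
nodetext(::HTMLNode) = ""

# Collect all descendant elements with the given tag, in document order.
function descendants(node::HTMLElement, name::Symbol, found=HTMLElement[])
    for child in node.children
        child isa HTMLElement || continue
        tag(child) == name && push!(found, child)
        descendants(child, name, found)
    end
    return found
end

iscell(n) = n isa HTMLElement && tag(n) in (:th, :td)
hasheader(tr) = any(c -> c isa HTMLElement && tag(c) == :th, tr.children)

# Build the wrapper from a single <table> element chosen by the user.
function HTMLTable(table::HTMLElement{:table})
    trs = descendants(table, :tr)
    rows = Vector{String}[String[String(strip(nodetext(c))) for c in tr.children if iscell(c)]
                          for tr in trs]
    if !isempty(trs) && hasheader(first(trs))
        names = Symbol.(popfirst!(rows))   # first row supplies the column names
    else
        ncols = isempty(rows) ? 0 : length(first(rows))
        names = [Symbol("Column", i) for i in 1:ncols]   # auto-generated names
    end
    return HTMLTable(names, rows)
end

# Tables.jl setup as outlined above; the schema is left unknown.
Tables.istable(::Type{<:HTMLTable}) = true
Tables.rowaccess(::Type{<:HTMLTable}) = true
Tables.rows(t::HTMLTable) = t
Tables.schema(::HTMLTable) = nothing

# Iterate one NamedTuple per data row.
Base.length(t::HTMLTable) = length(t.rows)
function Base.iterate(t::HTMLTable, i::Int=1)
    i > length(t.rows) && return nothing
    return NamedTuple{Tuple(t.names)}(Tuple(t.rows[i])), i + 1
end

doc = parsehtml("<table><tr><th>A</th><th>B</th></tr>" *
                "<tr><td>1</td><td>1.1</td></tr><tr><td>2</td><td>2.1</td></tr></table>")
table = first(descendants(doc.root, :table))
cols = Tables.columntable(HTMLTable(table))
```

Since a Tables.jl source can be handed to any sink, DataFrame(HTMLTable(table)) should work the same way, assuming DataFrames.jl is installed.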

Anyway, hopefully that gets the ball rolling and again, just let me know if you run into any issues.

quinnj avatar Jun 20 '19 16:06 quinnj

Thanks! That all makes sense. I agree there are some tricky parts and some places that'll have to use heuristics and guessing (for schemas, types, etc.). I think it's fine to just "do our best" and then people can clean things up themselves if they end up with messy data. The only thing I'm curious about is what the utility of the wrapper type (HTMLTable) is vs. defining the Tables interface directly on HTMLElement{:table}?
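
As an illustration of the kind of "do our best" type guessing mentioned here, a column of cell strings could be narrowed to integers or floats when every cell parses, and left as strings otherwise. This is a sketch only: guess_column is a hypothetical helper name, and empty or missing cells are not treated specially (they simply force the column to stay strings).

```julia
# Hypothetical helper: return the column as Int or Float64 if every cell parses, else as String.
function guess_column(cells::AbstractVector{<:AbstractString})
    for T in (Int, Float64)
        parsed = [tryparse(T, strip(c)) for c in cells]
        # tryparse returns nothing on failure, so one failure rejects this type.
        any(isnothing, parsed) || return convert(Vector{T}, parsed)
    end
    return String.(cells)
end

ints = guess_column(["1", "2"])
floats = guess_column(["1.1", " 2.1 "])
strings = guess_column(["a", "1"])
```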

porterjamesj avatar Jun 20 '19 17:06 porterjamesj

The main decision there is whether you're comfortable defining iterate(x::HTMLElement{:table}) to iterate NamedTuples. To me, that seems maybe a little weird, hence the suggestion to use an explicit wrapper type that provides a proper object to iterate NamedTuples. But then again, I'm not very familiar with the details of the package, so feel free to make the call.

quinnj avatar Jun 20 '19 17:06 quinnj

Ahh, that makes sense. I didn't realize the Tables interface required overriding the Base iterate function. Given that, I agree a wrapper type makes sense; we probably want iterate to iterate child elements for all HTMLElements.

porterjamesj avatar Jun 20 '19 18:06 porterjamesj

Any update on this? It would be great to get a Table from HTMLElement{:table}.

Nosferican avatar Mar 26 '20 18:03 Nosferican

If we fix #85, we can just use AcuteML, which already supports Tables.jl.

https://github.com/aminya/AcuteML.jl

aminya avatar Jun 25 '20 04:06 aminya

Nothing yet?

Nosferican avatar Jul 17 '23 22:07 Nosferican