Convert HTML file with table(s) to DataFrame
Hello,
I have an HTML file with a table and would like to convert it to a Julia DataFrame.
I was looking for a function similar to the pandas read_html function in Python (which directly returns a list of DataFrames).
Unfortunately, I don't see a similar function in the Julia ecosystem.
In the Gumbo docs I was looking for an example showing how to iterate over the rows and columns of each table.
Here is a basic HTML source file with 2 tables:
<!DOCTYPE html>
<HTML>
<head></head>
<body>
<h1>First table</h1>
<table>
<tbody>
<tr>
<th>
A
</th>
<th>
B
</th>
</tr>
<tr>
<td>
1
</td>
<td>
1.1
</td>
</tr>
<tr>
<td>
2
</td>
<td>
2.1
</td>
</tr>
</tbody>
</table>
<h1>Second table</h1>
<table>
<tbody>
<tr>
<th>
AA
</th>
<th>
BB
</th>
</tr>
<tr>
<td>
10
</td>
<td>
10.1
</td>
</tr>
<tr>
<td>
20
</td>
<td>
20.1
</td>
</tr>
</tbody>
</table>
</body>
</HTML>
I'm not sure whether such an example should be part of Gumbo, Cascadia, or even EzXML.jl.
Anyway, none of these projects shows an example with HTML tables, so there is probably room for doc improvement.
Kind regards
PS : related SO post https://stackoverflow.com/questions/42915962/extracting-and-constructing-tables-from-html-files-using-julia
I wrote this code (which may help those who are looking for a similar feature), but it is just a (very) quick implementation that probably won't work with more complex HTML pages containing tables.
Hi Sébastien,
Thanks for opening the issue, I agree this would be a good thing to have. I'd rather not have a dependency on DataFrames in this package, since it's a large dependency that's not necessary for Gumbo's core functionality.
My impression is that the best way to do this would be to implement the Tables.jl interface for HTMLElement{:table}, and then we'll be able to construct DataFrames from HTML tables in a very direct, natural way.
I'm not sure when I'll have time to do this, but I don't think it would be very difficult; if someone else wants to take a crack at it I'd happily accept a pull request. I'm happy to add a dependency on Tables.jl, since it's pretty small.
I really like the idea of implementing the Tables.jl interface for HTMLElement{:table}.
Pinging @quinnj @davidanthoff
Yeah, it sounds like a great idea. Happy to help support however I can here. Currently, Tables.jl doesn't have a concept of streaming multiple tables at a time, but as long as there's a way to "select" a single table tag and "stream" that, then it should work pretty well. Happy to chat on Slack if anyone wants to brainstorm this.
@quinnj yeah, I think we're on the same page. I'm imagining that it's up to the user to locate a single <table> element in their HTML and pass that into the DataFrame constructor (or whatever else that uses the tables interface).
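For anyone following along, here is a minimal sketch of the "locate a single <table> element" step described above. It only assumes that Gumbo's HTMLElement stores its child nodes in a children field; find_tables! is a hypothetical helper written for this example, not a function provided by Gumbo.

```julia
using Gumbo

# Hypothetical helper (not part of Gumbo): depth-first search that
# collects every <table> element found below `node`.
function find_tables!(found, node)
    if node isa HTMLElement
        node isa HTMLElement{:table} && push!(found, node)
        for child in node.children
            find_tables!(found, child)
        end
    end
    return found  # text nodes and other node types are ignored
end

html = """
<html><body>
<h1>First table</h1>
<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>1.1</td></tr></table>
<h1>Second table</h1>
<table><tr><th>AA</th><th>BB</th></tr><tr><td>10</td><td>10.1</td></tr></table>
</body></html>
"""

doc = parsehtml(html)
tables = find_tables!(HTMLElement[], doc.root)

# The user picks the one table they care about and passes it on.
second = tables[2]
```

Whatever consumes the element afterwards (a DataFrame constructor or any other Tables.jl sink) would then only ever see one table at a time, which sidesteps the "multiple tables" limitation mentioned earlier.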
I'm actually pretty excited about this idea, since this is a feature request that's come up before, and I love the smooth interoperability between the whole ecosystem that packages like Tables can provide! I'll try to find time to work on it soon, and I'll ask on Slack if I get stuck with anything Tables related.
Cool, yeah, just let me know if you run into any issues. Just to get the ball rolling, some things to think about include:
- I'm not sure it makes sense to define the Tables.jl interface on HTMLElement{:table} directly; perhaps you'd want a dedicated HTMLTable type that could wrap the element node
- Feel free to overload Tables.table(x::HTMLElement{:table}) for this, or just use your own constructor
- The initial setup is pretty simple, including:
  Tables.istable(::Type{<:HTMLTable}) = true
  Tables.rowaccess(::Type{<:HTMLTable}) = true
  Tables.rows(table::HTMLTable) = table
- The trickier part will be implementing Tables.schema(x::HTMLTable), since it doesn't seem like you'll necessarily have the notion of a "schema" in an HTML table; for starters, you could just do Tables.schema(x::HTMLTable) = nothing, which introduces a little performance hit for sinks, but in the case of HTML tables, I don't think it should be significant
- Apart from that, the other meat is to define proper iteration on HTMLTable; probably simplest to just iterate NamedTuples. Again, I'm not sure if there will be issues with HTML tables that don't have column names (you might have to auto-generate them if not), but implementing the iteration protocol should be pretty straightforward
Anyway, hopefully that gets the ball rolling and again, just let me know if you run into any issues.
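To make the list above concrete, here is a rough sketch of what such a wrapper could look like. Nothing in it exists in Gumbo today: HTMLTable, celltext, and descendants! are hypothetical names, and the choices to extract the cell text eagerly, to treat a first row containing <th> cells as the header, and to pad short rows with missing are assumptions made for the example. All values stay as strings, and nested tables, colspan/rowspan, and duplicate or empty header names are not handled.

```julia
using Gumbo
using Tables

# Hypothetical helpers, not part of Gumbo.
# Concatenate all text found below a node.
celltext(node::HTMLText) = node.text
celltext(node::HTMLElement) = join(celltext(c) for c in node.children)
celltext(node) = ""

# Collect all descendants with tag T, depth first.
function descendants!(out, node, T::Symbol)
    node isa HTMLElement || return out
    for child in node.children
        child isa HTMLElement{T} && push!(out, child)
        descendants!(out, child, T)
    end
    return out
end

const Cell = Union{HTMLElement{:th}, HTMLElement{:td}}

# Assumed wrapper type: column names plus the text of every data row.
struct HTMLTable
    names::Vector{Symbol}
    rows::Vector{Vector{String}}
end

function HTMLTable(elem::HTMLElement{:table})
    trs = descendants!(HTMLElement[], elem, :tr)
    texts = Vector{String}[
        String[String(strip(celltext(c))) for c in tr.children if c isa Cell]
        for tr in trs
    ]
    # Heuristic: a first row that contains <th> cells is the header.
    if !isempty(trs) && any(c -> c isa HTMLElement{:th}, trs[1].children)
        return HTMLTable(Symbol.(texts[1]), texts[2:end])
    end
    # No header row: auto-generate column names.
    ncols = maximum(length, texts; init = 0)
    return HTMLTable([Symbol("Column", j) for j in 1:ncols], texts)
end

# Iteration protocol: each row is a NamedTuple.
Base.length(t::HTMLTable) = length(t.rows)
function Base.iterate(t::HTMLTable, i = 1)
    i > length(t.rows) && return nothing
    # Pad short rows with `missing` so every row has one value per column.
    vals = ntuple(j -> get(t.rows[i], j, missing), length(t.names))
    return NamedTuple{Tuple(t.names)}(vals), i + 1
end

# Tables.jl interface, as outlined in the list above.
Tables.istable(::Type{<:HTMLTable}) = true
Tables.rowaccess(::Type{<:HTMLTable}) = true
Tables.rows(t::HTMLTable) = t
Tables.schema(t::HTMLTable) = nothing  # schema unknown; sinks infer it

doc = parsehtml("<table><tr><th>A</th><th>B</th></tr>" *
                "<tr><td>1</td><td>1.1</td></tr>" *
                "<tr><td>2</td><td>2.1</td></tr></table>")
elem = first(descendants!(HTMLElement[], doc.root, :table))
tbl = HTMLTable(elem)
rows = collect(tbl)               # Vector of NamedTuples
cols = Tables.columntable(tbl)    # NamedTuple of column vectors
```

With DataFrames loaded, DataFrame(tbl) should work the same way as Tables.columntable does here, since both go through the Tables.jl interface. Parsing the string columns into numbers is left to the caller.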
Thanks! That all makes sense. I agree there are some tricky parts and some places that'll have to use heuristics and guessing (for schemas, types, etc.). I think it's fine to just "do our best" and then people can clean things up themselves if they end up with messy data. The only thing I'm curious about is what the utility of the wrapper type (HTMLTable) is vs. defining the Tables interface directly on HTMLElement{:table}?
The main decision there is whether you're comfortable defining iterate(x::HTMLElement{:table}) to iterate NamedTuples. To me, that seems maybe a little weird, hence the suggestion to use an explicit wrapper type that provides a proper object to iterate NamedTuples. But then again, I'm not familiar w/ the details of the package very well, so feel free to make the call.
Ahh, that makes sense. I didn't realize the Tables interface required overriding the Base iterate function. I agree a wrapper type makes sense given that; we probably want iterate to iterate child elements for all HTMLElements.
Any update on this? It would be great to get a Table from HTMLElement{:table}.
If we fix #85, we can just use AcuteML which already supports Tables.jl.
https://github.com/aminya/AcuteML.jl
Nothing yet?