Wikipedia
Accessing Tables?
Is there any way to access information from a table on a page?
I can go through the HTML and use BeautifulSoup, but the tables on some pages don't have unique identifiers that I can use to select a specific table, and it's become a bit of a pain.
Did you ever figure this out? I need to do the same.
Same here. Does anyone have a solution?
^ same
I solved this by writing a function from scratch to export tables.
I'm using MediaWikiAPI, which is a more up-to-date fork of this project. It has the same structure, so I think you should get the same results by replacing MediaWikiAPI with wikipedia.
Sample of how it works:
# imports needed by this sample
from bs4 import BeautifulSoup
from mediawikiapi import MediaWikiAPI
# load page (PageWithTables is a placeholder for the page title)
mediawikiapi = MediaWikiAPI()
test_page = mediawikiapi.page(PageWithTables)
# scrape the HTML with BeautifulSoup to find tables
soup = BeautifulSoup(test_page.html(), 'html.parser')
tables = soup.find_all("table", {"class": "wikitable"})
# select target table and apply custom function to export it to pandas
target_table = tables[0]
df_test = wikitable_to_dataframe(target_table)
Here's the full procedure and the function wikitable_to_dataframe:
https://gist.github.com/giovannibonaccorsi/6ba30ec92894130c67258ffc6e09c9a4#file-export_wikipedia_tables_to_pandas-py
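The body of wikitable_to_dataframe is not shown in this thread, so here is a minimal sketch of what such a helper might look like. It is not the code from the gist. It assumes the first row of the table is the header and that no cells use rowspan or colspan, and the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd


def wikitable_to_dataframe(table):
    """Convert a BeautifulSoup <table> tag into a pandas DataFrame.

    Simplified sketch: treats the first non-empty row as the header and
    ignores rowspan/colspan and nested tables.
    """
    rows = []
    for tr in table.find_all("tr"):
        # Header cells are <th>, data cells are <td>; keep both in order.
        cells = tr.find_all(["th", "td"])
        if cells:
            rows.append([cell.get_text(" ", strip=True) for cell in cells])
    if not rows:
        return pd.DataFrame()
    header, body = rows[0], rows[1:]
    # Skip rows whose cell count differs from the header (e.g. merged cells).
    body = [row for row in body if len(row) == len(header)]
    return pd.DataFrame(body, columns=header)


# Made-up sample table standing in for test_page.html()
sample_html = """
<table class="wikitable">
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
df = wikitable_to_dataframe(soup.find("table", {"class": "wikitable"}))
print(df)
```

All values come out as strings, so numeric columns would still need something like pd.to_numeric afterwards.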
@gibbbone Thanks!
The gist link you posted for the wikitable_to_dataframe code is dead, but I found that you can use pandas to create a list of the tables as DataFrames:
import pandas as pd
pd.read_html(test_page.url, attrs={"class": "wikitable"})
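For the original problem of tables with no unique identifier, one option is to select a table by its header text instead of by its position on the page. The sketch below does this with BeautifulSoup. The helper name find_tables_by_header and the sample HTML are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup


def find_tables_by_header(soup, header_text):
    """Return every <table> with a <th> cell containing header_text."""
    wanted = header_text.lower()
    matches = []
    for table in soup.find_all("table"):
        headers = [th.get_text(" ", strip=True).lower()
                   for th in table.find_all("th")]
        # Case-insensitive substring match against each header cell.
        if any(wanted in h for h in headers):
            matches.append(table)
    return matches


# Made-up page with two tables that share the same class and have no id
sample_html = """
<table class="wikitable">
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
</table>
<table class="wikitable">
  <tr><th>Year</th><th>Event</th></tr>
  <tr><td>2001</td><td>example</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
found = find_tables_by_header(soup, "event")
print(len(found))
```

If you are using pandas instead, pd.read_html also takes a match argument (a string or regex) that keeps only the tables containing matching text, which serves a similar purpose.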