
Accessing Tables?

Open NYamaguchi415 opened this issue 9 years ago • 5 comments

Is there any way to access information from a table on a page?

I can go through the HTML with BeautifulSoup, but the tables on some pages don't have unique identifiers that I can use to select a specific table, and it's become a bit of a pain.

NYamaguchi415 avatar Dec 09 '15 05:12 NYamaguchi415

Did you ever figure this out? I need to do the same.

vesper8 avatar Dec 12 '16 09:12 vesper8

Same here. Does anyone have a solution?

scholi avatar May 17 '18 09:05 scholi

^ same

stella-lu avatar May 21 '18 10:05 stella-lu

I solved this by writing a function from scratch to export tables.

I'm using MediaWikiAPI, which is a more recently updated fork of this project. It has the same structure, so I think you should get the same results by substituting wikipedia for MediaWikiAPI.

Sample of how it works:

from bs4 import BeautifulSoup
from mediawikiapi import MediaWikiAPI

# load page (PageWithTables is a placeholder for the page title)
mediawikiapi = MediaWikiAPI()
test_page = mediawikiapi.page(PageWithTables)

# scrape the HTML with BeautifulSoup to find tables
soup = BeautifulSoup(test_page.html(), 'html.parser')
tables = soup.find_all("table", {"class": "wikitable"})

# select target table and apply custom function to export it to pandas
target_table = tables[0]
df_test = wikitable_to_dataframe(target_table)

Here's the full procedure and the function wikitable_to_dataframe: https://gist.github.com/giovannibonaccorsi/6ba30ec92894130c67258ffc6e09c9a4#file-export_wikipedia_tables_to_pandas-py
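
In case the gist is unreachable, the table-extraction step can be roughly sketched with only the Python standard library. This is not the original wikitable_to_dataframe: the names WikiTableParser and wikitable_rows are hypothetical, and the sketch assumes well-formed HTML, ignores rowspan/colspan, and does not handle nested tables.

```python
from html.parser import HTMLParser


class WikiTableParser(HTMLParser):
    """Hypothetical helper: collect cell text of every <table class="wikitable">."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, each a list of rows
        self._rows = None   # rows of the table being read, or None
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "wikitable" in classes and self._rows is None:
            self._rows = []
        elif self._rows is not None:
            if tag == "tr":
                self._row = []
            elif tag in ("th", "td") and self._row is not None:
                self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell of a wikitable.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._rows is None:
            return
        if tag in ("th", "td") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table":
            self.tables.append(self._rows)
            self._rows = None


def wikitable_rows(html):
    """Return a list of tables; each table is a list of rows of cell strings."""
    parser = WikiTableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Small inline sample so the sketch runs without network access.
sample = """
<table class="wikitable sortable">
<tr><th>City</th><th>Country</th></tr>
<tr><td>Rome</td><td><a href="/wiki/Italy">Italy</a></td></tr>
</table>
<table class="infobox"><tr><td>ignored</td></tr></table>
"""
print(wikitable_rows(sample)[0])
```

If the first row holds the headers, a table returned this way can be handed to pandas with pd.DataFrame(rows[1:], columns=rows[0]), provided every row has the same number of cells.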

gibbbone avatar Jun 12 '18 22:06 gibbbone

@gibbbone Thanks!

The gist link you posted for the wikitable_to_dataframe code is dead, but I found that you can use pandas to create a list of the tables as DataFrames:

import pandas as pd
pd.read_html(test_page.url, attrs={"class": "wikitable"})

stevans avatar Sep 14 '20 21:09 stevans