Wikipedia-API
Page HTML does not include hyperlinks and lists
code used:
import wikipediaapi
wiki_html = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.HTML)
page_html = wiki_html.page("List_of_anime_distributed_in_the_United_States")
print(page_html.text)
example part of result:
Even though these films weren't very successful at the time, due to limited release, they did get positive reviews by critics and <i>Akira</i> received a cult following. Most of these films did get higher-quality dubs later on.
</p><p>A list of anime first distributed in the U.S. during the 1980s includes:
</p>
<h2>1990s</h2>
<p>The 1990s, was the period in which anime reached mainstream popularity in the U.S. market
example part of actual html:
Even though these films weren't very successful at the time, due to limited release, they did get positive reviews by critics and <i>Akira</i> received a cult following. Most of these films did get higher-quality dubs later on.
</p><p>A list of anime first distributed in the U.S. during the 1980s includes:
</p>
<div class="div-col columns column-width" style="-moz-column-width: 22em; -webkit-column-width: 22em; column-width: 22em;">
<ul><li><i><a href="/wiki/Huckleberry_no_B%C5%8Dken" title="Huckleberry no Bōken">Adventures of Huckleberry Finn</a></i></li>
<li><i><a href="/wiki/Pinocchio:_The_Series#English_versions" title="Pinocchio: The Series">The Adventures of Pinocchio</a></i></li>
As you can see, this is not the real HTML: many tags, such as hyperlinks, and the entire list section are missing.
This makes the HTML part of this API almost unusable.
Any fixes for this?
I ran into the same problem. The good news is using the MediaWiki Get HTML endpoint directly is quite simple.
Their Python example is incorrect, since the endpoint returns HTML and not JSON. It should be like this:
# Get the content of the Jupiter article on English Wikipedia in HTML
import requests
url = 'https://en.wikipedia.org/w/rest.php/v1/page/Jupiter/html'
headers = {
    'User-Agent': 'MediaWiki REST API docs examples/0.1 (https://www.mediawiki.org/wiki/API_talk:REST_API)'
}
response = requests.get(url, headers=headers)
data = response.text
print(data)
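Once you have the full HTML, the hyperlinks that were missing from the Wikipedia-API output can be pulled out with the standard library alone. The following is only a rough sketch: `LinkCollector` and `extract_links` are names I made up for illustration, and the sample string is the list snippet from the question rather than a live response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample taken from the "actual html" excerpt in the question.
sample = (
    '<ul><li><i><a href="/wiki/Huckleberry_no_B%C5%8Dken" '
    'title="Huckleberry no Bōken">Adventures of Huckleberry Finn</a></i></li>'
    '<li><i><a href="/wiki/Pinocchio:_The_Series#English_versions" '
    'title="Pinocchio: The Series">The Adventures of Pinocchio</a></i></li></ul>'
)

for href, text in extract_links(sample):
    print(href, "->", text)
```

To use it on a real page, pass `response.text` from the request above to `extract_links` instead of `sample`. The hrefs are returned exactly as they appear in the markup, so any relative ones would still need to be resolved against the site URL.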