
Page HTML does not include hyperlinks and lists

Open laundmo opened this issue 4 years ago • 3 comments

code used:

import wikipediaapi
wiki_html = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.HTML)
page_html = wiki_html.page("List_of_anime_distributed_in_the_United_States")
print(page_html.text)

example part of result:

Even though these films weren't very successful at the time, due to limited release, they did get positive reviews by critics and <i>Akira</i> received a cult following. Most of these films did get higher-quality dubs later on.
</p><p>A list of anime first distributed in the U.S. during the 1980s includes:
</p>

<h2>1990s</h2>
<p>The 1990s, was the period in which anime reached mainstream popularity in the U.S. market

example part of the actual HTML of the page:

Even though these films weren't very successful at the time, due to limited release, they did get positive reviews by critics and <i>Akira</i> received a cult following. Most of these films did get higher-quality dubs later on.
</p><p>A list of anime first distributed in the U.S. during the 1980s includes:
</p>
<div class="div-col columns column-width" style="-moz-column-width: 22em; -webkit-column-width: 22em; column-width: 22em;">
<ul><li><i><a href="/wiki/Huckleberry_no_B%C5%8Dken" title="Huckleberry no Bōken">Adventures of Huckleberry Finn</a></i></li>
<li><i><a href="/wiki/Pinocchio:_The_Series#English_versions" title="Pinocchio: The Series">The Adventures of Pinocchio</a></i></li>

As you can see, this is not the full HTML: many tags, such as hyperlinks, are missing, and so is the entire list section.

This makes the HTML part of this API almost unusable.
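A quick way to confirm the difference is to count opening tags in each excerpt. The following is only a minimal sketch using the standard library's `html.parser`; the names `TagCounter` and `count_tags` are made up for illustration, and the two sample strings are shortened copies of the excerpts above.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the opening tags seen while parsing an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name to number of opening tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Shortened copy of the text returned by wikipediaapi (hypothetical variable name).
extract_fragment = (
    "</p><p>A list of anime first distributed in the U.S. during the 1980s includes:\n"
    "</p>\n\n<h2>1990s</h2>\n<p>The 1990s, was the period"
)

# Shortened copy of the HTML of the actual page (hypothetical variable name).
page_fragment = (
    '<ul><li><i><a href="/wiki/Huckleberry_no_B%C5%8Dken" '
    'title="Huckleberry no B\u014dken">Adventures of Huckleberry Finn</a></i></li>'
)

for name, fragment in [("extract", extract_fragment), ("page", page_fragment)]:
    counts = count_tags(fragment)
    print(name, "links:", counts["a"], "list items:", counts["li"])
```

Running the same count on the full `page_html.text` and on the real page source should show the gap more clearly than comparing by eye.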

laundmo, Mar 09 '20 05:03

Any fixes for this?

chiranchimmili, Jun 25 '21 23:06

I ran into the same problem. The good news is that calling the MediaWiki REST API's Get HTML endpoint directly is quite simple.

Their Python example is incorrect, since the endpoint returns HTML and not JSON. It should look like this:

# Get the content of the Jupiter article on English Wikipedia in HTML
import requests

url = 'https://en.wikipedia.org/w/rest.php/v1/page/Jupiter/html'
headers = {
    'User-Agent': 'MediaWiki REST API docs examples/0.1 (https://www.mediawiki.org/wiki/API_talk:REST_API)'
}

response = requests.get(url, headers=headers)
# The endpoint returns the page as an HTML string, so read .text instead of calling .json()
data = response.text

print(data)
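To fetch the page from the original report instead of Jupiter, the same endpoint can be wrapped in a small helper. This is a rough sketch, not a tested replacement for the library: `build_html_url` and `fetch_page_html` are hypothetical names, the User-Agent string is a placeholder you should replace with your own, and it uses the standard library's `urllib` rather than `requests` only to keep the snippet dependency-free. Percent-encoding the title with `quote` is an assumption about what the endpoint expects for titles containing spaces, slashes, or non-ASCII characters.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Same URL pattern as the example above, with the language and title filled in.
REST_HTML_URL = "https://{lang}.wikipedia.org/w/rest.php/v1/page/{title}/html"


def build_html_url(title, lang="en"):
    """Build the Get HTML endpoint URL for a page title."""
    # Wikipedia titles use underscores for spaces; safe="" also escapes "/".
    encoded = quote(title.replace(" ", "_"), safe="")
    return REST_HTML_URL.format(lang=lang, title=encoded)


def fetch_page_html(title, lang="en",
                    user_agent="example-script/0.1 (placeholder contact)",
                    timeout=10):
    """Download the rendered HTML of a page as a string.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = Request(build_html_url(title, lang),
                      headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Example call (performs a network request, so it is left commented out):
# html = fetch_page_html("List_of_anime_distributed_in_the_United_States")
# print(html[:500])
```

The returned string should contain the `<a>` and `<ul>` elements that are missing from `page_html.text`.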

tlietz, Feb 17 '24 00:02