Wikipedia
is it possible to find a page by URL?
Currently this module can find a page by title or by numeric id. However, if you're crawling, all you have is a link. Looking up by link would be especially useful for names with disambiguation. Would it be possible to add a url argument to wikipedia.page that is mutually exclusive with title and id?
Did you solve it?
Hey @boompig, @nttmac,
I had the same problem and solved it as follows:
from bs4 import BeautifulSoup
import requests
url = "https://en.wikipedia.org/wiki/U.S._state"
response = requests.get(url)
# Name the parser explicitly; without it BeautifulSoup warns and picks one itself
bs_object = BeautifulSoup(response.text, "html.parser")
paragraphs = bs_object.select('p')
paragraphs = [paragraph.text for paragraph in paragraphs]
With these lines of code you are able to extract all the plain text paragraphs of your Wikipedia page.
In addition, if you want to remove the citation markers (e.g. [1], [2], [3]) and drop empty paragraphs, you can also run:
import re
paragraphs = [re.sub(r'\[\d+\]','',paragraph.strip()) for paragraph in paragraphs if paragraph.strip() != '']
I hope this will help.
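The regex above only matches numeric markers. As a rough sketch, the same idea can be wrapped in a helper that also drops lettered notes and [citation needed] tags. The helper name strip_citations and the exact set of patterns are assumptions for illustration, not something the page markup guarantees:

```python
import re

# Matches numeric markers such as [1] or [23], single-letter notes such as [a],
# "[note 1]" style notes, and "[citation needed]" tags.
# This pattern list is an assumption; extend it for the markers in your pages.
CITATION_RE = re.compile(r"\[(?:\d+|[a-z]|note \d+|citation needed)\]")


def strip_citations(text: str) -> str:
    """Remove citation-style bracket markers and trim surrounding whitespace."""
    return CITATION_RE.sub("", text).strip()


sample = "A state is a political entity.[1][a] There are 50.[citation needed]\n"
print(strip_citations(sample))
```

Other bracketed text, such as [sic], does not match the pattern and is left alone.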
I had the same problem and solved it like this:
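import wikipedia
from urllib.parse import unquote
url = 'https://en.wikipedia.org/wiki/David_Walker_(journalist)'
# Take everything after the last '/' and decode percent-escapes (e.g. %C3%A9)
title = unquote(url.rsplit('/', 1)[-1])
# auto_suggest=False looks up this exact title instead of a suggested one
page = wikipedia.page(title=title, auto_suggest=False)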
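Splitting on the last '/' can go wrong when the link carries a #fragment or ?query, or when the title itself contains a slash (e.g. AC/DC). Here is a minimal sketch of a sturdier extraction using only the standard library. The helper name title_from_url is hypothetical, and it assumes article links use the /wiki/<title> path layout:

```python
from urllib.parse import urlparse, unquote


def title_from_url(url: str) -> str:
    """Return the article title encoded in a /wiki/<title> style URL."""
    path = urlparse(url).path  # the path excludes any ?query and #fragment
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a /wiki/ article URL: {url!r}")
    # Decode percent-escapes and turn underscores back into spaces.
    return unquote(path[len(prefix):]).replace("_", " ")


print(title_from_url("https://en.wikipedia.org/wiki/David_Walker_(journalist)"))
```

The returned string can then be passed to wikipedia.page(title=..., auto_suggest=False) as in the reply above.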