Wikipedia
Is it possible to find a page by URL?
Currently this module has the option to find a page by title or a numeric id. However, if you're crawling, you have just a link. This would be especially useful for names with disambiguation. Would it be possible to add an argument to wikipedia.page which allows for a url parameter which would be exclusive with title and id?
Did you solve it?
Hey @boompig, @nttmac,
I had the same problem and solved it as follows:
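from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/U.S._state"
response = requests.get(url)
response.raise_for_status()  # fail early on HTTP errors
# Name the parser explicitly; omitting it makes BeautifulSoup emit a warning
bs_object = BeautifulSoup(response.text, "html.parser")
# Collect the text of every <p> element
paragraphs = [paragraph.text for paragraph in bs_object.select("p")]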
These lines extract the plain text of every paragraph on the Wikipedia page.
In addition, if you want to remove the numeric citation markers (e.g. [1], [2], [3]) from the paragraphs and drop the empty ones, you can also run:
import re
paragraphs = [re.sub(r'\[\d+\]','',paragraph.strip()) for paragraph in paragraphs if paragraph.strip() != '']
I hope this will help.
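For anyone reusing the citation cleanup: the pattern above only matches numeric markers. The sketch below wraps it in a small function and widens it to a few other bracketed markers that can show up in article text. The name strip_citations and the exact set of extra markers (single-letter notes, "note N", "citation needed") are my assumptions, not something the earlier snippet covers.

```python
import re

# Numeric markers such as [1], single-letter notes such as [a],
# "[note 2]" style notes, and "[citation needed]".
# This set of alternatives is an assumption; extend it as needed.
CITATION = re.compile(r"\[(?:\d+|[a-z]|note \d+|citation needed)\]")


def strip_citations(text):
    """Remove bracketed citation markers and surrounding whitespace (hypothetical helper)."""
    return CITATION.sub("", text).strip()


def clean_paragraphs(paragraphs):
    """Strip markers from each paragraph and drop the ones that end up empty."""
    cleaned = (strip_citations(p) for p in paragraphs)
    return [p for p in cleaned if p]
```

Calling clean_paragraphs(paragraphs) on the list built earlier should give the same result as the one-liner for pages that only use numeric markers.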
I had the same problem and solved it like this:
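import wikipedia
from urllib.parse import unquote, urlparse

url = 'https://en.wikipedia.org/wiki/David_Walker_(journalist)'
# Take everything after "/wiki/" and decode percent-escapes (e.g. %C3%A9)
title = unquote(urlparse(url).path.split('/wiki/', 1)[1])
# auto_suggest=False keeps the library from swapping in a search suggestion
page = wikipedia.page(title=title, auto_suggest=False)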
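If you are crawling, links will not always have the short /wiki/Title shape, so it may help to isolate the URL-to-title step in one function. Below is a minimal sketch using only the standard library. The name title_from_url is hypothetical and not part of the wikipedia module; it assumes links are either /wiki/<Title> or /w/index.php?title=<Title>.

```python
from urllib.parse import parse_qs, unquote, urlparse


def title_from_url(url):
    """Return the article title encoded in a Wikipedia URL (hypothetical helper)."""
    parsed = urlparse(url)  # any "#Section" fragment is split off here
    if parsed.path.startswith("/wiki/"):
        # Short form: /wiki/<Title>. The title itself may contain slashes,
        # so keep everything after the prefix and decode percent-escapes.
        title = unquote(parsed.path[len("/wiki/"):])
    else:
        # Long form: /w/index.php?title=<Title>. parse_qs already decodes
        # percent-escapes in the value.
        values = parse_qs(parsed.query).get("title")
        title = values[0] if values else ""
    if not title:
        raise ValueError(f"no article title found in {url!r}")
    # Underscores in URLs stand for spaces in titles.
    return title.replace("_", " ")
```

The result can then be passed along as before, e.g. wikipedia.page(title=title_from_url(url), auto_suggest=False).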