Is it possible to find a page by URL?

Open boompig opened this issue 4 years ago • 3 comments

Currently this module can find a page by title or by numeric id. However, if you're crawling, you only have a link. Lookup by URL would be especially useful for names with disambiguation. Would it be possible to add a url argument to wikipedia.page that is mutually exclusive with title and id?

boompig avatar May 26 '20 18:05 boompig

Did you solve it?

nttmac avatar Jul 21 '20 13:07 nttmac

Hey @boompig , @nttmac ,

I had the same problem and solved it as follows:

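from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/U.S._state"
response = requests.get(url)
response.raise_for_status()  # fail early on HTTP errors
# Name the parser explicitly; omitting it triggers a bs4 warning.
bs_object = BeautifulSoup(response.text, "html.parser")
# Collect the text of every <p> element on the page.
paragraphs = [paragraph.text for paragraph in bs_object.select("p")]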

These lines extract the plain text of every paragraph on the Wikipedia page.

In addition, if you want to remove the citation markers (e.g. [1], [2], [3]) from the paragraphs, you can also run:

import re
# Strip citation markers such as [1] and drop empty paragraphs.
paragraphs = [re.sub(r'\[\d+\]', '', paragraph.strip()) for paragraph in paragraphs if paragraph.strip() != '']

I hope this helps.

antoniolanza1996 avatar Sep 13 '20 19:09 antoniolanza1996

I had the same problem and solved it like this:

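import wikipedia

url = 'https://en.wikipedia.org/wiki/David_Walker_(journalist)'
# The title is everything after the last '/' in the URL.
title = url[url.rindex('/') + 1:]
# auto_suggest=False asks for this exact title instead of a suggested one.
page = wikipedia.page(title=title, auto_suggest=False)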
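
A slightly more defensive sketch of the same idea, in case it is useful. Splitting on the last '/' breaks for titles that themselves contain a slash (e.g. AC/DC), and it leaves percent-escapes and #fragments in the title. The helper below is hypothetical (title_from_url is not part of the wikipedia package) and assumes the link uses the usual /wiki/<title> path; it uses only the standard library.

```python
from urllib.parse import unquote, urlsplit


def title_from_url(url: str) -> str:
    """Return the page title encoded in a /wiki/<title> style URL.

    Hypothetical helper, not part of the wikipedia package.
    """
    path = urlsplit(url).path  # drops any ?query and #fragment
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a /wiki/ article URL: {url!r}")
    # Decode percent-escapes (e.g. %C3%A9) and turn underscores into spaces.
    return unquote(path[len(prefix):]).replace("_", " ")


print(title_from_url("https://en.wikipedia.org/wiki/David_Walker_(journalist)"))
```

The returned string should then be usable the same way as above, i.e. wikipedia.page(title=title_from_url(url), auto_suggest=False), though I have only reasoned about that call, not tested it against every kind of title.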

cinek1 avatar Jan 24 '21 13:01 cinek1