Website login functionality
Issue by nup002
Fri Jun 29 20:26:48 2018
Originally opened as https://github.com/codelucas/newspaper/issues/587
This is not so much an issue as a suggestion/request. It came up when I needed to pull articles from a website with a paywall. I have a user account, but could not for the life of me figure out how to get a logged-in session going in Newspaper.
If the solution is simple and I am just too daft to understand it, feel free to offer suggestions.
Comment by nup002
Mon Jul 2 08:14:06 2018
I think I have found a workaround. It simply requires all requests to be made with a Session object. You would first create a Session object, log in to the website with the Session using a tool such as Selenium, and then pass the Session object to Newspaper to be used in all future requests.
Changes needed:
In configuration.py:
- Add "session = requests.Session()" as a new parameter.
In network.py:
- Add "session = config.session" to function "get_html_2XX_only()".
- Replace "response = requests.get" with "response = session.get" in function "get_html_2XX_only()".
- Add "self.session = config.session" to class "MRequest", function "__init__()".
- Replace "self.resp = requests.get" with "self.resp = self.session.get" in class "MRequest", function "send()".
Once you have logged in to your Session object with Selenium, set the session parameter in configuration.py to that object.
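For readers who would rather not patch the library, a related approach can be sketched without touching configuration.py or network.py: copy the cookies from the logged-in browser into a requests.Session, download the page with that session, and hand the HTML to Newspaper. This is only a sketch under assumptions. The helper names (cookie_kwargs, session_from_browser_cookies, fetch_article) are made up for illustration, and it assumes the site accepts the browser's cookies from a plain HTTP client.

```python
from typing import Iterable, Mapping


def cookie_kwargs(cookie: Mapping) -> dict:
    """Map one Selenium-style cookie dict to keyword arguments for
    requests' session.cookies.set(); absent optional fields are skipped."""
    kwargs = {"name": cookie["name"], "value": cookie["value"]}
    for field in ("domain", "path"):
        if cookie.get(field):
            kwargs[field] = cookie[field]
    return kwargs


def session_from_browser_cookies(cookies: Iterable[Mapping]):
    """Build a requests.Session that carries the given browser cookies."""
    import requests  # third-party; imported here so cookie_kwargs stays stdlib-only

    session = requests.Session()
    for cookie in cookies:
        session.cookies.set(**cookie_kwargs(cookie))
    return session


def fetch_article(url: str, session):
    """Download with the logged-in session, then let Newspaper parse the HTML."""
    from newspaper import Article  # third-party

    response = session.get(url, timeout=10)
    response.raise_for_status()
    article = Article(url)
    # input_html makes Newspaper skip its own (unauthenticated) download
    article.download(input_html=response.text)
    article.parse()
    return article
```

After logging in with Selenium, driver.get_cookies() returns a list of dicts in the shape expected here, so session_from_browser_cookies(driver.get_cookies()) followed by fetch_article(url, session) would be the typical call sequence.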
Comment by chsuong
Tue Jul 10 05:09:25 2018
@nup002, what about a website that uses logins and cookies?
Comment by timzhangau
Thu Sep 13 12:56:59 2018
I would also like to know of any good practices for using newspaper with websites that require a login, especially OAuth2 authentication.
Comment by BastianZim
Wed Jan 23 15:39:16 2019
In case anyone finds this via Google, check out #668 as well; it has some helpful suggestions.
Comment by karam93
Fri Jan 25 12:34:01 2019
@nup002, could you post an example of how you stay logged in and bypass cookies?
https://github.com/AndyTheFactory/newspaper4k/issues/668 has not been created yet afaict
Perhaps you mean https://github.com/codelucas/newspaper/issues/668 but I am not sure how that is relevant?
I have looked at the examples here https://github.com/johnbumgarner/newspaper3_usage_overview
It looks like they are able to get past a page that requires a confirmation step by using a combination of Selenium WebDriver, Newspaper, and Beautiful Soup.
Is it possible to pass cookie key and value when making a request with Newspaper 4k?
For example, this page uses a confirmation message: https://www.skysports.com/football/live-blog/11661/12476234/transfer-centre-live-luis-guilherme-joao-palhinha-jadon-sancho-latest
Once it is accepted, the following cookies are set:
consentUUID, euconsent-v2
Once cookies are set the confirmation page is no longer shown.
Is it possible for newspaper4k to be able to pass cookie keys and values across? Perhaps there is a better way to achieve the same?
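One way this could work, sketched under assumptions: send the cookies yourself with requests and pass the resulting HTML to Newspaper4k through input_html. The helper names (parse_cookie_header, fetch_with_cookies) are hypothetical, and the cookie values have to be copied from a browser that has already accepted the dialog. If the site builds the consent page with JavaScript, cookies alone may not be enough.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header: str) -> dict:
    """Turn a 'name=value; name2=value2' string (as copied from the
    browser's developer tools) into a plain name -> value dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


def fetch_with_cookies(url: str, cookie_header: str):
    """Request the page with the cookies already set, then parse the HTML."""
    import requests  # third-party
    import newspaper  # third-party (newspaper4k)

    response = requests.get(
        url,
        cookies=parse_cookie_header(cookie_header),
        timeout=10,
    )
    response.raise_for_status()
    # Newspaper4k parses the supplied HTML instead of downloading the page itself
    return newspaper.article(url, input_html=response.text, language="en")
```

A call would look like fetch_with_cookies(url, "consentUUID=...; euconsent-v2=..."), with the elided values replaced by the ones from your own browser session.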
Just to follow up on this, I was able to follow the Playwright example here: https://github.com/AndyTheFactory/newspaper4k/issues/220
Basically, I am using Playwright to click through any cookie compliance dialog while making sure the same context is used when checking for the existence of the relevant content. Note that the button in this case is contained within an iframe.
from playwright.sync_api import sync_playwright
import newspaper


def accept_cookies_and_fetch_article(url):
    # Use Playwright to dismiss the cookie consent dialog and fetch the rendered page
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # Set headless=False to watch the browser actions
        # Create a new incognito browser context
        context = browser.new_context()
        # Create a new page inside the context
        page = context.new_page()
        page.goto(url)
        # Click the consent button, which lives inside an iframe
        page.frame_locator("iframe[title=\"SP Consent Message\"]").get_by_label("Essential cookies only").click()
        content = page.content()
        # Dispose of the context once it is no longer needed
        context.close()
        browser.close()
    # Use Newspaper4k to parse the page content
    article = newspaper.article(url, input_html=content, language='en')
    return article


# Example URL
url = 'https://www.skysports.com/football/live-blog/11661/12476234/transfer-centre-live-luis-guilherme-joao-palhinha-jadon-sancho-latest'

# Fetch and process the article
article = accept_cookies_and_fetch_article(url)
article.nlp()  # Populates article.summary and article.keywords

print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Publication Date: {article.publish_date}")
print(f"Summary: {article.summary}")
print(f"Text: {article.text}")
print(f"Keywords: {article.keywords}")
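If the consent click (or a login) should not be repeated on every run, Playwright can save a context's cookies and local storage to a file and load them into a later context. The following is a minimal sketch of that idea, not something tested against the page above. The helper names (open_context, save_context_state) and the state file path are made up for illustration.

```python
from pathlib import Path


def open_context(browser, state_file: Path):
    """Create a browser context, restoring saved cookies/local storage
    from state_file when that file already exists."""
    if state_file.exists():
        return browser.new_context(storage_state=str(state_file))
    return browser.new_context()


def save_context_state(context, state_file: Path) -> None:
    """Write the context's cookies and local storage to state_file."""
    context.storage_state(path=str(state_file))
```

Inside the "with sync_playwright() as p:" block, the context would come from open_context(browser, Path("state.json")) instead of browser.new_context(), and save_context_state(context, Path("state.json")) would be called after the consent button has been clicked. On later runs the dialog may no longer appear, so the click would then need to be skipped or guarded.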