newspaper4k Website login functionality

Issue by nup002 Fri Jun 29 20:26:48 2018 Originally opened as https://github.com/codelucas/newspaper/issues/587

This is not so much an issue as a suggestion/request. It came up when I needed to pull articles from a website with a paywall. I have a user account, but could not for the life of me figure out how to get a logged-in session going in Newspaper.

If the solution is simple and I am just too daft to see it, feel free to offer suggestions.

Oct 24 '23 12:10 AndyTheFactory

Comment by njwfish Sun Jul 1 17:49:24 2018

bumping

Comment by nup002 Mon Jul 2 08:14:06 2018

I think I have found a workaround. It simply requires all requests to be made with a Session object. You would first create a Session object, log in to the website with the Session using a tool such as Selenium, and then pass the Session object to Newspaper to be used in all future requests.

Changes needed:

In configuration.py:

Add "session = requests.Session()" as a new parameter.

In network.py:

Add "session = config.session" to function "get_html_2XX_only()".
Replace "response = requests.get" with "response = session.get" in function "get_html_2XX_only()".
Add "self.session = config.session" to class "MRequest", function "__init__()".
Replace "self.resp = requests.get" with "self.resp = self.session.get" in class "MRequest", function "send()".

Once you have logged in with your Session object using Selenium, set the "session" parameter in configuration.py to this object.
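
A minimal sketch of the "log in with Selenium, then reuse the Session" step, assuming the login cookies are all the site needs. The helper name session_from_browser_cookies is hypothetical and not part of Newspaper; it expects a list of dicts shaped like the ones Selenium's driver.get_cookies() returns and copies them into a requests.Session:

```python
import requests


def session_from_browser_cookies(browser_cookies):
    """Copy browser cookies into a new requests.Session.

    `browser_cookies` is assumed to be a list of dicts shaped like the ones
    Selenium's driver.get_cookies() returns: "name" and "value" are required,
    "domain" and "path" are optional.
    """
    session = requests.Session()
    for cookie in browser_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session
```

After logging in through Selenium, something like session = session_from_browser_cookies(driver.get_cookies()) would give a Session that could be assigned to the patched "session" parameter. Sites that rely on short-lived tokens or JavaScript checks may still reject it.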

Comment by chsuong Tue Jul 10 05:09:25 2018

@nup002, what about a website that uses logins and cookies?

Comment by timzhangau Thu Sep 13 12:56:59 2018

I would also like to know of any good practices for using newspaper with websites that require a login, especially with OAuth2 authentication.

Comment by BastianZim Wed Jan 23 15:39:16 2019

In case anyone finds this via Google, check out #668 as well; it has some helpful suggestions.

Comment by karam93 Fri Jan 25 12:34:01 2019

@nup002, could you post an example of how you stay logged in and bypass cookies?

Oct 24 '23 12:10 AndyTheFactory

https://github.com/AndyTheFactory/newspaper4k/issues/668 has not been created yet afaict

Perhaps you mean https://github.com/codelucas/newspaper/issues/668 but I am not sure how that is relevant?

Jun 10 '24 13:06 2dareis2do

I have looked at the examples here https://github.com/johnbumgarner/newspaper3_usage_overview

It looks like they are able to get past pages that require a confirmation step by using a combination of Selenium WebDriver, Newspaper and Beautiful Soup.

Is it possible to pass cookie key and value when making a request with Newspaper 4k?

For example, this page shows a consent confirmation message: https://www.skysports.com/football/live-blog/11661/12476234/transfer-centre-live-luis-guilherme-joao-palhinha-jadon-sancho-latest

Once it is accepted, the following cookies are set:

consentUUID
euconsent-v2

Once cookies are set the confirmation page is no longer shown.

Can newspaper4k pass cookie keys and values along with its requests? Or is there a better way to achieve the same result?
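
One possible approach that needs no changes to Newspaper itself, sketched under assumptions: fetch the page with requests while sending the consent cookies, then hand the HTML to newspaper.article() through input_html. The helper names and the cookie values below are hypothetical placeholders. Real values would have to be copied from a browser session that has already accepted the dialog, and whether the site accepts them when replayed is untested.

```python
import requests


def cookie_header_for(url, cookies):
    """Return the Cookie header requests would send for `url` (no network access)."""
    prepared = requests.Request("GET", url, cookies=cookies).prepare()
    return prepared.headers.get("Cookie", "")


def fetch_article_with_cookies(url, cookies, timeout=10):
    """Download `url` with the given cookies and parse the HTML with Newspaper."""
    import newspaper  # imported lazily so the helper above works without it

    response = requests.get(url, cookies=cookies, timeout=timeout)
    response.raise_for_status()
    # input_html hands Newspaper the HTML we already fetched
    return newspaper.article(url, input_html=response.text, language="en")


# Placeholder values; copy the real ones from a browser that accepted the dialog
consent_cookies = {"consentUUID": "PLACEHOLDER", "euconsent-v2": "PLACEHOLDER"}
print(cookie_header_for("https://www.skysports.com/", consent_cookies))
```

cookie_header_for is only there to check what would be sent before making a real request; fetch_article_with_cookies(url, consent_cookies) performs the actual download and parse.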

Jun 10 '24 16:06 2dareis2do

Just to follow up on this, I was able to follow the Playwright example here https://github.com/AndyTheFactory/newspaper4k/issues/220

Basically, I am using Playwright to click through any cookie compliance prompt while making sure the same context is used when checking for the existence of the relevant content. Note that the button in this case is contained within an iframe.

from playwright.sync_api import sync_playwright
import newspaper

def accept_cookies_and_fetch_article(url):
    # Use Playwright to dismiss the cookie consent dialog and capture the rendered HTML
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # Set headless=False to watch the browser actions

        # Create a new incognito browser context
        context = browser.new_context()
        # Create a new page inside the context
        page = context.new_page()

        page.goto(url)

        # The consent button lives inside an iframe, so locate the frame first
        page.frame_locator('iframe[title="SP Consent Message"]').get_by_label("Essential cookies only").click()

        content = page.content()
        # Dispose of the context once it is no longer needed
        context.close()
        browser.close()

    # Using Newspaper4k to parse the page content
    article = newspaper.article(url, input_html=content, language='en')

    return article

# Example URL
url = 'https://www.skysports.com/football/live-blog/11661/12476234/transfer-centre-live-luis-guilherme-joao-palhinha-jadon-sancho-latest'

# Fetch and process the article
article = accept_cookies_and_fetch_article(url)
article.nlp()
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Publication Date: {article.publish_date}")
print(f"Summary: {article.summary}")
print(f"Text: {article.text}")
print(f"Keywords: {article.keywords}")

Jun 11 '24 15:06 2dareis2do