
Passing arguments to hook to perform basic auth

Open luisferreira93 opened this issue 1 year ago • 3 comments

Hello! I am working on a solution where I use Scrapy to crawl through several levels of a website and Crawl4AI to extract the content. I currently need to support basic authentication and am trying a solution with hooks (I already found something similar in the issues section here). I have this hook, which should receive a username and password as parameters:

import base64

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser, BrowserContext

async def before_goto(page, **kwargs):
    # kwargs might contain original_url, session_id, etc.
    # Build the Basic auth value from the username and password passed in via kwargs
    credentials = base64.b64encode(
        f"{kwargs['username']}:{kwargs['password']}".encode()
    ).decode()
    await page.set_extra_http_headers({'Authorization': f'Basic {credentials}'})

And this is the code that calls the hook:

import base64
from typing import Any, AsyncIterator
from urllib.parse import urlparse

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlResult
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from scrapy.crawler import Crawler
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from playwright.async_api import Page, Browser, BrowserContext

from common.connector import IndexableContent
from connectors.webcrawler.args import WebCrawlerConnectorArgs
from connectors.webcrawler.basic_auth import BasicAuth
from connectors.webcrawler.scrapy_webcrawler.scrapy_webcrawler.hooks import before_goto


class WebCrawlerSpider(CrawlSpider):
    name = "webcrawler"

    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def __init__(
        self,
        connector_args: WebCrawlerConnectorArgs,
        documents: list[IndexableContent],
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        # If start_urls empty, quit (?)
        self.start_urls = (
            connector_args.urls
            if connector_args.urls
            else []
        )
        self.allowed_domains = self.extract_domains(connector_args.urls)
        self.documents: list[IndexableContent] = documents
        self.authentication = connector_args.authentication
        if isinstance(self.authentication, BasicAuth):
            self.http_user = self.authentication.username
            self.http_auth_domain = None
            self.http_pass = self.authentication.password

    @classmethod
    def from_crawler(
        cls,
        crawler: Crawler,
        connector_args: WebCrawlerConnectorArgs,
        documents: list[IndexableContent],
        *args: Any,
        **kwargs: Any,
    ) -> "WebCrawlerSpider":
        crawler.settings.set("CRAWLSPIDER_FOLLOW_LINKS", connector_args.crawl_depth > 0)
        spider = super().from_crawler(
            crawler, connector_args, documents, *args, **kwargs
        )
        spider.settings.set(
            "DEPTH_LIMIT",
            connector_args.crawl_depth,
            priority="spider",
        )
        spider.settings.set(
            "ITEM_PIPELINES",
            {
                "scrapy_webcrawler.scrapy_webcrawler.pipelines.ScrapyWebcrawlerPipeline": 300,
            },
        )
        spider.settings.set(
            "TWISTED_REACTOR",
            "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        )
        return spider

    def extract_domains(self, urls: list[str]) -> list[str]:
        """
        Extracts domains from a list of URLs.

        Args:
            urls (list[str]): A list of URLs.

        Returns:
            list[str]: A list of domains extracted from the URLs.
        """
        domains = []
        for url in urls:
            parsed_url = urlparse(url)
            if parsed_url.netloc:
                domain = parsed_url.netloc
                # Remove 'www.' only if it is at the start of the domain
                if domain.startswith("www."):
                    domain = domain[4:]
                domains.append(domain)
        return domains

    def parse_start_url(self, response):
        return self.parse_item(response)

    async def parse_item(self, response) -> AsyncIterator[IndexableContent]:
        # Extract to method
        # We need to put a domain here, of the basic auth website, with multiple start_urls this becomes confusing
        crawl_result = await self.process_url(response.url)
        document = IndexableContent(
            identifier="id1",
            content=str(crawl_result.markdown),
            title="title",
            url=response.url,
            metadata={},
        )
        self.documents.append(document)
        yield document

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url)

    async def process_url(self, url) -> CrawlResult:
        crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
        crawler_strategy.set_hook('before_goto', before_goto)
        # How do I obtain `page` here so I can pass the username and password?
        # await crawler_strategy.execute_hook('before_goto', page, username=..., password=...)
        async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
            return await crawler.arun(
                url=url,
                cache_mode=CacheMode.BYPASS,
                exclude_external_links=True,
                exclude_social_media_links=True,
            )

The main problem here is that I don't know how to obtain the page parameter for the hook. Can you help me with this? Also, is this the correct way to support basic auth? Thank you in advance!

luisferreira93 avatar Dec 19 '24 16:12 luisferreira93

@luisferreira93 Thanks for using Crawl4AI. I have a few things to explain that should make this easier for you. Before that, I want to let you know that we will release our scraping module very soon; it is under review and should make this kind of scraping much more efficient, so I definitely suggest using it once it's out. Now, back to your questions; I will add some explanations and show you some code examples for clarity.

Let me address your questions and suggest some improvements to make your code more efficient:

  1. Hook Selection: Instead of using before_goto, I recommend using the on_page_context_created hook for setting authentication headers. This hook is more appropriate as it's called right after a new page context is created, ensuring your headers are set up properly.

  2. Browser Instance Management: Currently, you're creating a new crawler instance for each URL. This is inefficient as it involves starting and stopping the browser repeatedly. Let's improve this by creating the crawler once and reusing it.

Here's an improved version of your code:

class WebCrawlerSpider(CrawlSpider):
    def __init__(self, connector_args, documents, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = connector_args.urls if connector_args.urls else []
        self.allowed_domains = self.extract_domains(connector_args.urls)
        self.documents = documents
        self.authentication = connector_args.authentication
        
        # Set up the crawler strategy with authentication
        async def on_page_context_created(page, **kwargs):
            if isinstance(self.authentication, BasicAuth):
                credentials = base64.b64encode(
                    f"{self.authentication.username}:{self.authentication.password}".encode()
                ).decode()
                await page.set_extra_http_headers({
                    'Authorization': f'Basic {credentials}'
                })

        self.crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
        self.crawler_strategy.set_hook('on_page_context_created', on_page_context_created)
        self.crawler = AsyncWebCrawler(
            verbose=True, 
            crawler_strategy=self.crawler_strategy
        )
        
    async def spider_opened(self):
        """Initialize crawler when spider starts"""
        await self.crawler.start()
        
    async def spider_closed(self):
        """Clean up crawler when spider finishes"""
        await self.crawler.close()

    async def process_url(self, url) -> CrawlResult:
        return await self.crawler.arun(
            url=url,
            cache_mode=CacheMode.BYPASS,
            exclude_external_links=True,
            exclude_social_media_links=True,
        )

Key improvements in this code:

  1. Better Hook: Using on_page_context_created instead of before_goto ensures headers are set immediately after a page context is created.

  2. Efficient Browser Management: The crawler is created once in __init__ and managed through spider_opened and spider_closed. This prevents the overhead of creating/destroying browser instances for each URL.

  3. Clean Authentication: The authentication logic is encapsulated in the hook function, making it cleaner and more maintainable.

To use this code, you don't need to manually execute the hook or worry about the page parameter - the crawler strategy will handle that for you. The hook will be called automatically with the correct page instance whenever a new page context is created.

For example usage with explicit lifecycle management:

# Initialize the spider
spider = WebCrawlerSpider(connector_args, documents)

# Start the crawler
await spider.spider_opened()

try:
    # Process URLs
    for url in spider.start_urls:
        result = await spider.process_url(url)
        # Handle result...
finally:
    # Clean up
    await spider.spider_closed()

This approach is much more efficient as it:

  • Reuses the browser instance across multiple URLs
  • Properly manages resources
  • Handles authentication consistently
  • Integrates well with Scrapy's lifecycle (see the signal-wiring sketch below)
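
For example, one way to wire those lifecycle methods into Scrapy is through its signals API. This is only a sketch, assuming Scrapy ≥ 2.0 with the asyncio reactor (where coroutine signal handlers are supported); merge it with your existing from_crawler:

from scrapy import signals
from scrapy.spiders import CrawlSpider

class WebCrawlerSpider(CrawlSpider):
    name = "webcrawler"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Connect the async lifecycle methods to Scrapy's spider signals so the
        # browser starts and shuts down together with the spider
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider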

Let me know if you need any clarification or have questions about implementing these improvements!

unclecode avatar Dec 25 '24 11:12 unclecode

Hello @unclecode, thank you for your help. It has been valuable. We can close this 🙏🏻

luisferreira93 avatar Jan 09 '25 11:01 luisferreira93

You're welcome

unclecode avatar Jan 10 '25 12:01 unclecode

Hi @unclecode,

Thank you for your support. @luisferreira93 and I have tested your solution with the latest version of Crawl4AI (0.4.247) using this test page, and we are getting the following net::ERR_INVALID_AUTH_CREDENTIALS error:

[ERROR]... × https://testpages.eviltester.com/styled/auth/basic... | Error: 
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 1205 in _crawl_web (.venv/lib/python3.12/site-                               │
│ packages/crawl4ai/async_crawler_strategy.py):                                                                         │
│   Error: Failed on navigating ACS-GOTO:                                                                               │
│   Page.goto: net::ERR_INVALID_AUTH_CREDENTIALS at https://testpages.eviltester.com/styled/auth/basic-auth-            │
│ results.html                                                                                                          │
│   Call log:                                                                                                           │
│   - navigating to "https://testpages.eviltester.com/styled/auth/basic-auth-results.html", waiting until               │
│ "domcontentloaded"                                                                                                    │
│                                                                                                                       │
│                                                                                                                       │
│   Code context:                                                                                                       │
│   1200                                                                                                                │
│   1201                       response = await page.goto(                                                              │
│   1202                           url, wait_until=config.wait_until, timeout=config.page_timeout                       │
│   1203                       )                                                                                        │
│   1204                   except Error as e:                                                                           │
│   1205 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")                          │
│   1206                                                                                                                │
│   1207                   await self.execute_hook("after_goto", page, context=context, url=url, response=response)     │
│   1208                                                                                                                │
│   1209                   if response is None:                                                                         │
│   1210                       status_code = 200                                                                        │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

I've created this reproducer project that replicates the issue.

We believe this is caused by this Content-Security-Policy header, as the request succeeds when we remove it.

jl-martins avatar Jan 16 '25 16:01 jl-martins

We've worked around the error that I've mentioned in the previous comment by setting the Authorization header in a Playwright route, as exemplified here.
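
For anyone landing here later, a minimal sketch of what such a route-based workaround can look like inside a Crawl4AI hook (the handler name and hard-coded credentials are illustrative, not the exact code from the linked example):

import base64

from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

credentials = base64.b64encode(b"username:password").decode()

async def add_basic_auth(route, request):
    # Re-issue each request with the Authorization header attached
    headers = {**request.headers, "Authorization": f"Basic {credentials}"}
    await route.continue_(headers=headers)

async def on_page_context_created(page, **kwargs):
    # Register the route before any navigation happens on this page
    await page.route("**/*", add_basic_auth)

crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
crawler_strategy.set_hook("on_page_context_created", on_page_context_created)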

jl-martins avatar Jan 16 '25 20:01 jl-martins

@jl-martins That's correct. I was going to suggest using hooks to set whatever you want in the header, but I noticed you have already done it here, and you've done it very well. I might take your code, modify it as an example, and add it to our documentation. Additionally, you can pass the header another way when you are crawling; you don't necessarily need to use a hook solely for this. You can set BrowserConfig.headers, which accepts a dictionary containing the headers you want set before navigating to the URL (like {"Authorization": f"Basic {credentials}"}). Moreover, in the upcoming version, you will be able to pass a list of rules (pattern: str, Callable) to set as the router for the page object. Anyway, thanks for the detailed update.
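
For reference, here is a minimal sketch of the BrowserConfig.headers approach (the test URL comes from the earlier comments; treat the exact parameter names as assumptions and verify them against your installed Crawl4AI version):

import asyncio
import base64

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    credentials = base64.b64encode(b"username:password").decode()
    # Headers set here apply to every page the browser opens
    browser_config = BrowserConfig(headers={"Authorization": f"Basic {credentials}"})

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://testpages.eviltester.com/styled/auth/basic-auth-results.html",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print(result.markdown)

asyncio.run(main())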

unclecode avatar Jan 17 '25 13:01 unclecode