Passing arguments to hook to perform basic auth
Hello! I am working on a solution where I use Scrapy to crawl through several levels of a website and Crawl4AI to extract the content. Currently, I need to support basic authentication, and I am trying a solution with hooks (I already found something similar in the issues section here). I have this hook, which should receive a username and password in its parameters.
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser, BrowserContext

async def before_goto(page, **kwargs):
    # kwargs might contain original_url, session_id, etc.
    # Store original_url somewhere if needed, or print it
    # TODO: build the Base64-encoded "username:password" value for the header
    await page.set_extra_http_headers({'Authorization': ...})
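For context on what that header value should look like: HTTP basic auth just takes the Base64 encoding of "username:password" prefixed with "Basic ". A minimal sketch (the username and password values here are placeholders):

import base64

username, password = "user", "secret"  # placeholder credentials
token = base64.b64encode(f"{username}:{password}".encode()).decode()
auth_header = {"Authorization": f"Basic {token}"}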
and this is the code that calls the hook:
import base64
from typing import Any, AsyncIterator
from urllib.parse import urlparse
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlResult
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from scrapy.crawler import Crawler
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from playwright.async_api import Page, Browser, BrowserContext
from common.connector import IndexableContent
from connectors.webcrawler.args import WebCrawlerConnectorArgs
from connectors.webcrawler.basic_auth import BasicAuth
from connectors.webcrawler.scrapy_webcrawler.scrapy_webcrawler.hooks import before_goto
class WebCrawlerSpider(CrawlSpider):
    name = "webcrawler"
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def __init__(
        self,
        connector_args: WebCrawlerConnectorArgs,
        documents: list[IndexableContent],
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        # If start_urls is empty, quit (?)
        self.start_urls = connector_args.urls if connector_args.urls else []
        self.allowed_domains = self.extract_domains(connector_args.urls)
        self.documents: list[IndexableContent] = documents
        self.authentication = connector_args.authentication
        if isinstance(self.authentication, BasicAuth):
            self.http_user = self.authentication.username
            self.http_auth_domain = None
            self.http_pass = self.authentication.password
    @classmethod
    def from_crawler(
        cls,
        crawler: Crawler,
        connector_args: WebCrawlerConnectorArgs,
        documents: list[IndexableContent],
        *args: Any,
        **kwargs: Any,
    ) -> "WebCrawlerSpider":
        crawler.settings.set("CRAWLSPIDER_FOLLOW_LINKS", connector_args.crawl_depth > 0)
        spider = super().from_crawler(crawler, connector_args, documents, *args, **kwargs)
        spider.settings.set(
            "DEPTH_LIMIT",
            connector_args.crawl_depth,
            priority="spider",
        )
        spider.settings.set(
            "ITEM_PIPELINES",
            {
                "scrapy_webcrawler.scrapy_webcrawler.pipelines.ScrapyWebcrawlerPipeline": 300,
            },
        )
        spider.settings.set(
            "TWISTED_REACTOR",
            "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        )
        return spider
    def extract_domains(self, urls: list[str]) -> list[str]:
        """
        Extracts domains from a list of URLs.

        Args:
            urls (list[str]): A list of URLs.

        Returns:
            list[str]: A list of domains extracted from the URLs.
        """
        domains = []
        for url in urls:
            parsed_url = urlparse(url)
            if parsed_url.netloc:
                domain = parsed_url.netloc
                # Remove 'www.' only if it is at the start of the domain
                if domain.startswith("www."):
                    domain = domain[4:]
                domains.append(domain)
        return domains

    def parse_start_url(self, response):
        return self.parse_item(response)
    async def parse_item(self, response) -> AsyncIterator[IndexableContent]:
        # Extract to method
        # We need to put a domain here, of the basic auth website; with multiple start_urls this becomes confusing
        crawl_result = await self.process_url(response.url)
        document = IndexableContent(
            identifier="id1",
            content=str(crawl_result.markdown),
            title="title",
            url=response.url,
            metadata={},
        )
        self.documents.append(document)
        yield document

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url)
    async def process_url(self, url) -> CrawlResult:
        crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
        crawler_strategy.set_hook('before_goto', before_goto)
        # TODO: pass the username and password to the hook here -- but how do I get the page argument?
        await crawler_strategy.execute_hook('before_goto', ...)
        async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
            return await crawler.arun(
                url=url,
                cache_mode=CacheMode.BYPASS,
                exclude_external_links=True,
                exclude_social_media_links=True,
            )
The main problem is that I don't know how to obtain the page parameter. Can you help me with that? Also, is this the correct way to support basic auth? Thank you in advance.
@luisferreira93 Thanks for using Crawl4AI. I have a few things to explain to make the job easier for you. Before I do, I want to let you know that we will release our scraping module very soon; it is under review and will bring a lot of efficiency gains, so I definitely suggest you use it for scraping. Now, back to your questions: I will add some explanations and show you some code examples for more clarity.
Let me address your questions and suggest some improvements to make your code more efficient:
- Hook Selection: Instead of using before_goto, I recommend using the on_page_context_created hook for setting authentication headers. This hook is more appropriate, as it's called right after a new page context is created, ensuring your headers are set up properly.
- Browser Instance Management: Currently, you're creating a new crawler instance for each URL. This is inefficient, as it involves starting and stopping the browser repeatedly. Let's improve this by creating the crawler once and reusing it.
Here's an improved version of your code:
class WebCrawlerSpider(CrawlSpider):
    def __init__(self, connector_args, documents, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = connector_args.urls if connector_args.urls else []
        self.allowed_domains = self.extract_domains(connector_args.urls)
        self.documents = documents
        self.authentication = connector_args.authentication

        # Set up the crawler strategy with authentication
        async def on_page_context_created(page, **kwargs):
            if isinstance(self.authentication, BasicAuth):
                credentials = base64.b64encode(
                    f"{self.authentication.username}:{self.authentication.password}".encode()
                ).decode()
                await page.set_extra_http_headers({
                    'Authorization': f'Basic {credentials}'
                })

        self.crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
        self.crawler_strategy.set_hook('on_page_context_created', on_page_context_created)
        self.crawler = AsyncWebCrawler(
            verbose=True,
            crawler_strategy=self.crawler_strategy
        )

    async def spider_opened(self):
        """Initialize the crawler when the spider starts."""
        await self.crawler.start()

    async def spider_closed(self):
        """Clean up the crawler when the spider finishes."""
        await self.crawler.close()

    async def process_url(self, url) -> CrawlResult:
        return await self.crawler.arun(
            url=url,
            cache_mode=CacheMode.BYPASS,
            exclude_external_links=True,
            exclude_social_media_links=True,
        )
Key improvements in this code:
- Better Hook: Using on_page_context_created instead of before_goto ensures headers are set immediately after a page context is created.
- Efficient Browser Management: The crawler is created once in __init__ and managed through spider_opened and spider_closed. This prevents the overhead of creating/destroying browser instances for each URL.
- Clean Authentication: The authentication logic is encapsulated in the hook function, making it cleaner and more maintainable.
To use this code, you don't need to manually execute the hook or worry about the page parameter - the crawler strategy will handle that for you. The hook will be called automatically with the correct page instance whenever a new page context is created.
For example usage with explicit lifecycle management:
# Initialize the spider
spider = WebCrawlerSpider(connector_args, documents)

# Start the crawler
await spider.spider_opened()

try:
    # Process URLs
    for url in spider.start_urls:
        result = await spider.process_url(url)
        # Handle result...
finally:
    # Clean up
    await spider.spider_closed()
This approach is much more efficient as it:
- Reuses the browser instance across multiple URLs
- Properly manages resources
- Handles authentication consistently
- Integrates well with Scrapy's lifecycle (see the signal-wiring sketch below)
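As a rough sketch of that lifecycle integration (assuming a Scrapy version whose spider_opened/spider_closed signals accept coroutine handlers and that the asyncio reactor is enabled), the async helpers could be connected to Scrapy's signals in from_crawler:

from scrapy import signals

class WebCrawlerSpider(CrawlSpider):
    # ... __init__, spider_opened, spider_closed as above ...

    @classmethod
    def from_crawler(cls, crawler, connector_args, documents, *args, **kwargs):
        spider = super().from_crawler(crawler, connector_args, documents, *args, **kwargs)
        # Run the async start/stop helpers on Scrapy's spider lifecycle signals
        # instead of calling them by hand.
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider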
Let me know if you need any clarification or have questions about implementing these improvements!
Hello @unclecode, thank you for your help. It has been valuable. We can close this 🙏🏻
You're welcome
Hi @unclecode,
Thank you for your support. @luisferreira93 and I have tested your solution in the latest version of Crawl4AI (0.4.247) using this test page, and we are getting the following net::ERR_INVALID_AUTH_CREDENTIALS error:
[ERROR]... × https://testpages.eviltester.com/styled/auth/basic... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 1205 in _crawl_web (.venv/lib/python3.12/site- │
│ packages/crawl4ai/async_crawler_strategy.py): │
│ Error: Failed on navigating ACS-GOTO: │
│ Page.goto: net::ERR_INVALID_AUTH_CREDENTIALS at https://testpages.eviltester.com/styled/auth/basic-auth- │
│ results.html │
│ Call log: │
│ - navigating to "https://testpages.eviltester.com/styled/auth/basic-auth-results.html", waiting until │
│ "domcontentloaded" │
│ │
│ │
│ Code context: │
│ 1200 │
│ 1201 response = await page.goto( │
│ 1202 url, wait_until=config.wait_until, timeout=config.page_timeout │
│ 1203 ) │
│ 1204 except Error as e: │
│ 1205 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}") │
│ 1206 │
│ 1207 await self.execute_hook("after_goto", page, context=context, url=url, response=response) │
│ 1208 │
│ 1209 if response is None: │
│ 1210 status_code = 200 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
I've created this reproducer project that replicates the issue.
We believe this is caused by this Content-Security-Policy header, as the request succeeds when we remove it.
We've worked around the error that I've mentioned in the previous comment by setting the Authorization header in a Playwright route, as exemplified here.
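For anyone landing here, a rough sketch of that kind of route-based workaround, e.g. installed from an on_page_context_created hook (placeholder credentials; this is not the exact code from the reproducer):

import base64

async def install_basic_auth_route(page, **kwargs):
    # Placeholder credentials -- replace with the real username/password.
    credentials = base64.b64encode(b"username:password").decode()

    async def handle(route):
        # Re-issue every request with the Authorization header attached.
        headers = {**route.request.headers, "Authorization": f"Basic {credentials}"}
        await route.continue_(headers=headers)

    await page.route("**/*", handle)

# Registered like the earlier hook:
# crawler_strategy.set_hook('on_page_context_created', install_basic_auth_route)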
@jl-martins That's correct. I was going to suggest using hooks to set whatever you want in the headers, but I noticed you have already done it here, and you've done it very well. I might take your code, modify it into an example, and add it to our documentation. Additionally, you can pass the header in another way when you are crawling; you don't necessarily need a hook solely for this. You can set BrowserConfig.headers, which accepts a dictionary of headers you want set before navigating to the URL (like {"Authorization": f"Basic {credentials}"}). Moreover, in the upcoming new version, you can actually pass a list of rules (pattern: str, Callable) to set as the router for the page object. Anyway, thanks for the detailed update.
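A minimal sketch of that BrowserConfig.headers approach, assuming the config classes behave as described above (the credentials value is a placeholder):

import asyncio
import base64

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    credentials = base64.b64encode(b"username:password").decode()  # placeholder
    browser_config = BrowserConfig(
        headers={"Authorization": f"Basic {credentials}"},
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://testpages.eviltester.com/styled/auth/basic-auth-results.html",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print(result.success, result.status_code)

asyncio.run(main())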