crawl4ai
Browser is not supported: Select playwright browser="firefox"
How can I select a browser that is supported by the target web site?
Running this test in crawl4ai produces the incorrect login screen:
async def main():
    crawler_strategy = AsyncPlaywrightCrawlerStrategy(
        verbose=True,
        headless=HEADLESS,  # this is the only place where headless works...
        user_agent=USER_AGENT
    )
    # crawler_strategy.set_hook('on_browser_created', on_browser_created)
    async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
        result = await crawler.arun(
            headless=HEADLESS,
            browser="firefox",  # not working...
            user_agent=USER_AGENT,
            url='https://x.com/home',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                # provider="openai/gpt-4o",
                provider="openai/llama3.2",
                # provider="openai/Meta-Llama-3-8B-Instruct-GGUF",
                base_url="http://192.168.1.107:1234/v1",
                # base_url="http://localhost:1234/v1",
                verbose=True,
                # graph_config=graph_config,
                # api_token=os.getenv('OPENAI_API_KEY'),
                api_token=api_token,
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""
                From the crawled content, extract tweets on the user's timeline.
                One extracted model JSON format should look like this:
                {"title": "GPT-4", "text": "this is a tweet", "url": "https://x.com/tweet/link"}.
                """
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
The browser that launches is Chromium, not Firefox.
Running this test in Playwright produces the correct login screen:
from playwright.sync_api import Page, expect

def test_twitter_login_page(page: Page):
    page.goto("https://x.com/")
    # Save a screenshot, then expect a "Log in" button to be visible.
    page.screenshot(path="screenshot.png")
    expect(page.get_by_role("button", name="Log in")).to_be_visible()
pytest --browser=firefox
... and the screenshot shows the login page.
So far I have found that the browser used by the async crawler strategy is hard-coded:
https://github.com/unclecode/crawl4ai/blob/4750810a67aba2b257a8c8a6d234d2cf397bd025/crawl4ai/async_crawler_strategy.py#L93
I edited this line to use Firefox. However, for some reason x.com still displays 'unsupported browser'...
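For reference, a minimal sketch of selecting the browser by name instead of hard-coding it. Playwright exposes its three engines as the attributes `chromium`, `firefox` and `webkit`, so a validated name can be looked up with `getattr`. The names `SUPPORTED_BROWSERS`, `resolve_browser_name` and `launch_browser` are hypothetical and are not part of crawl4ai; this is not necessarily how the library implements it.

```python
from typing import Tuple

# The three engines Playwright ships with.
SUPPORTED_BROWSERS: Tuple[str, ...] = ("chromium", "firefox", "webkit")


def resolve_browser_name(name: str = "chromium") -> str:
    """Normalize a browser name and reject anything Playwright does not provide."""
    normalized = name.strip().lower()
    if normalized not in SUPPORTED_BROWSERS:
        raise ValueError(
            f"Unsupported browser {name!r}; expected one of {SUPPORTED_BROWSERS}"
        )
    return normalized


async def launch_browser(browser_type: str = "chromium", headless: bool = True):
    """Start Playwright and launch the requested engine (hypothetical helper).

    The caller is responsible for calling browser.close() and playwright.stop().
    """
    # Imported lazily so the name validation above works without Playwright installed.
    from playwright.async_api import async_playwright

    name = resolve_browser_name(browser_type)
    playwright = await async_playwright().start()
    # playwright.chromium / playwright.firefox / playwright.webkit are BrowserType objects.
    browser = await getattr(playwright, name).launch(headless=headless)
    return playwright, browser
```

Validating the name before the lookup keeps a typo such as "fire fox" from surfacing as an opaque `AttributeError`.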
@dcolley Thank you for your interest and for raising a very good point. We have already added the ability to set the browser type, which you can pass directly to the web crawler. We plan to release this in version 0.3.6 tomorrow. Here is some sample code you can try; set headless to False if you want to watch the browser.
import asyncio
import time
from crawl4ai import AsyncWebCrawler

async def main():
    # Use Firefox
    start = time.time()
    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless=True) as crawler:
        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

    # Use WebKit
    start = time.time()
    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless=True) as crawler:
        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

    # Use Chromium (default)
    start = time.time()
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

if __name__ == "__main__":
    asyncio.run(main())
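The three blocks above differ only in the browser type, so the comparison could also be written as a loop. This is a rough sketch that assumes the `browser_type`, `verbose` and `headless` arguments behave as in the sample above; `crawl_preview` and `time_crawls` are hypothetical helper names, not part of crawl4ai.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Iterable

BROWSER_TYPES = ("chromium", "firefox", "webkit")


async def crawl_preview(browser_type: str, url: str = "https://www.example.com") -> str:
    """Crawl one URL with the given browser and return the first 500 characters of markdown."""
    # Imported lazily so the timing helper below can be used without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler(browser_type=browser_type, verbose=True, headless=True) as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)
    return result.markdown[:500]


async def time_crawls(
    crawl: Callable[[str], Awaitable[object]],
    browser_types: Iterable[str] = BROWSER_TYPES,
) -> Dict[str, float]:
    """Run `crawl` once per browser type, sequentially, and record elapsed seconds."""
    timings: Dict[str, float] = {}
    for name in browser_types:
        start = time.perf_counter()
        await crawl(name)
        timings[name] = time.perf_counter() - start
    return timings
```

Calling `asyncio.run(time_crawls(crawl_preview))` would then return a dictionary of elapsed seconds keyed by browser type. Passing the crawl function in as an argument keeps the timing loop independent of crawl4ai, which makes it easy to try with a stub first.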