crawl4ai icon indicating copy to clipboard operation
crawl4ai copied to clipboard

Browser is not supported: Select playwright browser="firefox"

Open dcolley opened this issue 1 year ago • 1 comments
trafficstars

How can I select a browser that is supported by the target web site?

Running this test in crawl4ai produces the incorrect login screen:

async def main():
    crawler_strategy = AsyncPlaywrightCrawlerStrategy(
        verbose=True,
        headless=HEADLESS, # this is the only place where headless works...
        user_agent=USER_AGENT
    )
    # crawler_strategy.set_hook('on_browser_created', on_browser_created)

    async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
        result = await crawler.arun(
            headless=HEADLESS,
            browser="firefox", # not working...
            user_agent=USER_AGENT,
            url='https://x.com/home',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                # provider="openai/gpt-4o",
                provider="openai/llama3.2",
                # provider="openai/Meta-Llama-3-8B-Instruct-GGUF",
                base_url="http://192.168.1.107:1234/v1",
                # base_url="http://localhost:1234/v1",
                verbose=True,
                # graph_config=graph_config,
                # api_token=os.getenv('OPENAI_API_KEY'),
                api_token=api_token, 
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""
                From the crawled content, extract tweets on the user's timeline.
                One extracted model JSON format should look like this: 
                {"title": "GPT-4", "text": "this is a tweet", "url": "https://x.com/tweet/link"}.
                """
            ),            
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

The browser is Chromium

image


Running this test in PlayWright produces the correct login screen:

def test_twitter_login_page(page: Page):
    page.goto("https://x.com/")

    # Expect the page to have a heading with the name of "Log in to Twitter".
    page.screenshot(path="screenshot.png")
    expect(page.get_by_role("button", name="Log in")).to_be_visible()
pytest --browser=firefox

... the screen shows a login page: image

dcolley avatar Oct 04 '24 09:10 dcolley

So far I found the async browser is hard-coded.

https://github.com/unclecode/crawl4ai/blob/4750810a67aba2b257a8c8a6d234d2cf397bd025/crawl4ai/async_crawler_strategy.py#L93

Edit this to use firefox. However, for some reason x.com still displays 'unsupported browser'...

dcolley avatar Oct 07 '24 09:10 dcolley

@dcolley Thank you for your interest and a very good point to highlight. We have already added the ability to set the browser type, and then you can simply pass it to your web crawler instantly. We plan to release this new version by tomorrow, version 0.3.6. This is a sample of the code that you can try. You may set headless to False to see the browser.

import asyncio
from crawl4ai import AsyncWebCrawler
import time
async def main():
    # Use Firefox
    start = time.time()
    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

    # Use WebKit
    start = time.time()
    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

    # Use Chromium (default)
    start = time.time()
    async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

if __name__ == "__main__":
    asyncio.run(main())

unclecode avatar Oct 13 '24 10:10 unclecode