crawl4ai
Regarding scraping of dynamic websites like Skyscanner.net
I was trying to scrape content from Skyscanner.net with the fields Origin, Destination, Price, Departure time, and Arrival time, but it gives the error below:
```
Please provide the following travel details:
Departure Airport (e.g., JFK): DEL
Date of Departure (YYYY-MM-DD): 2024-12-12
Hour of Departure (24-hour format, e.g., 14:00): 16:05
Destination Airport (e.g., LAX): BLR
Details saved to CSV file successfully.
[INIT].... ✓ Crawl4AI 0.4.1
[ERROR]... × https://www.skyscanner.co.in/transport/flights/del... | Error:
× async_crawler_strategy.py:_crawl_web(): Timeout 30000ms exceeded.
=========================== logs ===========================
"load" event fired
============================================================

Failed to crawl the URL: https://www.skyscanner.co.in/transport/flights/del/blr/241212/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&inboundaltsenabled=false&infants=0&outboundaltsenabled=false&preferdirects=false&ref=home&rtn=0
Error: async_crawler_strategy.py:_crawl_web(): Timeout 30000ms exceeded.
=========================== logs ===========================
"load" event fired
```
How can we fix this so that it runs seamlessly? Also, there is a "Show more results" button that loads the remaining data. How can we extract all of the data present on the website using Crawl4AI?
+1
Hi @Shuaib11-Github (and anyone else facing similar issues),
The problem you're encountering with Skyscanner and similar dynamic websites is that they employ strong anti-bot and anti-scraping measures. When you try to load the page programmatically, you might pass initial checks like a random user agent, but the website can still detect that it's not a real browser session or a genuine user. As a result, you hit a "bot detection" wall.
I've attached images below to illustrate what happens:

- Bot Detection Screen: Initially, you may see a challenge page or some form of verification step.
- Passing the Detection: If you use a managed browser session and interact with the site as a real browser would, you can get past this stage. The browser retains your state, cookies, and other identifying factors, so once you pass the verification step, subsequent crawls from the same user directory are recognized as a genuine session.
- Success & Extracted Data: After successfully bypassing detection, Crawl4AI can extract the page content as intended.
Because scenarios like this are common, I'm adding this explanation as a reference tutorial. This way, whenever someone encounters a similar problem, they can refer back to these steps and examples.
Tutorial: Dealing with Anti-Bot Measures
Many modern sites, especially those dealing with travel, e-commerce, or finance, have robust anti-bot systems. They detect non-human browsing patterns and headless browsers. While setting a random user agent often works for simpler pages, you may need a more advanced approach for tougher sites.
Key Strategies:

1. First Step: User Agent Randomization
Before delving into managed browsers, try the simplest approach first:
- Set `user_agent_mode="random"` in `BrowserConfig`.
- Run your crawl to see if the site lets you through without additional measures.

If this step doesn't work and you still encounter bot detection or challenges, proceed to the more robust solution using a managed browser and persistent user data. A minimal sketch of this first step is shown below.
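For reference, here's a minimal sketch of that first attempt, using only the config options discussed in this thread (the URL is the one from this issue):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Simplest first attempt: only randomize the user agent
    browser_config = BrowserConfig(
        headless=True,
        user_agent_mode="random",
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.skyscanner.co.in/transport/flights/del/",
            config=crawl_config,
        )
        # If this still hits bot detection, move on to the managed-browser steps below
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```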
2. Use a Managed Browser
By enabling `use_managed_browser` in `BrowserConfig`, you're effectively launching a full browser instance with persistent user data. This lets the site identify you as a returning user rather than a fresh "bot" each time. For example, you might do:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser
    browser_config = BrowserConfig(
        headless=False,  # Set to False so you can see what's happening
        verbose=True,
        user_agent_mode="random",
        use_managed_browser=True,  # Enables persistent browser sessions
        browser_type="chromium",
        user_data_dir="/path/to/your_chrome_user_data",
    )

    # Set crawl configuration
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.skyscanner.co.in/transport/flights/del/",
            config=crawl_config,
        )
        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))

if __name__ == "__main__":
    asyncio.run(main())
```
3. First Run: Pass the Challenge Manually
The first time you run it, keep `headless=False` so you can see the browser. If the website shows a CAPTCHA or challenge, solve it manually in the opened browser window. Once done, that session (stored in `user_data_dir`) will "remember" that you've passed the challenge.
4. Subsequent Crawls: Automatic Access
On future runs, you can enable `headless=True` since the site now recognizes your browser session. This gives you fully automated extraction without the bot detection popping up every time.
In Summary:
- Basic pages: Try `headless=True` with a random user agent (default config).
- Tough anti-bot pages: Use a managed browser with a user data directory and interact with the site once manually.
- After passing the initial verification step, you can crawl the site as if you were a regular user, allowing you to gather all the data you need.
This approach makes Crawl4AI much more versatile, enabling you to tackle even heavily protected sites.
So magic mode doesn't currently work in cases like this?
@unclecode Got the below when I ran the code:
```
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
TypeError: crawl4ai.async_crawler_strategy.AsyncPlaywrightCrawlerStrategy() got multiple values for keyword argument 'browser_config'
```
@Shuaib11-Github My bad! In the `AsyncWebCrawler` constructor it should be `config=...`, not `browser_config=...`. I've edited it now!
@blghtr I'll add this to magic mode as well. When you set `magic=True`, it will switch to a managed browser, create a temporary user directory, set a random user agent, and then, once everything is done, either remove the directory or reuse it later.
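For reference, magic mode today is just a flag on the run config; a minimal sketch of invoking it (the URL is a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # magic=True enables Crawl4AI's built-in anti-bot evasion heuristics
    config = CrawlerRunConfig(magic=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```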
@unclecode With `headless=False`, I got the below:
```
[INIT].... ✓ Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.96s
[SCRAPE].. ✓ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 23ms
[COMPLETE] ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 2.00s
Raw Markdown Length: 371
Citations Markdown Length: 371
[INFO].... ℹ Browser process terminated normally | Code: 1
```
When I changed to `headless=True`, I got the below:
```
[INIT].... ✓ Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.27s
[SCRAPE].. ✓ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 9ms
[COMPLETE] ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 1.28s
Raw Markdown Length: 371
Citations Markdown Length: 371
```
How can I extract the flight details and save them in some format? At least if I can store the details as Markdown, I can then convert them to a CSV file. But I need only the data for the respective flights matching the user's input to be extracted.
@Shuaib11-Github Look at the following code:
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Configure the browser
    browser_config = BrowserConfig(
        headless=False,  # Set to False so you can see what's happening
        verbose=True,
        user_agent_mode="random",
        use_managed_browser=True,  # Enables persistent browser sessions
        browser_type="chromium",
        user_data_dir="/Users/unclecode/.user_data_dir",
    )

    schema = {
        "name": "Skyscanner Place Cards",
        "baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
        "fields": [
            {
                "name": "city_name",
                "selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
                "type": "text",
            },
            {
                "name": "country_name",
                "selector": "span[class*='PlaceCard_subName__']",
                "type": "text",
            },
            {
                "name": "description",
                "selector": "span[class*='PlaceCard_advertLabel__']",
                "type": "text",
            },
            {
                "name": "flight_price",
                "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
                "type": "text",
            },
            {
                "name": "flight_type",
                "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--body-default__",
                "type": "text",
            },
            {
                "name": "flight_url",
                "selector": "a[data-testid='flights-link']",
                "type": "attribute",
                "attribute": "href",
            },
            {
                "name": "hotels_url",
                "selector": "a[data-testid='hotels-link']",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    # Set crawl configuration
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div[class^='PlaceCard_descriptionContainer__']",
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.skyscanner.co.in/transport/flights/del/",
            config=crawl_config,
        )
        if result.success:
            companies = json.loads(result.extracted_content)
            print(f"Successfully extracted {len(companies)} companies")
            print(json.dumps(companies[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```
```
[INIT].... ✓ Crawl4AI 0.4.23
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.88s
[SCRAPE].. ✓ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 265ms
[EXTRACT]. ✓ Completed for https://www.skyscanner.co.in/transport/flights/del... | Time: 0.10316416597925127s
[COMPLETE] ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 2.25s
Successfully extracted 9 companies
{
  "country_name": "Saudi Arabia",
  "description": "This land is calling. Step into Saudi, the heart of Arabia.",
  "flight_url": "https://www.skyscanner.co.in/transport/flights/del/ruha/?adultsv2=1&cabinclass=economy&childrenv2=&ref=home&rtn=0&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&oym=2501",
  "hotels_url": "/transport/flights/del/ruha/?adultsv2=1&cabinclass=economy&childrenv2=&ref=home&rtn=0&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&oym=2501&hotelsselected=true"
}
[INFO].... ℹ Browser process terminated normally | Code: 0
```
Just pay attention to something very important about the first run. When I pass a new user data directory, I set a breakpoint (for example, on the line that checks whether the result is successful). With `headless=False` the code waits, and I can see the browser asking me to prove I am human. I complete the verification, and once it's approved, the page displays. Then I stop the whole process and run the code again; from that point on, because it reuses the directory I created, which now contains my "human" session state, it works reliably.
As you can see, I use `JsonCssExtractionStrategy` here, and I have been able to extract the data in the JSON format you want. It's worth mentioning that I also used `wait_for`; it's a must. You could also use LLM-based extraction, or just store the Markdown. But the key point is understanding how to handle the managed browser.
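If you then want the results on disk, a minimal sketch of dumping the extracted JSON to CSV with only the standard library (the field names come from whatever schema you used above) could be:

```python
import csv
import json

def save_to_csv(extracted_content: str, path: str = "flights.csv") -> None:
    # extracted_content is the JSON string from result.extracted_content
    rows = json.loads(extracted_content)
    if not rows:
        return
    # Collect every field name that appears in any row, in a stable order
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```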
But I need the data in the below format:

```json
[
  { "origin": "DEL", "destination": "BLR", "departure_time": "08:00", "arrival_time": "10:50" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "05:55", "arrival_time": "09:05" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "08:00", "arrival_time": "10:50" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "03:30", "arrival_time": "06:20" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "21:35", "arrival_time": "00:25" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "08:10", "arrival_time": "13:45" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "21:50", "arrival_time": "00:40" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "17:40", "arrival_time": "20:30" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "08:10", "arrival_time": "13:45" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "11:45", "arrival_time": "14:35" }
]
```
And for the entire month or so: the user gives the flight origin, and the code should fetch the origin, destination, departure time, arrival time, and price for the entire month, without failing for any provided input and robust to any input. The results should also be saved locally so I can check whether it is working.
I got the below when I changed to `headless=True` for the second time:
```
[INIT].... ✓ Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[ERROR]... × https://www.skyscanner.co.in/transport/flights/del... | Error:
× Unexpected error in crawl_web at line 899 in crawl_web
  (..\anaconda3\envs\crawl\lib\site-packages\crawl4ai\async_crawler_strategy.py):
  Error: Wait condition failed: Timeout after 60000ms waiting for selector
  'div[class^='PlaceCard_descriptionContainer']'

  Code context:
  894   # Handle wait_for condition
  895   if config.wait_for:
  896       try:
  897           await self.smart_wait(page, config.wait_for, timeout=config.page_timeout)
  898       except Exception as e:
  899 →         raise RuntimeError(f"Wait condition failed: {str(e)}")
  900
  901   # Update image dimensions if needed
  902   if not self.browser_config.text_only:
  903       update_image_dimensions_js = load_js_script("update_image_dimensions")
  904       try:
```
@Shuaib11-Github 1/ Did you start using the managed browser? 2/ Looking at the structure of the data you need, I see that it does not come entirely from the links you provided. Those links are insufficient because they only contain some packages. To obtain your data, you should search for that specific date and time. Here is an example of such a link:
https://www.skyscanner.co.in/transport/flights/del/blr/250101/250201/?adultsv2=1&cabinclass=economy&childrenv2=&inboundaltsenabled=false&outboundaltsenabled=false&preferdirects=false&rtn=1&priceSourceId=&priceTrace=202412151014IDELBLR20250101goibAI%7C202412151014IBLRDEL20250201goib6E&qp_prevCurrency=INR&qp_prevPrice=16287&qp_prevProvider=ins_month
Are you referring to extracting information from this page? If so, this means you build the URL dynamically in your application and then pass it to Crawl4AI for extraction, is that correct?
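For illustration, a sketch of building such a URL from user input (the `build_search_url` helper is hypothetical, and the YYMMDD date encoding is inferred from the example link above, so treat both as assumptions):

```python
from datetime import date

def build_search_url(origin: str, destination: str, outbound: date, inbound: date) -> str:
    # Hypothetical helper: the example link above appears to encode dates
    # as YYMMDD in the path (e.g. 250101 for 2025-01-01)
    fmt = "%y%m%d"
    return (
        "https://www.skyscanner.co.in/transport/flights/"
        f"{origin.lower()}/{destination.lower()}/"
        f"{outbound.strftime(fmt)}/{inbound.strftime(fmt)}/"
        "?adultsv2=1&cabinclass=economy&rtn=1"
    )

# Example: DEL -> BLR, outbound 2025-01-01, return 2025-02-01
print(build_search_url("DEL", "BLR", date(2025, 1, 1), date(2025, 2, 1)))
```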
Basically, the user inputs the flight origin, and based on that, all available flights for that month to different destinations need to be extracted.
So I need the data as below:
Origin, Destination, Departure time, Arrival time, Date, Price
@Shuaib11-Github Ok, I'll work on this this week and share the code with you. I've been a little busy with documentation, please stay tuned.
Ok, thanks for the update.
You're welcome.
Did you try to scrape Skyscanner for the flight details?
@Shuaib11-Github Not yet. As I mentioned earlier this week, I will check it. Your website link has remained open in my browser since that day :D I will definitely check it.
Ok, let me know.
Curious how this was resolved. Could we use CapSolver or a third-party API to handle that aspect for us, instead of magic mode, for this type of use case?
@unclecode Unfortunately, I have problems too.
I decided to try rewriting my scraper, which ran on undetected-playwright without problems, using crawl4ai.
As a result, without a managed browser, most of the content is blocked (Access to XMLHttpRequest at ... has been blocked by CORS policy) even with `magic=True`.
And with a managed browser, nothing happens at all: the browser window opens and execution hangs there (without even trying to open the page).
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def test_allmusic_access():
    # Browser configuration
    browser_config = BrowserConfig(
        headless=False,
        use_managed_browser=True,
        browser_type="chromium",
        user_data_dir="path/to/my/userdata/dir"
    )

    # Basic crawler configuration
    crawler_config = CrawlerRunConfig(
        magic=True
    )

    async with AsyncWebCrawler(
        verbose=True,
        config=browser_config
    ) as crawler:
        result = await crawler.arun(
            url="https://www.allmusic.com/artist/ringo-starr-mn0000217792",
            config=crawler_config
        )

        print("\nAccess test results:")
        print(f"Success: {result.success}")
        print(f"Status code: {result.status_code}")

if __name__ == "__main__":
    asyncio.run(test_allmusic_access())
```