crawl4ai
Regarding scraping of dynamic websites like Skyscanner.net
I was trying to scrape content from Skyscanner.net with the fields Origin, Destination, Price, Departure time, and Arrival time, but it gives the error below:
```
Please provide the following travel details:
Departure Airport (e.g., JFK): DEL
Date of Departure (YYYY-MM-DD): 2024-12-12
Hour of Departure (24-hour format, e.g., 14:00): 16:05
Destination Airport (e.g., LAX): BLR
Details saved to CSV file successfully.
[INIT].... ✓ Crawl4AI 0.4.1
[ERROR]... × https://www.skyscanner.co.in/transport/flights/del... | Error:
× async_crawler_strategy.py:_crawl_web(): Timeout 30000ms exceeded.
=========================== logs ===========================
"load" event fired
============================================================

Failed to crawl the URL: https://www.skyscanner.co.in/transport/flights/del/blr/241212/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&inboundaltsenabled=false&infants=0&outboundaltsenabled=false&preferdirects=false&ref=home&rtn=0
Error: async_crawler_strategy.py:_crawl_web(): Timeout 30000ms exceeded.
=========================== logs ===========================
"load" event fired
```
How can we fix this so that it runs seamlessly? Also, there is a "Show more results" button that loads the remaining data. How can we extract all of the data present on the website using Crawl4AI?
+1
Hi @Shuaib11-Github (and anyone else facing similar issues),
The problem you're encountering with Skyscanner and similar dynamic websites is that they employ strong anti-bot and anti-scraping measures. When you try to load the page programmatically, you might pass initial checks like a random user agent, but the website can still detect that it's not a real browser session or a genuine user. As a result, you hit a "bot detection" wall.
I've attached images below to illustrate what happens:

- Bot Detection Screen: Initially, you may see a challenge page or some form of verification step.
- Passing the Detection: If you use a managed browser session and interact with the site as a real browser would, you can get past this stage. The browser retains your state, cookies, and other identifying factors, so once you pass the verification step, subsequent crawls from the same user directory are recognized as a genuine session.
- Success & Extracted Data: After successfully bypassing detection, Crawl4AI can extract the page content as intended.
Because scenarios like this are common, I'm adding this explanation as a reference tutorial. This way, whenever someone encounters a similar problem, they can refer back to these steps and examples.
Tutorial: Dealing with Anti-Bot Measures
Many modern sites, especially those dealing with travel, e-commerce, or finance, have robust anti-bot systems. They detect non-human browsing patterns and headless browsers. While setting a random user agent often works for simpler pages, you may need a more advanced approach for tougher sites.
Key Strategies:

1. First Step: User Agent Randomization
Before delving into managed browsers, try the simplest approach first:
- Set `user_agent_mode="random"` in `BrowserConfig`.
- Run your crawl to see if the site lets you through without additional measures.

If this step doesn't work and you still encounter bot detection or challenges, proceed to the more robust solution using a managed browser and persistent user data. A minimal sketch of this first step is shown below.
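For reference, here's a minimal sketch of that first attempt, using only the config options discussed in this thread (the URL is the one from this issue):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Simplest first attempt: only randomize the user agent
    browser_config = BrowserConfig(
        headless=True,
        user_agent_mode="random",
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.skyscanner.co.in/transport/flights/del/",
            config=crawl_config,
        )
        # If this still hits bot detection, move on to the managed-browser steps below
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```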
2. Use a Managed Browser
By enabling `use_managed_browser` in `BrowserConfig`, you're effectively launching a full browser instance with persistent user data. This lets the site identify you as a returning user rather than a fresh "bot" each time. For example, you might do:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser
    browser_config = BrowserConfig(
        headless=False,  # Set to False so you can see what's happening
        verbose=True,
        user_agent_mode="random",
        use_managed_browser=True,  # Enables persistent browser sessions
        browser_type="chromium",
        user_data_dir="/path/to/your_chrome_user_data",
    )

    # Set crawl configuration
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.skyscanner.co.in/transport/flights/del/",
            config=crawl_config,
        )
        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))

if __name__ == "__main__":
    asyncio.run(main())
```
3. First Run: Pass the Challenge Manually
The first time you run it, keep `headless=False` so you can see the browser. If the website shows a CAPTCHA or challenge, solve it manually in the opened browser window. Once done, that session (stored in `user_data_dir`) will "remember" that you've passed the challenge.
4. Subsequent Crawls: Automatic Access
On future runs, you can enable `headless=True` since the site now recognizes your browser session. This gives you fully automated extraction without the bot detection popping up every time.
In Summary:
- Basic pages: Try `headless=True` with a random user agent (default config).
- Tough anti-bot pages: Use a managed browser with a user data directory and interact with the site once manually.
- After passing the initial verification step, you can crawl the site as if you were a regular user, allowing you to gather all the data you need.
This approach makes Crawl4AI much more versatile, enabling you to tackle even heavily protected sites.
So magic mode doesn't currently work in cases like this?
@unclecode Got the below when I ran the code:
```
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
TypeError: crawl4ai.async_crawler_strategy.AsyncPlaywrightCrawlerStrategy() got multiple values for keyword argument 'browser_config'
```
@Shuaib11-Github My bad! In the `AsyncWebCrawler` constructor it should be `config=...`, not `browser_config=...`. I've edited it now!
@blghtr I'll add this to magic mode as well. When you set `magic=True`, it will switch to a managed browser, create a temporary user directory, set a random user agent, and then, once everything is done, either remove the directory or reuse it later.
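For reference, magic mode today is just a flag on the run config; a minimal sketch of invoking it (the URL is a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # magic=True enables Crawl4AI's built-in anti-bot evasion heuristics
    config = CrawlerRunConfig(magic=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```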
@unclecode With `headless=False`, I got the below:
```
[INIT].... ✓ Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.96s
[SCRAPE].. ✓ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 23ms
[COMPLETE] ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 2.00s
Raw Markdown Length: 371
Citations Markdown Length: 371
[INFO].... ℹ Browser process terminated normally | Code: 1
```
When I changed to `headless=True`, I got the below:
```
[INIT].... ✓ Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.27s
[SCRAPE].. ✓ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 9ms
[COMPLETE] ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 1.28s
Raw Markdown Length: 371
Citations Markdown Length: 371
```
How can I extract the flight details and save them in some format? At least if I can store the details as Markdown, I can then convert them to a CSV file. But I need only the data for the respective flights matching the user's input to be extracted.
@Shuaib11-Github Look at the following code:
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Configure the browser
    browser_config = BrowserConfig(
        headless=False,  # Set to False so you can see what's happening
        verbose=True,
        user_agent_mode="random",
        use_managed_browser=True,  # Enables persistent browser sessions
        browser_type="chromium",
        user_data_dir="/Users/unclecode/.user_data_dir",
    )

    schema = {
        "name": "Skyscanner Place Cards",
        "baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
        "fields": [
            {
                "name": "city_name",
                "selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
                "type": "text",
            },
            {
                "name": "country_name",
                "selector": "span[class*='PlaceCard_subName__']",
                "type": "text",
            },
            {
                "name": "description",
                "selector": "span[class*='PlaceCard_advertLabel__']",
                "type": "text",
            },
            {
                "name": "flight_price",
                "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
                "type": "text",
            },
            {
                "name": "flight_type",
                "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--body-default__",
                "type": "text",
            },
            {
                "name": "flight_url",
                "selector": "a[data-testid='flights-link']",
                "type": "attribute",
                "attribute": "href",
            },
            {
                "name": "hotels_url",
                "selector": "a[data-testid='hotels-link']",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    # Set crawl configuration
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div[class^='PlaceCard_descriptionContainer__']",
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.skyscanner.co.in/transport/flights/del/",
            config=crawl_config,
        )
        if result.success:
            companies = json.loads(result.extracted_content)
            print(f"Successfully extracted {len(companies)} companies")
            print(json.dumps(companies[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```
```
[INIT].... ✓ Crawl4AI 0.4.23
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.88s
[SCRAPE].. ✓ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 265ms
[EXTRACT]. ✓ Completed for https://www.skyscanner.co.in/transport/flights/del... | Time: 0.10316416597925127s
[COMPLETE] ✓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 2.25s
Successfully extracted 9 companies
{
  "country_name": "Saudi Arabia",
  "description": "This land is calling. Step into Saudi, the heart of Arabia.",
  "flight_url": "https://www.skyscanner.co.in/transport/flights/del/ruha/?adultsv2=1&cabinclass=economy&childrenv2=&ref=home&rtn=0&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&oym=2501",
  "hotels_url": "/transport/flights/del/ruha/?adultsv2=1&cabinclass=economy&childrenv2=&ref=home&rtn=0&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&oym=2501&hotelsselected=true"
}
[INFO].... ℹ Browser process terminated normally | Code: 0
```
Just pay attention to something very important about the first run. When I pass a new user data directory, I set a breakpoint (for example, on the line that checks whether the result is successful). With `headless=False` the code waits, and I can see the browser asking me to prove I am human. I complete the verification, and once it's approved, the page displays. Then I stop the whole process and run the code again; from that point on, because it reuses the directory I created, which now contains my "human" session state, it works reliably.
As you can see, I use `JsonCssExtractionStrategy` here, and I have been able to extract the data in the JSON format you want. It's worth mentioning that I also used `wait_for`; it's a must. You could also use LLM-based extraction, or just store the Markdown. But the key point is understanding how to handle the managed browser.
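If you then want the results on disk, a minimal sketch of dumping the extracted JSON to CSV with only the standard library (the field names come from whatever schema you used above) could be:

```python
import csv
import json

def save_to_csv(extracted_content: str, path: str = "flights.csv") -> None:
    # extracted_content is the JSON string from result.extracted_content
    rows = json.loads(extracted_content)
    if not rows:
        return
    # Collect every field name that appears in any row, in a stable order
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```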
But I need the data in the below format:

```json
[
  { "origin": "DEL", "destination": "BLR", "departure_time": "08:00", "arrival_time": "10:50" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "05:55", "arrival_time": "09:05" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "08:00", "arrival_time": "10:50" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "03:30", "arrival_time": "06:20" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "21:35", "arrival_time": "00:25" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "08:10", "arrival_time": "13:45" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "21:50", "arrival_time": "00:40" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "17:40", "arrival_time": "20:30" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "08:10", "arrival_time": "13:45" },
  { "origin": "DEL", "destination": "BLR", "departure_time": "11:45", "arrival_time": "14:35" }
]
```
And for the entire month or so: the user gives the flight origin, and the code should fetch the origin, destination, departure time, arrival time, and price for the entire month, without failing for any provided input and robust to any input. The results should also be saved locally so I can check whether it is working.
I got the below when I changed to `headless=True` for the second time:
```
[INIT].... ✓ Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[ERROR]... × https://www.skyscanner.co.in/transport/flights/del... | Error:
× Unexpected error in crawl_web at line 899 in crawl_web
  (..\anaconda3\envs\crawl\lib\site-packages\crawl4ai\async_crawler_strategy.py):
  Error: Wait condition failed: Timeout after 60000ms waiting for selector
  'div[class^='PlaceCard_descriptionContainer']'

  Code context:
  894   # Handle wait_for condition
  895   if config.wait_for:
  896       try:
  897           await self.smart_wait(page, config.wait_for, timeout=config.page_timeout)
  898       except Exception as e:
  899 →         raise RuntimeError(f"Wait condition failed: {str(e)}")
  900
  901   # Update image dimensions if needed
  902   if not self.browser_config.text_only:
  903       update_image_dimensions_js = load_js_script("update_image_dimensions")
  904       try:
```
@Shuaib11-Github 1/ Did you start using the managed browser? 2/ Looking at the structure of the data you need, I see that it does not come entirely from the links you provided. Those links are insufficient because they only contain some packages. To obtain your data, you should search for that specific date and time. Here is an example of such a link:
https://www.skyscanner.co.in/transport/flights/del/blr/250101/250201/?adultsv2=1&cabinclass=economy&childrenv2=&inboundaltsenabled=false&outboundaltsenabled=false&preferdirects=false&rtn=1&priceSourceId=&priceTrace=202412151014IDELBLR20250101goibAI%7C202412151014IBLRDEL20250201goib6E&qp_prevCurrency=INR&qp_prevPrice=16287&qp_prevProvider=ins_month
Are you referring to extracting information from this page? If so, this means you build the URL dynamically in your application and then pass it to Crawl4AI for extraction, is that correct?
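For illustration, a sketch of building such a URL from user input (the `build_search_url` helper is hypothetical, and the YYMMDD date encoding is inferred from the example link above, so treat both as assumptions):

```python
from datetime import date

def build_search_url(origin: str, destination: str, outbound: date, inbound: date) -> str:
    # Hypothetical helper: the example link above appears to encode dates
    # as YYMMDD in the path (e.g. 250101 for 2025-01-01)
    fmt = "%y%m%d"
    return (
        "https://www.skyscanner.co.in/transport/flights/"
        f"{origin.lower()}/{destination.lower()}/"
        f"{outbound.strftime(fmt)}/{inbound.strftime(fmt)}/"
        "?adultsv2=1&cabinclass=economy&rtn=1"
    )

# Example: DEL -> BLR, outbound 2025-01-01, return 2025-02-01
print(build_search_url("DEL", "BLR", date(2025, 1, 1), date(2025, 2, 1)))
```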
Basically, the user inputs the flight origin, and based on that, all available flights for that month to different destinations need to be extracted.
So I need the data as below:
Origin, Destination, Departure time, Arrival time, Date, Price
@Shuaib11-Github Ok, I'll work on this this week and share the code with you. I've been a little busy with documentation, please stay tuned.
Ok, thanks for the update.
You're welcome.
Did you try to scrape Skyscanner for the flight details?
@Shuaib11-Github Not yet. As I mentioned earlier this week, I will check it. Your website link has remained open in my browser since that day :D I will definitely check it.
Ok, let me know.
Curious how this was resolved. Could we use CapSolver or a third-party API to handle that aspect for us, instead of magic mode, for this type of use case?
@unclecode Unfortunately, I have problems too.
I decided to try rewriting my scraper, which ran on undetected-playwright without problems, using crawl4ai.
As a result, without a managed browser, most of the content is blocked (Access to XMLHttpRequest at ... has been blocked by CORS policy) even with `magic=True`.
And with a managed browser, nothing happens at all: the browser window opens and execution hangs there (without even trying to open the page).
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def test_allmusic_access():
    # Browser configuration
    browser_config = BrowserConfig(
        headless=False,
        use_managed_browser=True,
        browser_type="chromium",
        user_data_dir="path/to/my/userdata/dir"
    )

    # Basic crawler configuration
    crawler_config = CrawlerRunConfig(
        magic=True
    )

    async with AsyncWebCrawler(
        verbose=True,
        config=browser_config
    ) as crawler:
        result = await crawler.arun(
            url="https://www.allmusic.com/artist/ringo-starr-mn0000217792",
            config=crawler_config
        )

        print("\nAccess test results:")
        print(f"Success: {result.success}")
        print(f"Status code: {result.status_code}")

if __name__ == "__main__":
    asyncio.run(test_allmusic_access())
```