
I encountered an issue where the parameters were not effective during use: css_selector and excluded_tags had no effect, and the crawl returned the entire page content

Open monkey-wenjun opened this issue 10 months ago • 2 comments

During use, css_selector and excluded_tags had no effect, and the crawl returned the entire page content. Here is my code:


import asyncio
from crawl4ai import AsyncWebCrawler

async def main():

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://doc.youzanyun.com/detail/API/0/323",
            css_selector=".api-detail",
            excluded_tags=['form', 'nav','footer']
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

monkey-wenjun avatar Jan 07 '25 10:01 monkey-wenjun

Did you solve it, brother?

no

monkey-wenjun avatar Jan 07 '25 13:01 monkey-wenjun

I have the same problem and haven't solved it yet. How about you?

duolaOmeng avatar Jan 08 '25 03:01 duolaOmeng

I also ran into the same issue and haven't solved it yet. As a temporary workaround I used a hard-coded technique: I found a start-point and end-point pattern in the output and trimmed the content between them, in case anyone wants it:

import asyncio
import os
import re

from crawl4ai import AsyncWebCrawler

async def main():
    # NOTE: links (list of URLs to scrape) and output_folder are assumed to be defined elsewhere
    async with AsyncWebCrawler() as crawler:
        for url in links:
            try:
                result = await crawler.arun(
                    url=url,
                    css_selector="h1:contains('RPC Method') ~ *:not(#sidebar):not(nav):not(footer):not(.Button-module_button__peGiP)",
                    excluded_tags=[
                        'script', 
                        'style', 
                        'button',
                        'nav',
                        'header',
                        'footer',
                        'aside',
                        'iframe'
                    ],
                    word_count_threshold=1,
                    exclude_external_links=True,
                    exclude_social_media_links=True,
                    remove_overlay_elements=True
                )
                
                content = result.markdown.strip()
                
                # Find the start of content using regex to match "# <anything> RPC Method"
                rpc_method_pattern = r"#\s+[\w]+ RPC Method"
                match = re.search(rpc_method_pattern, content)
                
                if match:
                    content = content[match.start():]
                
                # Find where to cut off the content (after curl example)
                end_markers = [
                    "Don't have an account yet?",
                    "Get started for free",
                    "Previous",
                    "Next",
                    "Chat with our community"
                ]
                
                # Make sure we include the complete curl example
                curl_end = "```"
                if curl_end in content:
                    last_curl_end = content.rindex(curl_end) + len(curl_end)
                    content = content[:last_curl_end]
                
                # Remove any remaining content after the curl example
                for marker in end_markers:
                    if marker in content:
                        content = content[:content.index(marker)].strip()
                
                # Clean up any duplicate headers
                lines = content.split('\n')
                seen_headers = set()
                cleaned_lines = []
                
                for line in lines:
                    if line.startswith('#'):
                        # Only add header if we haven't seen it before
                        header_key = line.lower().strip()
                        if header_key not in seen_headers:
                            seen_headers.add(header_key)
                            cleaned_lines.append(line)
                    else:
                        cleaned_lines.append(line)
                
                content = '\n'.join(cleaned_lines).strip()
                
                # Save to file
                endpoint = url.split('/')[-1]
                filename = os.path.join(output_folder, f"{endpoint}.md")
                
                with open(filename, "w", encoding="utf-8") as file:
                    file.write(content)
                    
                print(f"Successfully scraped and saved: {filename}")
                
            except Exception as e:
                print(f"Error processing {url}: {str(e)}")
                continue

if __name__ == "__main__":
    asyncio.run(main())

jaydobariya8 avatar Jan 08 '25 05:01 jaydobariya8

@monkey-wenjun @Shadow062309 @duolaOmeng Thank you for trying Crawl4ai. In such situations, the first step is to run the crawler with headless set to False to see what is happening. If you do that, you will notice that this website has a random delay at the beginning, presumably while it retrieves the data from its backend server.
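
A minimal sketch of that debugging step, reusing the BrowserConfig and CrawlerRunConfig setup shown further down with only the headless flag flipped (the function name debug_run is just for illustration):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def debug_run():
    # Open a visible browser window so you can watch the page load and
    # see the delay before the .api-detail element is rendered.
    browser_config = BrowserConfig(headless=False)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://doc.youzanyun.com/detail/API/0/323",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(debug_run())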

Because you set a specific CSS selector, you must first ensure that the element exists on the page before the HTML is captured. To do this, you need to use the wait_for parameter. In the following code, once I applied wait_for, everything worked perfectly, because you instruct the crawler to wait for the presence of that element.

So whenever you target specific elements or CSS selectors, make sure to use wait_for, or consider another parameter that allows an extra delay before returning the HTML: delay_before_return_html. That is the general approach. However, if you want to be more precise, wait_for is your solution.
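
As a minimal sketch of that delay-based fallback (the 2-second value is only an illustrative number, not something prescribed above):

from crawl4ai import CrawlerRunConfig, CacheMode

# Fallback: wait a fixed amount of time before capturing the HTML,
# instead of waiting for a specific element to appear.
crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector=".api-detail",
    excluded_tags=['form', 'nav', 'footer'],
    delay_before_return_html=2,  # illustrative value; tune to the site's load time
)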

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    config = BrowserConfig(
        headless=True,
    )

    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            css_selector=".api-detail",
            excluded_tags=['form', 'nav','footer'],
            wait_for="css:.api-detail",
            # delay_before_return_html=2
        )
        result = await crawler.arun(
            url="https://doc.youzanyun.com/detail/API/0/323",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://doc.youzanyun.com/detail/API/0/323... | Status: True | Time: 2.97s
[SCRAPE].. ◆ Processed https://doc.youzanyun.com/detail/API/0/323... | Time: 91ms
[COMPLETE] ● https://doc.youzanyun.com/detail/API/0/323... | Status: True | Total: 3.07s
youzan.user.openid.get.1.0.0
计费
2020-03-18 17:21:14
API名称:获取有赞openId
API描述
API描述

根据userId(有赞账号id)查询有赞openId(注意是有赞openId,非微信openId)

公共参数
...REST OF MARKDOWN

unclecode avatar Jan 08 '25 12:01 unclecode

TypeError: BrowserConfig() takes no arguments

See my code below:

import asyncio

from crawl4ai import BrowserConfig, AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = BrowserConfig(
        headless=True,
    )

    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            css_selector=".api-detail",
            excluded_tags=['form', 'nav','footer'],
            wait_for="css:.api-detail",
            # delay_before_return_html=2
        )
        result = await crawler.arun(
            url="https://tmotions.com/success-stories/autocar/",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Aravind1Kumar avatar Jan 14 '25 12:01 Aravind1Kumar

@Aravind1Kumar That's a very odd error; I cannot replicate it. Also, you can't reuse the css_selector I provided as an example for the other domain on this domain, because this page doesn't have anything like .api-detail. So remove those lines; then it should work.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    config = BrowserConfig(
        headless=True,
    )

    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            # css_selector=".api-detail",
            # excluded_tags=['form', 'nav','footer'],
            # wait_for="css:.api-detail",
            # delay_before_return_html=2
        )
        result = await crawler.arun(
            url="https://tmotions.com/success-stories/autocar/",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

unclecode avatar Jan 15 '25 14:01 unclecode