crawl4ai
I encountered an issue where the parameters had no effect during use: css_selector and excluded_tags did nothing, and the crawl returned the entire page content.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://doc.youzanyun.com/detail/API/0/323",
            css_selector=".api-detail",
            excluded_tags=['form', 'nav', 'footer']
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
Did you solve it, brother?
no
I have the same problem and haven't solved it yet. How about you?
I also ran into the same issue and haven't solved it yet. As a temporary workaround I used a hard-coded technique: I found a start pattern and an end pattern in the content and trimmed around them. In case anyone wants it:
````python
import asyncio
import os
import re

from crawl4ai import AsyncWebCrawler

# NOTE: `links` (a list of URLs to scrape) and `output_folder` are assumed to be
# defined elsewhere; see the sketch below for one way to set them up.

async def main():
    async with AsyncWebCrawler() as crawler:
        for url in links:
            try:
                result = await crawler.arun(
                    url=url,
                    css_selector="h1:contains('RPC Method') ~ *:not(#sidebar):not(nav):not(footer):not(.Button-module_button__peGiP)",
                    excluded_tags=[
                        'script',
                        'style',
                        'button',
                        'nav',
                        'header',
                        'footer',
                        'aside',
                        'iframe'
                    ],
                    word_count_threshold=1,
                    exclude_external_links=True,
                    exclude_social_media_links=True,
                    remove_overlay_elements=True
                )
                content = result.markdown.strip()

                # Find the start of content using regex to match "# <anything> RPC Method"
                rpc_method_pattern = r"#\s+[\w]+ RPC Method"
                match = re.search(rpc_method_pattern, content)
                if match:
                    content = content[match.start():]

                # Find where to cut off the content (after the curl example)
                end_markers = [
                    "Don't have an account yet?",
                    "Get started for free",
                    "Previous",
                    "Next",
                    "Chat with our community"
                ]

                # Make sure we include the complete curl example
                curl_end = "```"
                if curl_end in content:
                    last_curl_end = content.rindex(curl_end) + len(curl_end)
                    content = content[:last_curl_end]

                # Remove any remaining content after the curl example
                for marker in end_markers:
                    if marker in content:
                        content = content[:content.index(marker)].strip()

                # Clean up any duplicate headers
                lines = content.split('\n')
                seen_headers = set()
                cleaned_lines = []
                for line in lines:
                    if line.startswith('#'):
                        # Only add a header if we haven't seen it before
                        header_key = line.lower().strip()
                        if header_key not in seen_headers:
                            seen_headers.add(header_key)
                            cleaned_lines.append(line)
                    else:
                        cleaned_lines.append(line)
                content = '\n'.join(cleaned_lines).strip()

                # Save to file
                endpoint = url.split('/')[-1]
                filename = os.path.join(output_folder, f"{endpoint}.md")
                with open(filename, "w", encoding="utf-8") as file:
                    file.write(content)
                print(f"Successfully scraped and saved: {filename}")
            except Exception as e:
                print(f"Error processing {url}: {str(e)}")
                continue

if __name__ == "__main__":
    asyncio.run(main())
````
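For context, the snippet above assumes `links` and `output_folder` already exist. A minimal, purely hypothetical setup for them might look like this (the URLs and folder name are placeholders, not from the original post):

```python
import os

# Hypothetical inputs for the workaround above: a list of doc pages to scrape
# and a folder to write the trimmed markdown files into.
links = [
    "https://example.com/docs/some_rpc_method",    # placeholder URL
    "https://example.com/docs/another_rpc_method"  # placeholder URL
]
output_folder = "scraped_docs"
os.makedirs(output_folder, exist_ok=True)
```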
@monkey-wenjun @Shadow062309 @duolaOmeng Thank you for trying Crawl4AI. In such situations, the first step is to run the crawler with headless set to false to see what is happening. If you do that, you will notice that this website has a random delay at the beginning, presumably because the page fetches its data from the backend server.
Because you set a specific CSS selector, you must first ensure that the element exists on the page. To do this, use the wait_for parameter. In the following code, once I applied wait_for, everything worked perfectly, because the crawler is instructed to wait for the presence of that element.
So whenever you target specific elements with CSS selectors, make sure to use wait_for, or consider another parameter that adds an extra delay before the HTML is returned, called delay_before_return_html. That is the more general approach; if you want to be more precise, wait_for is your solution.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    config = BrowserConfig(
        headless=True,
    )
    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            css_selector=".api-detail",
            excluded_tags=['form', 'nav', 'footer'],
            wait_for="css:.api-detail",
            # delay_before_return_html=2
        )
        result = await crawler.arun(
            url="https://doc.youzanyun.com/detail/API/0/323",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
```
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://doc.youzanyun.com/detail/API/0/323... | Status: True | Time: 2.97s
[SCRAPE].. ◆ Processed https://doc.youzanyun.com/detail/API/0/323... | Time: 91ms
[COMPLETE] ● https://doc.youzanyun.com/detail/API/0/323... | Status: True | Total: 3.07s
youzan.user.openid.get.1.0.0
计费
2020-03-18 17:21:14
API名称:获取有赞openId
API描述
API描述
根据userId(有赞账号id)查询有赞openId(注意是有赞openId,非微信openId)
公共参数
...REST OF MARKDOWN
```
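For reference, the delay-based fallback mentioned earlier might look roughly like this. It is only a sketch: the two-second value matches the commented-out line in the example above, but any fixed delay is a guess about how long the page needs to render.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            css_selector=".api-detail",
            excluded_tags=['form', 'nav', 'footer'],
            # Instead of waiting for a specific element, wait a fixed amount of
            # time before the HTML is captured (2 seconds is an assumption).
            delay_before_return_html=2,
        )
        result = await crawler.arun(
            url="https://doc.youzanyun.com/detail/API/0/323",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```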
I get `TypeError: BrowserConfig() takes no arguments`. See my code below:

```python
import asyncio
from crawl4ai import BrowserConfig, AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = BrowserConfig(
        headless=True,
    )
    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            css_selector=".api-detail",
            excluded_tags=['form', 'nav', 'footer'],
            wait_for="css:.api-detail",
            # delay_before_return_html=2
        )
        result = await crawler.arun(
            url="https://tmotions.com/success-stories/autocar/",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
@Aravind1Kumar That's a very odd error; I cannot replicate it. Also, you can't reuse the css_selector I provided as an example for the other domain on this one, because this page doesn't have anything like .api-detail. Remove those lines and it should work.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def main():
    config = BrowserConfig(
        headless=True,
    )
    async with AsyncWebCrawler(config=config) as crawler:
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            # css_selector=".api-detail",
            # excluded_tags=['form', 'nav', 'footer'],
            # wait_for="css:.api-detail",
            # delay_before_return_html=2
        )
        result = await crawler.arun(
            url="https://tmotions.com/success-stories/autocar/",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
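If a scoped extraction is still wanted for the new domain, one way to find a usable selector is to crawl once without css_selector, inspect the returned HTML for a stable container, and only then scope the crawl to it. The sketch below assumes this workflow; the `.case-study-content` selector is purely hypothetical and only illustrates where a real class name would go.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # First pass: no css_selector, just fetch the page and look at its structure.
        probe = await crawler.arun(
            url="https://tmotions.com/success-stories/autocar/",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        print(probe.html[:2000])  # inspect the markup to pick a real container class

        # Second pass: once a real container class is known, scope the crawl to it
        # and wait for it, as suggested earlier in the thread.
        scoped = await crawler.arun(
            url="https://tmotions.com/success-stories/autocar/",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                css_selector=".case-study-content",   # hypothetical selector
                wait_for="css:.case-study-content",   # hypothetical selector
            )
        )
        print(scoped.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```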