crawl4ai
[Bug]: Relative links not working with BFS Deepcrawl
crawl4ai version
0.6.1
Expected Behavior
The website https://www.gpc-tec.de redirects to https://www.gpc-tec.de/start.html. The links on this page are relative, e.g. IMPRESSUM. Clicking it leads to https://www.gpc-tec.de/impressum.html, which shows the correct page. The BFS deep crawl should resolve this link the same way a click in the browser does.
Current Behavior
The BFS deep crawl, however, resolves this link to https://www.gpc-tec.de/start.html/impressum.html, which is wrong and leads to a 404.
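To illustrate the difference, here is a minimal sketch using Python's standard `urllib.parse.urljoin`, which follows the usual browser rule of replacing the last path segment of the base URL. The `href` value `impressum.html` is an assumption about the page's markup, and the `naive_join` helper is hypothetical: it only mimics the observed output and is not taken from the crawl4ai source.

```python
from urllib.parse import urljoin

# Final URL after the redirect; relative links are resolved against it.
BASE_URL = "https://www.gpc-tec.de/start.html"
# Assumption: the IMPRESSUM link is written as a plain relative href.
HREF = "impressum.html"


def naive_join(base: str, href: str) -> str:
    """Hypothetical helper: treats the base as a directory and appends the href.

    This reproduces the URL reported under "Current Behavior".
    """
    return base.rstrip("/") + "/" + href


# Standard resolution: the last path segment (start.html) is replaced.
expected = urljoin(BASE_URL, HREF)
# Appending instead of replacing yields the broken URL.
observed = naive_join(BASE_URL, HREF)

print("expected:", expected)
print("observed:", observed)
```

The first line printed matches the URL a browser opens on click; the second matches the 404 URL produced by the deep crawl.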
Is this reproducible?
Yes
Inputs Causing the Bug
- Using the Docker REST crawl endpoint with this body:
{
  "urls": ["https://www.gpc-tec.de"],
  "browser_config": {"type": "BrowserConfig", "params": {"headless": true}},
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "cache_mode": "BYPASS",
      "deep_crawl_strategy": {
        "type": "BFSDeepCrawlStrategy",
        "params": {
          "max_depth": 1,
          "max_pages": 10,
          "filter_chain": {
            "type": "FilterChain",
            "params": {"filters": [
              {"type": "ContentTypeFilter", "params": {
                "allowed_types": ["text/html"]}}
            ]}
          },
          "url_scorer": {
            "type": "CompositeScorer",
            "params": {
              "scorers": [
                {
                  "type": "KeywordRelevanceScorer",
                  "params": {"keywords": ["Impressum","Career","Karriere","Imprint"], "weight": 1.5}
                },
                {
                  "type": "PathDepthScorer",
                  "params": {"optimal_depth": 1, "weight": -0.1}
                }
              ]
            }
          }
        }
      },
      "markdown_generator": {
        "type": "DefaultMarkdownGenerator",
        "params": {
          "content_filter": {
            "type": "PruningContentFilter",
            "params": {
              "threshold": 0.6,
              "threshold_type": "relative"
            }
          }
        }
      },
      "locale":"de-DE",
      "delay_before_return_html": 3
    }
  }
}
Steps to Reproduce
1. Start the Docker container.
2. Call the crawl endpoint with the body provided above.
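Step 2 could be scripted roughly as follows, using only the Python standard library. The host, port, and path in `CRAWL_URL` are assumptions about a default local deployment, and `build_crawl_request` and `run_crawl` are hypothetical helper names; adjust them to match your container.

```python
import json
import urllib.request

# Assumption: the container exposes its REST API on localhost:11235 and the
# crawl endpoint is /crawl. Change this to match your deployment.
CRAWL_URL = "http://localhost:11235/crawl"


def build_crawl_request(body: dict, url: str = CRAWL_URL) -> urllib.request.Request:
    """Wrap the JSON body in a POST request for the crawl endpoint."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_crawl(body: dict, url: str = CRAWL_URL, timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_crawl_request(body, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the body from "Inputs Causing the Bug" saved to a file, for example `body.json` (a hypothetical filename), a call would look like `run_crawl(json.load(open("body.json")))`.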
Code snippets
OS
Debian Linux, running inside a Docker container
Python version
The Python version shipped in the Docker image
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response