[Bug]: Relative links not working with BFS Deepcrawl

Open saschalev opened this issue 6 months ago • 0 comments

crawl4ai version

0.6.1

Expected Behavior

The website https://www.gpc-tec.de redirects to https://www.gpc-tec.de/start.html. The links on this page are relative, e.g. IMPRESSUM. Clicking it leads to https://www.gpc-tec.de/impressum.html, which shows the correct page. The BFS deep crawl should resolve this link the same way a browser does when the link is clicked.

Current Behavior

The BFS deep crawl, however, resolves this link to https://www.gpc-tec.de/start.html/impressum.html, which is wrong and leads to a 404.
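For reference, the resolution a browser performs follows RFC 3986, which Python's `urllib.parse.urljoin` implements. The sketch below assumes the link's `href` is the relative reference `impressum.html` (the report only shows the link text, so this is an assumption) and contrasts standard resolution with a naive concatenation that reproduces the URL reported above. The `naive_join` helper is hypothetical and only illustrates the symptom; it is not taken from crawl4ai's code.

```python
from urllib.parse import urljoin

# URL of the page after the redirect (taken from the report above).
BASE_URL = "https://www.gpc-tec.de/start.html"
# Assumed href of the "IMPRESSUM" link; the report only shows the link text.
HREF = "impressum.html"


def naive_join(base: str, href: str) -> str:
    """Hypothetical faulty resolution: treats the base URL as a directory."""
    return base.rstrip("/") + "/" + href


def browser_join(base: str, href: str) -> str:
    """RFC 3986 resolution: the last path segment of the base is replaced."""
    return urljoin(base, href)


if __name__ == "__main__":
    print(browser_join(BASE_URL, HREF))  # https://www.gpc-tec.de/impressum.html
    print(naive_join(BASE_URL, HREF))    # https://www.gpc-tec.de/start.html/impressum.html
```

If the crawler resolved discovered links against the final (post-redirect) page URL with `urljoin`, the first result is what would be expected here.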

Is this reproducible?

Yes

Inputs Causing the Bug

- Using the Docker REST crawl endpoint with this body:

{
        "urls": ["https://www.gpc-tec.de"],
        "browser_config": {"type": "BrowserConfig", "params": {"headless": true}},
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {
                "cache_mode": "BYPASS",
                "deep_crawl_strategy": {
                    "type": "BFSDeepCrawlStrategy",
                    "params": {
                        "max_depth": 1,
                        "max_pages": 10,
                        "filter_chain": {
                            "type": "FilterChain",
                            "params": {"filters": [
                                {"type": "ContentTypeFilter", "params": {
                                    "allowed_types": ["text/html"]}}
                            ]}
                        },
                        "url_scorer": {
                            "type": "CompositeScorer",
                            "params": {
                                "scorers": [
                                    {   
                                        "type": "KeywordRelevanceScorer",
                                        "params": {"keywords": ["Impressum","Career","Karriere","Imprint"], "weight": 1.5}
                                    },
                                    {   
                                        "type": "PathDepthScorer",
                                        "params": {"optimal_depth": 1, "weight": -0.1}
                                    }
                                ]
                            }
                        }
                    }
                },
                "markdown_generator": {
                    "type": "DefaultMarkdownGenerator",
                    "params": {
                        "content_filter": {
                            "type": "PruningContentFilter",
                            "params": {
                                "threshold": 0.6,
                                "threshold_type": "relative"
                            }
                        }
                    }
                },
                "locale":"de-DE",
                "delay_before_return_html": 3
            }
        }
    }

Steps to Reproduce

1. Start the Docker container.
2. Call the crawl endpoint with the body provided above.
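The two steps can be scripted roughly as follows. This is a minimal sketch using only the standard library; the endpoint address `http://localhost:11235/crawl` is an assumption about the local container setup and is not stated in this report, so adjust host, port, and path as needed. The helper names `build_crawl_request` and `send_crawl_request` are hypothetical, and the response is assumed to be JSON.

```python
import json
import urllib.request

# Assumed address of the locally running container; not stated in the report.
CRAWL_ENDPOINT = "http://localhost:11235/crawl"


def build_crawl_request(body: dict, endpoint: str = CRAWL_ENDPOINT) -> urllib.request.Request:
    """Build a JSON POST request for the crawl endpoint without sending it."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_crawl_request(body: dict, endpoint: str = CRAWL_ENDPOINT) -> dict:
    """Send the request and return the parsed response (assumed to be JSON)."""
    request = build_crawl_request(body, endpoint)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Shortened body for illustration; use the full body listed under
    # "Inputs Causing the Bug" to reproduce the issue.
    body = {"urls": ["https://www.gpc-tec.de"]}
    print(json.dumps(send_crawl_request(body), indent=2))
```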

Code snippets


OS

Linux Debian, running inside a Docker-Container

Python version

The Python version shipped with the Docker image

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

saschalev • May 18 '25 14:05