Constructing wrong URLs to crawl from anchor tags without scheme

Open KeirLoire opened this issue 8 months ago • 0 comments

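The ParallelCrawlerEngine is constructing wrong URLs to crawl. I checked the page at the Parent URI below and could not find the wrong URL anywhere in its markup. The cause is probably an <a> anchor tag whose href omits the scheme ("https://"), so the href appears to be treated as a relative path and appended to the parent URI: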

<a href="www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481"> 
    bla bla
</a>

Parent URI: https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/468046

Parsed Hyperlink (Wrong URL): https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481
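
For reference, the wrong URL is what standard relative-reference resolution (RFC 3986) yields for an href with no scheme and no leading "//". The sketch below uses Python's urllib rather than the crawler's own code, purely to reproduce the result. `resolve_href` is a hypothetical helper, not part of AbotX, and its "starts with www." check is only an assumed heuristic for guessing that the author meant an absolute URL.

```python
from urllib.parse import urljoin, urlsplit

PARENT = "https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/468046"
HREF = ("www.thelawyermag.com/au/best-in-law/"
        "best-legal-tech-and-legal-service-providers-in-australia-"
        "and-new-zealand-service-provider-awards/467481")

# Standard resolution: the href has no scheme and no leading "//", so it is
# a relative path and replaces the last segment of the parent path.
print(urljoin(PARENT, HREF))


def resolve_href(parent: str, href: str) -> str:
    """Hypothetical workaround: treat hrefs that start with "www." as
    scheme-less absolute URLs and borrow the parent's scheme.

    This is a heuristic and would misread a real relative path whose
    first segment happens to start with "www.".
    """
    href = href.strip()
    if href.lower().startswith("www."):
        scheme = urlsplit(parent).scheme or "https"
        href = f"{scheme}://{href}"
    return urljoin(parent, href)


# With the heuristic, the link resolves to the page the author intended.
print(resolve_href(PARENT, HREF))
```

The first print reproduces the "Parsed Hyperlink" above, which suggests the crawler is resolving the link by the standard rules and that the page markup is what is malformed. Whether the crawler should apply a heuristic like the second one is a design decision, since it departs from the standard behavior.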

KeirLoire, Jun 05 '24 10:06