abotx
Wrong URLs to crawl are constructed from anchor tags without a scheme
The ParallelCrawlerEngine is picking up wrong URLs to crawl. I checked the page at the Parent URI and could not find the wrong URL anywhere in its source. The likely cause is an <a> anchor tag whose href omits the "https://" scheme, so the value is treated as a relative path:
<a href="www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481">
bla bla
</a>
Parent URI: https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/468046
Parsed Hyperlink (Wrong URL): https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/www.thelawyermag.com/au/best-in-law/best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-service-provider-awards/467481
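The result above matches what standard relative-URL resolution produces for an href with no scheme: the value is read as a relative path and appended to the parent's directory. The sketch below reproduces this with Python's `urllib.parse.urljoin`, purely as an illustration. It is not abotx code, and `resolve_with_www_heuristic` is a hypothetical workaround that assumes any href starting with "www." was meant to be an absolute URL.

```python
from urllib.parse import urljoin, urlsplit

PARENT = "https://www.thelawyermag.com/au/best-in-law/best-in-law-2023/468046"
HREF = (
    "www.thelawyermag.com/au/best-in-law/"
    "best-legal-tech-and-legal-service-providers-in-australia-and-new-zealand-"
    "service-provider-awards/467481"
)


def resolve(parent: str, href: str) -> str:
    """Standard resolution: an href without a scheme is a relative path."""
    return urljoin(parent, href)


def resolve_with_www_heuristic(parent: str, href: str) -> str:
    """Hypothetical workaround: treat 'www.'-prefixed hrefs as absolute URLs.

    This is a heuristic, not spec behaviour; a directory that is really
    named 'www.something' would be misread as a host.
    """
    if href.lower().startswith("www."):
        # Borrow the parent's scheme, falling back to https.
        scheme = urlsplit(parent).scheme or "https"
        href = f"{scheme}://{href}"
    return urljoin(parent, href)


if __name__ == "__main__":
    print(resolve(PARENT, HREF))
    print(resolve_with_www_heuristic(PARENT, HREF))
```

If a workaround of this kind is wanted, it would have to run before the link is resolved against the parent page, since the resolved form no longer shows that the scheme was missing.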