polipus
Edit the regular expression in charge of removing anchors: simply add a colon
I found that URLs containing anchors like "#sku:123" (i.e. anchors with a colon) were not cleaned up when passed to the to_absolute
method. As a consequence, the anchor was escaped and the resulting URLs were added to the crawler's queue, which led to 404 errors. This kind of bug is related to the issue I described here.
To fix it, this commit adds a colon to the character class of the regular expression used to strip anchors from URLs in the to_absolute
method.
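
To illustrate the behaviour, here is a minimal sketch in plain Ruby. The two patterns and the strip_anchor helper are hypothetical and written only for this example; they are not copied from polipus, whose actual expression may differ. The sketch only shows why an anchor containing a colon survives a character class that lacks one.

```ruby
# Hypothetical patterns for illustration only; not the actual polipus code.
# Both match a trailing "#anchor" made of the listed characters.
ANCHOR_WITHOUT_COLON = /#[a-zA-Z0-9_-]*$/
ANCHOR_WITH_COLON    = /#[a-zA-Z0-9_:-]*$/

# Hypothetical helper: removes a trailing anchor matched by the given pattern.
def strip_anchor(url, pattern)
  url.sub(pattern, '')
end

url = 'http://example.com/product#sku:123'

# The colon is not in the character class, so nothing is removed.
puts strip_anchor(url, ANCHOR_WITHOUT_COLON) # => http://example.com/product#sku:123

# With the colon added, the whole anchor is stripped.
puts strip_anchor(url, ANCHOR_WITH_COLON)    # => http://example.com/product
```

With the first pattern the URL keeps its anchor, which is the situation that produced the escaped URLs and 404 errors described above.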