js-crawler

How to deal with shortened URLs

Open lukasIO opened this issue 7 years ago • 3 comments

Hi,

Is there a way to retrieve the landing URL of a shortened URL such as goo.gl/89234fIASVHAS? Right now the crawler passes the shortened URL into the callback, which breaks all relative links on the crawled pages. Thanks!
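To illustrate the symptom, here is a minimal sketch of how the same relative link resolves differently depending on which URL is used as the base. The landing URL and the `resolveLink` helper are made up for the example; only the shortened URL comes from the question.

```javascript
// Hypothetical URLs: example.com stands in for the real landing site.
const shortened = 'https://goo.gl/89234fIASVHAS';
const landing = 'https://example.com/articles/2017/post.html';

// Hypothetical helper: resolve a relative href against a base URL
// using the standard WHATWG URL API (global in Node.js and browsers).
function resolveLink(href, baseUrl) {
  return new URL(href, baseUrl).href;
}

// Resolved against the shortened URL, the link points at the shortener's host.
const wrong = resolveLink('../images/logo.png', shortened);
// Resolved against the landing URL, it points at the real site.
const right = resolveLink('../images/logo.png', landing);

console.log(wrong); // https://goo.gl/images/logo.png
console.log(right); // https://example.com/articles/images/logo.png
```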

lukasIO, Apr 14 '17 16:04

From a quick look, it seems that bit.ly uses the status 301 "Moved permanently" and goo.gl uses 307 "Internal redirect". I will need to investigate the case of URL shorteners a bit more.
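As a rough sketch of what following such redirects by hand could look like (this is not js-crawler's actual implementation): the function below walks a chain of redirect responses until it reaches a non-redirect status. The names `resolveLandingUrl`, `head`, `fakeHead`, and all URLs are assumptions made for the example, and the HTTP lookup is injected so that the sketch runs offline.

```javascript
// Status codes that signal a redirect carrying a Location header.
const REDIRECT_STATUSES = new Set([301, 302, 303, 307, 308]);

// Follows a redirect chain and returns the final (landing) URL.
// `head` is a hypothetical injected function:
//   (url) => Promise<{ status: number, location?: string }>
async function resolveLandingUrl(startUrl, head, maxHops = 10) {
  let current = startUrl;
  for (let hop = 0; hop < maxHops; hop++) {
    const { status, location } = await head(current);
    if (!REDIRECT_STATUSES.has(status) || !location) {
      return current; // not a redirect: this is the landing URL
    }
    // Location may be relative, so resolve it against the URL just requested.
    current = new URL(location, current).href;
  }
  throw new Error(`Too many redirects starting from ${startUrl}`);
}

// Offline stand-in for a real HTTP client, using made-up URLs.
const fakeResponses = {
  'https://short.example/abc': { status: 301, location: 'https://www.example.com/landing/' },
  'https://www.example.com/landing/': { status: 200 },
};
const fakeHead = async (url) => fakeResponses[url] || { status: 404 };

resolveLandingUrl('https://short.example/abc', fakeHead)
  .then((finalUrl) => console.log(finalUrl));
```

In real use, `head` would wrap an HTTP client configured not to follow redirects automatically. The returned URL is then the right base for resolving relative links on the page.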

amoilanen, Apr 16 '17 20:04

Thanks for your reply. Do you have any advice on how to work around it for now?

lukasIO, Apr 26 '17 23:04

> Right now the crawler will pass the shortened url into the callback

I fixed this part, added a unit test, and published a new version of the crawler, 0.3.19.

However, I could not reproduce the original issue, in which passing the first URL into the onSuccess callback would cause problems with relative URLs:

> which messes up all relative links on the crawled pages...

Please let me know whether the problem has been fixed by the recent changes.

amoilanen, Apr 28 '17 17:04