Extract embedded URLs from JavaScript files

Open goelayu opened this issue 2 years ago • 2 comments

From what I understand of the source code, the parser doesn't extract embedded URLs from JavaScript files. Is there any particular reason for not supporting this? Maybe because wget doesn't support that feature? I feel it's a simple add-on that could significantly improve the fidelity of statically crawled pages.

goelayu, Mar 23 '23 23:03

We had this kind of request before and rejected it because JS often produces URLs dynamically and we don't want to run external code (e.g. via V8).

But yeah, "grepping" out static URLs with a regex could be an option. Is this what you are thinking of? Or do you have something else in mind?

rockdaboot, Mar 24 '23 11:03

> But yeah, "grepping" out static URLs with a regex could be an option. Is this what you are thinking of? Or do you have something else in mind?

That is correct. Simple regex-based grepping could significantly increase the set of URLs fetched. I am running some analysis of my own and am happy to report back with empirical data, but to give an example, here is a code snippet from www.nytimes.com:

e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,

As you can see, the entire URL is statically embedded and doesn't require any JS execution to construct.
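
To make the idea concrete, here is a minimal sketch of this kind of static extraction, written in Python for brevity. The pattern and the function name `extract_static_urls` are illustrative assumptions only; this is neither wget2 code nor the Internet Archive regex. It only looks for absolute http(s) URLs inside single- or double-quoted string literals.

```python
import re

# Hypothetical pattern (an assumption for illustration): an absolute http(s) URL
# that forms the entire content of a single- or double-quoted JS string literal.
# It does not handle template literals, escaped characters, protocol-relative
# or relative URLs, or URLs assembled by concatenation.
URL_IN_STRING = re.compile(r"""(["'])(https?://[^"'\s\\]+)\1""")


def extract_static_urls(js_source: str) -> list[str]:
    """Return URLs found in quoted string literals, in order, without duplicates."""
    seen = set()
    urls = []
    for match in URL_IN_STRING.finditer(js_source):
        url = match.group(2)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


snippet = '''
e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,
'''

print(extract_static_urls(snippet))
# → ['https://static01.nyt.com/ads/tpc-check.html']
```

A pattern this strict will miss many URLs (false negatives) and may still pick up strings that are never fetched (false positives), so it should be read as a starting point for measurement rather than a recommended regex.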

Regarding what kind of regex to use, here is how the Internet Archive greps for such URLs, though I am not sure what kind of false positives/negatives they get with this. If I am able to determine a better regex as part of my analysis, I will share it here.

goelayu, Mar 24 '23 12:03