Extract embedded URLs from JavaScript files
From my reading of the source code, the parser doesn't extract embedded URLs from JavaScript files. Is there a particular reason for not supporting this? Perhaps because wget doesn't support that feature? I feel it's a simple add-on that could significantly improve the fidelity of statically crawled pages.
We have had this kind of request before and rejected it, because JS often produces URLs dynamically and we don't want to run external code (e.g. via V8).
But yeah, "grepping" out static URLs with a regex could be an option. Is this what you are thinking of? Or do you have something else in mind?
That is correct. Simple regex-based grepping could significantly increase the set of URLs fetched. I am running some analysis of my own and am happy to report back with empirical data. To give an example, here is a code snippet from www.nytimes.com:
e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,
As you can see, the entire URL is statically embedded and doesn't require any JS execution to construct.
Regarding what kind of regex to use, here is how InternetArchive greps for such URLs, though I am not sure what kind of false positives/negatives they get with it. If I am able to determine a better regex as part of my analysis, I will share it here.
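To make the idea concrete, here is a minimal sketch of the kind of grepping I have in mind. It is only an illustration: the pattern and the names `STATIC_URL_RE` and `extract_static_urls` are my own assumptions, not the InternetArchive regex and not anything in the current parser. It only matches absolute http(s) URLs that appear whole inside a single- or double-quoted string literal, so it will miss URLs built by concatenation, template literals, or escaped slashes (`https:\/\/`).

```python
import re
from typing import List

# Hypothetical pattern (an assumption for illustration): an absolute
# http(s) URL that sits entirely inside one quoted string literal.
# Group 1 is the opening quote, group 2 is the URL, and \1 requires
# the same quote character to close the literal.
STATIC_URL_RE = re.compile(r"""(["'])(https?://[^"'\s\\<>]+)\1""")


def extract_static_urls(js_source: str) -> List[str]:
    """Return statically embedded URLs in order of first appearance."""
    seen = set()
    urls = []
    for match in STATIC_URL_RE.finditer(js_source):
        url = match.group(2)
        if url not in seen:  # skip duplicates so each URL is queued once
            seen.add(url)
            urls.append(url)
    return urls


# The nytimes.com snippet quoted above.
snippet = '''
e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,
'''

print(extract_static_urls(snippet))
```

On the snippet above this picks up the `tpc-check.html` URL and ignores the `"iframe"` string, with no JS execution involved. Relative URLs and protocol-relative ones (`//host/path`) are deliberately left out here, since those are where I would expect most false positives.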