wayback-machine-downloader
[Feature Request] Rewrite every URL to a relative URL after download is complete
Several users run into issues when the archived website contained absolute URLs (e.g. http://example.com/style.css), which makes images, styles, and even page links appear broken when the downloaded copy is opened locally. See for example: https://github.com/hartator/wayback-machine-downloader/issues/6
Unfortunately, I don't have time to work on this, so it's up for grabs for anyone who wants to work on it!
The goal is to write a script that rewrites every absolute URL in every downloaded file to point to the local copy. For example, http://example.com/static/style.css should become './static/style.css' for a page at the root.
The challenge is that URLs can be more complex than that, and the script needs to know where the local copies are relative to the current file's location. It should also be a CLI option, as not everyone wants URLs to be rewritten.
Ask me any question!
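As a rough sketch of that path-awareness in Ruby (the example.com host and the file locations below are illustrative assumptions, and real pages would need more edge-case handling):

#!/usr/bin/env ruby
# Sketch: turn an absolute URL on the site's own host into a link
# relative to the HTML file that contains it.
require 'uri'
require 'pathname'

SITE_HOST = 'example.com' # assumption: the host the snapshot came from

# html_path is the downloaded file's path relative to the site root
def relativize(url, html_path)
  uri = URI.parse(url)
  return url unless uri.host == SITE_HOST        # leave foreign links alone
  target  = Pathname.new(uri.path.sub(%r{\A/}, ''))
  basedir = Pathname.new(html_path).dirname
  target.relative_path_from(basedir).to_s
end

puts relativize('http://example.com/static/style.css', 'index.html')     # => "static/style.css"
puts relativize('http://example.com/static/style.css', 'blog/post.html') # => "../static/style.css"

The interesting part is Pathname#relative_path_from, which handles the "where am I relative to the target" bookkeeping that a plain search-and-replace can't.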
Is this issue related to the fact that, e.g. in the case of wayback_machine_downloader -a -t 20150804003754 http://www.daovm.net/, all img paths are "relative" to the web page folder but begin with /, which makes them absolute and therefore no pictures are displayed?
wget's -k option does this. I'd recommend keeping compatibility with it.
I just downloaded an archive.org website. Many of the links skip the real directory and go to root. Instead of "C:/Ruby/bin/websites/archorg_site/subfolder/file", they have "/subfolder/file". This is a problem in the index.html of each subfolder. I can probably write a bash or awk script to search and replace the references. Thought I'd check first for status. My command line was "ruby.exe wayback_machine_downloader http://www.pveducation.org --from 20160305230547"
To rewrite the website, knowledge of the systematic changes made is needed. There is a need to be able to predict which files will need editing and what changes to make to which 'href' statements. I tried editing one file, but the number of hrefs is huge, and they're for several files. I can't think of a way to write a script that can distinguish between files. Can the errors be predicted, and therefore programmed into a search-and-edit script?
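For what it's worth, the root-relative case described above is predictable: a link that starts with / needs one ../ for every directory level the HTML file sits below the site root. A rough Ruby sketch of such a search-and-edit pass (the href/src regex and the in-place overwrite are simplifying assumptions, so back up first):

#!/usr/bin/env ruby
# Sketch: rewrite root-relative href/src values ("/subfolder/file")
# into links relative to each HTML file's own location.
require 'find'

site_root = ARGV[0] || '.'   # top of the downloaded site

Find.find(site_root) do |path|
  next unless path.end_with?('.html')
  depth  = path.sub(%r{\A#{Regexp.escape(site_root)}/?}, '').count('/')
  prefix = depth.zero? ? './' : '../' * depth
  html   = File.read(path)
  # naive: only touches href="/..." and src="/..." attributes,
  # and skips protocol-relative "//host/..." links
  html.gsub!(/\b(href|src)="\/(?!\/)/, "\\1=\"#{prefix}")
  File.write(path, html)
end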
Awesome tool! Any chance the relative link scanning will be working soon? This service http://waybackdownloader.com/ seems to be able to handle most relative links, even if the links start with a "/" or have no slash at all.
@relentless1 same for https://waybackdownloads.com/
We could use sed to replace them after the download is complete. I don't know Ruby at all, so I can't provide a PR, though.
You can use sed-style regexes directly in Ruby via String#scan and parse the HTML with Nokogiri.
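To make that concrete, here is a minimal sketch of the Nokogiri route (the host names, the attribute list and the in-place overwrite are assumptions; a real pass would also need to cover CSS url(...) references and other attributes):

#!/usr/bin/env ruby
# Sketch: parse each downloaded HTML file with Nokogiri and strip the
# site's own host from href/src attributes, leaving host-less paths.
require 'nokogiri'
require 'uri'
require 'find'

SITE_HOSTS = %w[example.com www.example.com]   # assumption

Find.find('.') do |path|
  next unless path.end_with?('.html')
  doc = Nokogiri::HTML(File.read(path))
  doc.css('a[href], link[href], img[src], script[src]').each do |node|
    attr = node['href'] ? 'href' : 'src'
    uri  = URI.parse(node[attr]) rescue next    # skip unparsable values
    next unless SITE_HOSTS.include?(uri.host)
    node[attr] = uri.path                       # drop scheme and host, keep the path
  end
  File.write(path, doc.to_html)
end

Being DOM-aware, this won't touch URLs that merely appear in visible text, which is the main advantage over a blind sed pass.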
Another alternative is to get the version with the Wayback Machine markings (without the_id when getting the file). There are markings from Archive.org, but the links are relative. We just have to remove the markings.
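If anyone wants to experiment with that route, the Wayback Machine normally wraps its injected toolbar between BEGIN/END WAYBACK TOOLBAR INSERT HTML comments; here is a rough, untested sketch for stripping just that block (it deliberately leaves the other injected assets and any rewritten /web/... URLs alone):

#!/usr/bin/env ruby
# Sketch: delete the Wayback Machine toolbar block from downloaded pages,
# assuming the snapshot still uses the usual BEGIN/END comment markers.
require 'find'

TOOLBAR = /<!--\s*BEGIN WAYBACK TOOLBAR INSERT\s*-->.*?<!--\s*END WAYBACK TOOLBAR INSERT\s*-->/m

Find.find('.') do |path|
  next unless path.end_with?('.html')
  html = File.read(path)
  next unless html =~ TOOLBAR
  File.write(path, html.sub(TOOLBAR, ''))
end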
A first step could be to provide basic information in the README about how to approach this "by hand" (sed), before tackling the harder problem of automating it. Also, there will always be edge cases where it won't work nicely (dynamic URLs via JavaScript, etc.).
This would indeed be a very helpful feature!
For reference, the -k option of wget, aka --convert-links, is described this way in the wget man page:
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
Each link will be changed in one of the two ways:
The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link. Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link in doc.html will be modified to point to ‘../bar/img.gif’. This kind of transformation works reliably for arbitrary combinations of directories.
The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to. Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be modified to point to http://hostname/bar/img.gif.
Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.
Note that only at the end of the download can Wget know which links have been downloaded. Because of that, the work done by ‘-k’ will be performed at the end of all the downloads.
As noted there, this feature requires knowledge of what was downloaded, and works best when integrated with the downloader. Doing it after the download is trickier.
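Spelled out, the two rules amount to a per-link decision like the following sketch (the site root, base URL and path conventions are hypothetical):

#!/usr/bin/env ruby
# Sketch of wget's two -k rules: links whose target exists locally become
# relative links; everything else is expanded to a full absolute URL.
require 'pathname'

SITE_ROOT = Pathname.new('websites/example.com')   # assumption: local download root
SITE_BASE = 'http://example.com'                   # assumption: original host

# doc_path and target_path are site-root-relative, e.g. "foo/doc.html", "bar/img.gif"
def convert_link(doc_path, target_path)
  if (SITE_ROOT + target_path).exist?
    Pathname.new(target_path).relative_path_from(Pathname.new(doc_path).dirname).to_s
  else
    "#{SITE_BASE}/#{target_path}"
  end
end

puts convert_link('foo/doc.html', 'bar/img.gif')
# => "../bar/img.gif" if bar/img.gif was downloaded, otherwise
# => "http://example.com/bar/img.gif"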
In the interim, for those on Linux who are willing to try something similar by hand, here is a very crude sed command. It generates a new file in which all occurrences of https://example.com have simply been deleted. That will typically make the references relative to the root of the webserver, so it works if the webserver is serving exactly what used to be the full site.
sed 's,https://example.com,,g' index.html > index_relative.html
Note that it doesn't make a properly relative site which could be easily combined with other relative sites in a single server.
If that seems to be a step in the right direction for your site, you can automate it across the current directory tree with the following. It recursively visits every .html file under the current directory and all directories below it, and uses the -i option of sed to replace each file with the modified version in place.
WARNING: this could mess things up, so backup first, and check all edits carefully afterwards!
find . -name '*.html' -exec sed -i 's,https://example.com,,g' {} +
FWIW, another tool for rewriting absolute to relative URLs, written in Perl, is described in "Changing all HTML absolute links to relative". I haven't tried it.