
flag for setting "save error pages" to no

Open test2a opened this issue 5 years ago • 18 comments

Hi. Backing up Twitter is throwing error pages, but if we manually add the URL on http://web.archive.org/save and uncheck "save error pages", the page is saved. It must be a problem with Twitter or something. Anyway, is there a flag with which we can unset this error-page setting? I am hoping this would help.

test2a avatar Apr 22 '20 13:04 test2a

To answer the question directly: no, that flag is not currently supported. But I would be interested in getting it to work!

Some Exploration

Looks like when the box is unchecked, the request body is just a url parameter. With it checked, it also sends capture_all=on.

  • Without: url=twitter.com%2Fnoahpinion
  • With: url=twitter.com%2Fnoahpinion&capture_all=on

I'll see what I can do. I noticed they have an API now, so I've sent an email to Archive.org for more information.
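
In the meantime, here is a rough sketch of what replaying that form request without capture_all might look like, based only on the parameters above; whether the /save endpoint accepts this outside the website form is an assumption on my part:

import requests

# Sketch: mimic the Save Page Now form submission observed above.
# Leaving out capture_all is assumed to behave like unchecking "save error pages".
response = requests.post(
    "https://web.archive.org/save/",
    data={"url": "twitter.com/noahpinion"},  # add "capture_all": "on" to include error pages
)
response.raise_for_status()
print(response.status_code)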

agude avatar Apr 23 '20 04:04 agude

That's great. I have a further question, but I didn't want to start another issue, so I'll ask here. If I have to send a bunch of URLs to back up at once, can I use a text file? All I can find is about using an XML file. So, if I created an XML file with the URLs, would that work?

Edit: oh, my bad, I found what I was looking for. Thanks anyway.

test2a avatar Apr 24 '20 07:04 test2a

Yes, there is, and I don't yet have it documented in the README, oops!

Here is how:

Create a text file with one url per line, like this:

https://google.com
https://amazon.com

Let's say that's named urls.txt. Then call the script like this:

archiver --file urls.txt

If you are saving a large number of pages, you might want to set --rate-limit-wait to a large number, because Archive.org will rate limit and then block you if you hammer them too hard, too fast. I've had it happen to me, which is why the default wait in the script is 5 seconds.
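
For what it's worth, if your URLs start life inside a script rather than in a file, a rough sketch of gluing the two together might look like this (the file name and the 60-second wait are just placeholder choices):

import subprocess

# Write the URLs to a file, one per line, then hand the file to the archiver CLI
# with a generous wait between requests to stay under the rate limit.
urls = ["https://google.com", "https://amazon.com"]

with open("urls.txt", "w") as handle:
    handle.write("\n".join(urls) + "\n")

subprocess.run(
    ["archiver", "--file", "urls.txt", "--rate-limit-wait", "60"],
    check=True,  # raise if archiver exits with an error
)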

agude avatar Apr 24 '20 14:04 agude

Oh, I am using a script to get a list of URLs and, yeah, I used exactly that; found it in the --help, actually. Anyway, when I tried to do

from wayback_machine_archiver import archiver

and then

archiver (variable)

it says

archiver (variable) TypeError: 'module' object is not callable

Then I further tried

wayback_machine_archiver.archiver (url)

but it resulted in

wayback_machine_archiver.archiver (url) NameError: name 'wayback_machine_archiver' is not defined

Now, I have managed to bypass this error by writing my "url" variable, which prints on screen, to a text file that I then feed into archiver using

archiver --file textfilename

Is it possible to make archiver accept the URLs via a variable that outputs them one line at a time?

Thanks a bunch, it's really appreciated.

url is my variable that contains the list of URLs.

test2a avatar Apr 24 '20 14:04 test2a

Oh, also, do we need to do some sort of delta check with the text file to see if the URLs have already been archived, or does the Wayback Machine accept the whole thing just like that? Second, isn't there some sort of feedback? Saved this URL, didn't save this URL, something else?

I'm sorry for bugging you with these trivial things.

test2a avatar Apr 24 '20 14:04 test2a

Oh, good questions. Let me try to summarize them and answer:

Can I use archiver as a library so my own script can easily back up URLs?

Not in its current state. You could import the individual functions and write a little glue code, though. Something like:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from wayback_machine_archiver import format_archive_url, call_archiver

# Set up the requests session
session = requests.Session()

retries = Retry(
    total=5,
    backoff_factor=5,
    status_forcelist=[500, 502, 503, 504],
)

session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# Backup a URL that is stored in a string called url
formatted_url = format_archive_url(url)
call_archiver(formatted_url, rate_limit_wait=5, session=session)

Do I have to check for duplicate URLs?

~Yes. I don't check for duplicates and neither does Internet Archive (except they might return an error, because they don't allow more than one backup per unique URL per 10 minutes).~

~I think it would be reasonable for me to de-duplicate the URLs before archiving them though. I'll open a bug for that and fix it tonight.~

If you're using the script on the command line, no you do not. I now check for duplicates starting in version 1.6.0.

That, of course, won't help if you use the code snippet above to call it as a library.
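
If you do go the library route, a rough sketch of de-duplicating first, reusing the session, format_archive_url, and call_archiver from the snippet above, could look like this:

# Reuses `session`, `format_archive_url`, and `call_archiver` from the snippet above.
# The library path does not de-duplicate for you, so drop exact duplicates first.
raw_urls = ["https://google.com", "https://google.com", "https://amazon.com"]
unique_urls = set(raw_urls)

for url in unique_urls:
    formatted_url = format_archive_url(url)
    call_archiver(formatted_url, rate_limit_wait=5, session=session)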

One Last Thought...

If you're on Linux, and your program is outputting URLs on stdout, you could do something like:

test2a_program | xargs archiver

This assumes you output all of the URLs at once; if you output them one after the other, you could use sponge (from moreutils):

test2a_program | sponge | xargs archiver

agude avatar Apr 26 '20 05:04 agude

Twitter is not working. The page saves, but it says "not found" in the snapshot. For the past few days, I am also not seeing a snapshot even when I click the button on the website.

test2a avatar May 21 '20 02:05 test2a

Does it work if you go to the Wayback Machine website and archive the Twitter page?

I haven't changed anything in the script, so it's possible they changed something on the backend.

agude avatar May 21 '20 03:05 agude

Nah, even the website web.archive.org/save doesn't work. A week or so ago I was able to uncheck the error-page box and get a snapshot, but both seem unresponsive today.

test2a avatar May 21 '20 05:05 test2a

I am still testing. I was able to save a Twitter link, but photos aren't coming up. I will continue testing more URLs and report my findings.

test2a avatar May 21 '20 06:05 test2a

Hi all, I see that you talk about duplicate URLs above.

When the archiver runs web.archive.org/save/, what happens if the URL was already archived? Does it replace the previous version, or is a new time-stamped snapshot added?

Another question: is it possible to recover the time-stamp when saving?

Thanks!

lauhaide avatar Aug 11 '20 18:08 lauhaide

@lauhaide I think archiver saves a new copy of the URL, with the current date and time, in the Wayback Machine, so you can see pages over time; no, it does not overwrite anything.

For the second one, I am not sure what you mean. When the page is saved, the metadata is saved along with it, so you can see that.

test2a avatar Aug 11 '20 18:08 test2a

Hi @lauhaide!

This script doesn't overwrite, as it were, because it asks The Wayback Machine to save a snapshot of the current page. As you can see here with the Yahoo.com archive, there are multiple snapshots stored each day.

As for recovering the timestamp, you could get that from the Wayback Machine itself (you'll see each snapshot is timestamped on the Yahoo page for example), but that's not something this tool supports.

If you are on Linux, you could do something like this:

archiver https://yahoo.com --log DEBUG 2>&1 | ts '[%Y-%m-%d %H:%M:%S]'

That would timestamp every line of the debug output like this:

[2020-08-11 13:17:12] DEBUG:root:Arguments: Namespace(archive_sitemap=False, file=None, jobs=1, log_file=None, log_level='DEBUG', rate_limit_in_sec=5, sitemaps=[], urls=['https://yahoo.com'])
[2020-08-11 13:17:12] INFO:root:Adding page URLs to archive
[2020-08-11 13:17:12] DEBUG:root:Page URLs to archive: ['https://yahoo.com']
[2020-08-11 13:17:12] DEBUG:root:Creating archive URL for https://yahoo.com
[2020-08-11 13:17:12] INFO:root:Parsing sitemaps
[2020-08-11 13:17:12] DEBUG:root:Archive URLs: {'https://web.archive.org/save/https://yahoo.com'}
[2020-08-11 13:17:13] DEBUG:root:Sleeping for 5
[2020-08-11 13:17:18] INFO:root:Calling archive url https://web.archive.org/save/https://yahoo.com
[2020-08-11 13:17:18] DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): web.archive.org:443
[2020-08-11 13:17:18] DEBUG:urllib3.connectionpool:https://web.archive.org:443 "HEAD /save/https://yahoo.com HTTP/1.1" 301 0
[2020-08-11 13:17:30] DEBUG:urllib3.connectionpool:https://web.archive.org:443 "HEAD /save/https://www.yahoo.com/ HTTP/1.1" 200 0

You could use that to get a rough timestamp.

I will add: I don't consider DEBUG messages as part of the public API, so I might break your script with a minor update, but probably won't. :-)
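
If you want a machine-readable timestamp rather than scraping the debug log, one option outside this tool is the Wayback Machine availability API; a rough sketch:

import requests

# Ask the availability API for the snapshot closest to "now" for a given URL.
response = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "https://yahoo.com"},
)
response.raise_for_status()

closest = response.json().get("archived_snapshots", {}).get("closest", {})
print(closest.get("timestamp"))  # e.g. "20200811131712", formatted YYYYMMDDhhmmss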

agude avatar Aug 11 '20 20:08 agude

Thanks @agude, @test2a for your prompt replies. It's clear to me now: save will add a new backup of the URL.

As for the timestamp for later retrieval, it could be something like this (the time of day doesn't seem to be necessary):

http://web.archive.org/web/20200811*/URL

One last question about --rate-limit-wait (as mentioned in the posts above): for a large number of pages to archive, what minimum value would you recommend?

PS. no problem with the code update :-)

lauhaide avatar Aug 12 '20 08:08 lauhaide

I run this script to back up my personal site every evening. It's about 100 pages, and I run with --rate-limit-wait=60. It completes most of the time, but every few weeks it'll error out due to rate limiting from the Internet Archive.

So I don't have an exact number for you, but I would say closer to 30-60 seconds than 1-2. :-)

agude avatar Aug 12 '20 16:08 agude

Thanks @agude, I had started running with --rate-limit-wait=5 and it is still running. Will it log if a request gets an error?

lauhaide avatar Aug 13 '20 18:08 lauhaide

@lauhaide: The program will throw an error and terminate when it fails. Right here:

https://github.com/agude/wayback-machine-archiver/blob/master/wayback_machine_archiver/archiver.py#L36-L41
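
If you are driving the functions from your own script and would rather log the failure and keep going, a rough sketch, assuming the failure surfaces as a requests exception (which is what the linked code suggests), might be:

import logging
import requests

# Assumes `unique_urls`, `session`, `format_archive_url`, and `call_archiver`
# are set up as in the earlier snippets in this thread.
for url in unique_urls:
    try:
        call_archiver(format_archive_url(url), rate_limit_wait=5, session=session)
    except requests.exceptions.RequestException:
        # Log the failure and move on instead of terminating the whole run.
        logging.exception("Failed to archive %s", url)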

agude avatar Aug 14 '20 02:08 agude

Thanks a lot @agude !!!

lauhaide avatar Aug 14 '20 17:08 lauhaide