how to parse attachments/files and download them!
Hi Harbour Master,
yet another brilliant package from you! I wonder if there is an easy way to pull all the files for an archived website from the wayback archive. For instance, something like "get all the .pdfs from all snapshots (in a given time range) of this website".
I do these kinds of queries manually on the wayback archive, and it is very time-consuming and annoying. Being able to do that programmatically with your package would be really nice.
What do you think? Thanks!
well, "aye" but the pkg doesn't do it yet. I've had the "scraping api" on my "todo" list for a while but haven't had the time to work on it. Ref: https://archive.org/help/aboutsearch.htm & https://archive.org/advancedsearch.php & https://archive.readme.io/docs/
Lemme see how much effort it'll take to add in support (paginated APIs on resource-constrained sites are so not-fun to work with).
@hrbrmstr amazing, that would be great. I really believe this is what most people do with the archive: "How can I get that annoying old zip file that was available 3 years ago??"
When you get some time, it'd be 👍 if you could poke at the (just added) nascent "Scrape API" calls (https://github.com/hrbrmstr/wayback/blob/master/R/ia-scrape.R) and then let me know what extra helpers I should add to support the use case.
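Something like this is the rough shape I have in mind (still nascent, so treat it as a sketch; ia_scrape() taking a free-text query is the assumption here):
library(wayback)
# sketch: query the archive.org Scrape API for items matching a search term
res <- ia_scrape("maxmind")
res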
sure of course. let me try that asap! thanks!
Give ia_retrieve() a go. I think that might be what you were looking for (just added):
https://github.com/hrbrmstr/wayback/blob/master/R/ia-retrieve.R
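Roughly along these lines (a sketch only; the identifier below is just a placeholder, and I'm assuming an archive.org item id here, so check ia-retrieve.R for the actual interface):
library(wayback)
# placeholder id; ia_retrieve() is assumed here to take an archive.org item
# identifier and return a listing of the files stored under it
files <- ia_retrieve("some-item-identifier")
files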
@hrbrmstr that seems pretty neat, but I wonder if I explained what I had in mind correctly. Imagine that you are interested in the free CSVs from maxmind.com.
Now, going to https://web.archive.org/web/*/http://maxmind.com/* (note the asterisks) shows you ALL the links on the maxmind domain that were saved in the archive. You can see that there is a field where you can filter by type, say csv or pdf.
This is hugely valuable because you can pull all the attachments from a website at once, but it is a real PITA because it has to be done manually. I wonder if your package can retrieve that information, or perhaps I have misunderstood what you did.
Thanks!
AH!
Gotcha. Let me see how that works memento/timemap-API-wise. Pretty sure I can rig up something.
Looks like there's a "new-ish" CDX parameter used in that particular online query interface that I did not have support for in the package. I've added it to the cdx_basic_query() function and (as noted below) I think it provides the assistance you were inquiring about.
Def let me know if I need to tweak this more and — if you have some time and wouldn't mind — please add yourself to the DESCRIPTION (a new person() item) as a contributor (ctb) as this was an immensely helpful suggestion and discussion.
library(wayback)
library(tidyverse)
cdx <- cdx_basic_query("http://maxmind.com/", "prefix")
filter(cdx, grepl("csv", original))
## # A tibble: 43 x 7
## urlkey timestamp original mimetype statuscode digest length
## <chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 com,maxmind… 2002-10-14 00:00:00 http://maxmind.… text/html 200 IFDVCDHMB… 2733.
## 2 com,maxmind… 2006-05-17 00:00:00 http://www.maxm… text/html 200 7HYYDOKDG… 1717.
## 3 com,maxmind… 2009-02-11 00:00:00 http://maxmind.… text/html 301 BCL36PMUW… 405.
## 4 com,maxmind… 2008-12-10 00:00:00 http://www.maxm… text/html 200 JZCCABPE7… 1962.
## 5 com,maxmind… 2009-10-18 00:00:00 http://www.maxm… text/pla… 200 WGT2VMJ6S… 957.
## 6 com,maxmind… 2003-08-15 00:00:00 http://www.maxm… text/html 404 U5A45F3Y2… 392.
## 7 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404 Y3VUK7LZQ… 413.
## 8 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404 Z4BTKJJPQ… 413.
## 9 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404 WXKDYKM67… 411.
## 10 com,maxmind… 2006-11-27 00:00:00 http://www.maxm… text/html 404 XVCYDXUBM… 421.
## # ... with 33 more rows
Hrm. I just made that a bit better by also adding in support for filtering (like the web ux has). By default it only returns items with a 200 status code.
library(wayback)
library(tidyverse)
cdx <- cdx_basic_query("http://maxmind.com/", "prefix")
(csv <- filter(cdx, grepl("\\.csv", original)))
## # A tibble: 9 x 7
## urlkey timestamp original mimetype statuscode digest length
## <chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 com,maxmind)/cityisporgsample.… 2009-10-18 00:00:00 http://www.maxmind.com:80/cityispor… text/pla… 200 WGT2VMJ6SRI… 9.57e2
## 2 com,maxmind)/download/geoip/cs… 2003-02-23 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 2QUN23TUA24… 5.60e2
## 3 com,maxmind)/download/geoip/cs… 2003-02-23 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 NTF247I5W5P… 7.86e2
## 4 com,maxmind)/download/geoip/da… 2006-01-11 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 OFDELJCECME… 1.21e6
## 5 com,maxmind)/download/geoip/da… 2006-06-20 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 3INKOCVKMG6… 1.16e6
## 6 com,maxmind)/download/geoip/da… 2007-11-11 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 E2AT3XS3YLQ… 2.95e6
## 7 com,maxmind)/download/geoip/da… 2008-07-09 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 4YRNZBZ4VFH… 3.76e6
## 8 com,maxmind)/download/geoip/da… 2008-08-13 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 HG7GQQQZUV6… 3.85e6
## 9 com,maxmind)/download/geoip/mi… 2014-03-02 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 MW7F7GGPJLG… 3.26e4
Now to work on the "download from that point in time" functionality.
Looks like it's not much more than calling read_memento():
dat <- read_memento(csv$original[9], as.POSIXct(csv$timestamp[9]), "raw")
readr::read_csv(dat, col_names = c("iso2c", "regcod", "name"))
## Parsed with column specification:
## cols(
## iso2c = col_character(),
## regcod = col_character(),
## name = col_character()
## )
## # A tibble: 4,066 x 3
## iso2c regcod name
## <chr> <chr> <chr>
## 1 AD 02 Canillo
## 2 AD 03 Encamp
## 3 AD 04 La Massana
## 4 AD 05 Ordino
## 5 AD 06 Sant Julia de Loria
## 6 AD 07 Andorra la Vella
## 7 AD 08 Escaldes-Engordany
## 8 AE 01 Abu Dhabi
## 9 AE 02 Ajman
## 10 AE 03 Dubai
## # ... with 4,056 more rows
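And if you want to grab every matched file in one pass, a quick sketch (just tidyverse glue around read_memento(), not a package helper) along these lines should do it:
library(wayback)
library(tidyverse)
cdx <- cdx_basic_query("http://maxmind.com/", "prefix")
csv <- filter(cdx, grepl("\\.csv", original))
# fetch each capture as raw bytes (assuming the "raw" mode hands back a raw
# vector of the body, as the read_csv() call above suggests) and write it to
# disk, prefixing the file name with the capture date so snapshots don't
# clobber one another
walk2(csv$original, csv$timestamp, function(u, ts) {
  dat <- read_memento(u, as.POSIXct(ts), "raw")
  out <- sprintf("%s-%s", format(as.POSIXct(ts), "%Y%m%d"), basename(u))
  writeBin(dat, out)
})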
Hi @hrbrmstr, sorry, yesterday I was super busy with the kiddos. I will try this tonight and let you know. And please, how can I claim to have contributed to the package when you did all the work??? It is simply a pleasure to share ideas and see them implemented so fast!
Thanks!
@hrbrmstr the cdx_basic_query() function looks pretty smooth. However, I wonder why searching directly on the archive website returns ~100k results, while using the API only returns 10k:
cdx <- cdx_basic_query("https://imdb.com/", "prefix")
cdx
# A tibble: 10,000 x 7
urlkey timestamp original mimetype statuscode digest length
<chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
1 com,imdb)/ 1996-11-19 00:00:00 http://imdb.com:80/ text/ht~ 200 XLXNEHRIAG~ 1725
2 com,imdb)/%23 2006-05-30 00:00:00 http://www.imdb.com:80/%23 text/ht~ 200 CUH3KMB2GO~ 837
3 com,imdb)/%23imdb2.consumer.ho~ 2009-08-19 00:00:00 http://www.imdb.com:80/%23imdb2.co~ text/ht~ 200 AYZ5SY67IR~ 688
4 com,imdb)/%23imdb2.consumer.ho~ 2009-03-11 00:00:00 http://www.imdb.com:80/%23imdb2.co~ text/ht~ 200 AYZ5SY67IR~ 673
Could we have an option to specify that we want everything? Once it is downloaded locally, it will be very easy to parse out the correct links.
What do you think?
yep, just set the limit parameter to something higher than the 10K it defaults to ;-)
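i.e. something along these lines (a sketch; how high the endpoint will actually let you go is another question):
library(wayback)
# bump the limit past the 10K default mentioned above
cdx <- cdx_basic_query("https://imdb.com/", "prefix", limit = 100000L)
nrow(cdx)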
haha nice, thanks, I overlooked that default!