how to parse attachments/files and download them!
Hi Harbour Master,
yet another brilliant package from you! I wonder if there is an easy way to pull all the files for an archived website from the wayback archive. For instance, something like "get all the .pdfs from all snapshots (in a given time range) of this website".
I do these kinds of queries manually on the wayback archive, and it is very time-consuming and annoying. Being able to do that programmatically with your package would be really nice.
What do you think? Thanks!
well, "aye" but the pkg doesn't do it yet. I've had the "scraping api" on my "todo" list for a while but haven't had the time to work on it. Ref: https://archive.org/help/aboutsearch.htm & https://archive.org/advancedsearch.php & https://archive.readme.io/docs/
Lemme see how much effort it'll take to add in support (paginated APIs on resource-constrained sites are so not-fun to work with).
@hrbrmstr amazing, that would be great. I really believe this is what most people do with the archive: "How can I get that annoying old zip file that was available 3 years ago??"
When you get some time, it'd be 👍 if you could poke at the (just added) nascent "Scrape API" calls (https://github.com/hrbrmstr/wayback/blob/master/R/ia-scrape.R) and then let me know what extra helpers I should add to support the use case.
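Something like this is the rough shape I have in mind (still nascent, so treat it as a sketch; ia_scrape() taking a free-text query is the assumption here):
library(wayback)
# sketch: query the archive.org Scrape API for items matching a search term
res <- ia_scrape("maxmind")
res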
sure of course. let me try that asap! thanks!
Give ia_retrieve() a go. I think that might be what you were looking for (just added):
https://github.com/hrbrmstr/wayback/blob/master/R/ia-retrieve.R
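Roughly along these lines (a sketch only; the identifier below is just a placeholder, and I'm assuming an archive.org item id here, so check ia-retrieve.R for the actual interface):
library(wayback)
# placeholder id; ia_retrieve() is assumed here to take an archive.org item
# identifier and return a listing of the files stored under it
files <- ia_retrieve("some-item-identifier")
files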
@hrbrmstr that seems pretty neat, but I wonder if I explained what I had in mind correctly. Imagine that you are interested in the free CSVs from maxmind.com.
Now, going to https://web.archive.org/web/*/http://maxmind.com/* (note the asterisks) shows you ALL the links on the maxmind domain that were saved in the archive. You can see that there is a field where you can filter by type, say csv or pdf.
This is hugely valuable because you can pull all the attachments from a website at once, but it is a real PITA because it has to be done manually. I wonder if your package can retrieve that information, or perhaps I have misunderstood what you did.
Thanks!
AH!
Gotcha. Let me see how that works memento/timemap-API-wise. Pretty sure I can rig up something.
Looks like there's a "new-ish" CDX parameter used in that particular online query interface that I did not have support for in the package. I've added it to the cdx_basic_query() function and (as noted below) I think it provides the assistance you were inquiring about.
Def let me know if I need to tweak this more and — if you have some time and wouldn't mind — please add yourself to the DESCRIPTION (a new person() item) as a contributor (ctb) as this was an immensely helpful suggestion and discussion.
library(wayback)
library(tidyverse)
cdx <- cdx_basic_query("http://maxmind.com/", "prefix")
filter(cdx, grepl("csv", original))
## # A tibble: 43 x 7
## urlkey timestamp original mimetype statuscode digest length
## <chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 com,maxmind… 2002-10-14 00:00:00 http://maxmind.… text/html 200 IFDVCDHMB… 2733.
## 2 com,maxmind… 2006-05-17 00:00:00 http://www.maxm… text/html 200 7HYYDOKDG… 1717.
## 3 com,maxmind… 2009-02-11 00:00:00 http://maxmind.… text/html 301 BCL36PMUW… 405.
## 4 com,maxmind… 2008-12-10 00:00:00 http://www.maxm… text/html 200 JZCCABPE7… 1962.
## 5 com,maxmind… 2009-10-18 00:00:00 http://www.maxm… text/pla… 200 WGT2VMJ6S… 957.
## 6 com,maxmind… 2003-08-15 00:00:00 http://www.maxm… text/html 404 U5A45F3Y2… 392.
## 7 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404 Y3VUK7LZQ… 413.
## 8 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404 Z4BTKJJPQ… 413.
## 9 com,maxmind… 2009-07-05 00:00:00 http://www.maxm… text/html 404 WXKDYKM67… 411.
## 10 com,maxmind… 2006-11-27 00:00:00 http://www.maxm… text/html 404 XVCYDXUBM… 421.
## # ... with 33 more rows
Hrm. I just made that a bit better by also adding in support for filtering (like the web ux has). By default it only returns items with a 200 status code.
library(wayback)
library(tidyverse)
cdx <- cdx_basic_query("http://maxmind.com/", "prefix")
(csv <- filter(cdx, grepl("\\.csv", original)))
## # A tibble: 9 x 7
## urlkey timestamp original mimetype statuscode digest length
## <chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 com,maxmind)/cityisporgsample.… 2009-10-18 00:00:00 http://www.maxmind.com:80/cityispor… text/pla… 200 WGT2VMJ6SRI… 9.57e2
## 2 com,maxmind)/download/geoip/cs… 2003-02-23 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 2QUN23TUA24… 5.60e2
## 3 com,maxmind)/download/geoip/cs… 2003-02-23 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 NTF247I5W5P… 7.86e2
## 4 com,maxmind)/download/geoip/da… 2006-01-11 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 OFDELJCECME… 1.21e6
## 5 com,maxmind)/download/geoip/da… 2006-06-20 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 3INKOCVKMG6… 1.16e6
## 6 com,maxmind)/download/geoip/da… 2007-11-11 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 E2AT3XS3YLQ… 2.95e6
## 7 com,maxmind)/download/geoip/da… 2008-07-09 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 4YRNZBZ4VFH… 3.76e6
## 8 com,maxmind)/download/geoip/da… 2008-08-13 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 HG7GQQQZUV6… 3.85e6
## 9 com,maxmind)/download/geoip/mi… 2014-03-02 00:00:00 http://www.maxmind.com:80/download/… text/pla… 200 MW7F7GGPJLG… 3.26e4
Now to work on the "download from that point in time" functionality.
Looks like it's not much more than calling read_memento():
dat <- read_memento(csv$original[9], as.POSIXct(csv$timestamp[9]), "raw")
readr::read_csv(dat, col_names = c("iso2c", "regcod", "name"))
## Parsed with column specification:
## cols(
## iso2c = col_character(),
## regcod = col_character(),
## name = col_character()
## )
## # A tibble: 4,066 x 3
## iso2c regcod name
## <chr> <chr> <chr>
## 1 AD 02 Canillo
## 2 AD 03 Encamp
## 3 AD 04 La Massana
## 4 AD 05 Ordino
## 5 AD 06 Sant Julia de Loria
## 6 AD 07 Andorra la Vella
## 7 AD 08 Escaldes-Engordany
## 8 AE 01 Abu Dhabi
## 9 AE 02 Ajman
## 10 AE 03 Dubai
## # ... with 4,056 more rows
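And if you want to grab every matched file in one pass, a quick sketch (just tidyverse glue around read_memento(), not a package helper) along these lines should do it:
library(wayback)
library(tidyverse)
cdx <- cdx_basic_query("http://maxmind.com/", "prefix")
csv <- filter(cdx, grepl("\\.csv", original))
# fetch each capture as raw bytes (assuming the "raw" mode hands back a raw
# vector of the body, as the read_csv() call above suggests) and write it to
# disk, prefixing the file name with the capture date so snapshots don't
# clobber one another
walk2(csv$original, csv$timestamp, function(u, ts) {
  dat <- read_memento(u, as.POSIXct(ts), "raw")
  out <- sprintf("%s-%s", format(as.POSIXct(ts), "%Y%m%d"), basename(u))
  writeBin(dat, out)
})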
Hi @hrbrmstr, sorry, yesterday I was super busy with the kiddos. I will try this tonight and let you know. And please, how can I claim to have contributed to the package when you did all the work??? It is simply a pleasure to share ideas and see them implemented so fast!
Thanks!
@hrbrmstr the cdx_basic_query() function looks pretty smooth. However, I wonder why searching directly on the archive website returns ~100k results, while using the API only returns 10k:
cdx <- cdx_basic_query("https://imdb.com/", "prefix")
cdx
# A tibble: 10,000 x 7
urlkey timestamp original mimetype statuscode digest length
<chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
1 com,imdb)/ 1996-11-19 00:00:00 http://imdb.com:80/ text/ht~ 200 XLXNEHRIAG~ 1725
2 com,imdb)/%23 2006-05-30 00:00:00 http://www.imdb.com:80/%23 text/ht~ 200 CUH3KMB2GO~ 837
3 com,imdb)/%23imdb2.consumer.ho~ 2009-08-19 00:00:00 http://www.imdb.com:80/%23imdb2.co~ text/ht~ 200 AYZ5SY67IR~ 688
4 com,imdb)/%23imdb2.consumer.ho~ 2009-03-11 00:00:00 http://www.imdb.com:80/%23imdb2.co~ text/ht~ 200 AYZ5SY67IR~ 673
Could we have an option to specify that we want everything? Once it is downloaded locally, it will be very easy to parse out the correct links.
What do you think?
yep, just set the limit parameter to something higher than the 10K it defaults to ;-)
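i.e. something along these lines (a sketch; how high the endpoint will actually let you go is another question):
library(wayback)
# bump the limit past the 10K default mentioned above
cdx <- cdx_basic_query("https://imdb.com/", "prefix", limit = 100000L)
nrow(cdx)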
haha nice, thanks, I overlooked that default!