Implement CDX search based on newer `timemap` CDX API
From a conversation on the Internet Archive’s Research Slack today:
kenji: Igor, http://spacex.com/robots.txt has `Disallow: /includes/` and http://web.archive.org/cdx/search still honors robots.txt exclusion (because it’s served by older wayback machine), while playback ignores robots.txt (served by new wayback machine). http://web.archive.org/web/timemap/cdx?url=www.spacex.com&matchType=domain&gzip=false&filter=statuscode:200&to=20041229235959 will give you more results, including those under `/include/` path. `/web/timemap/cdx` is served by new wayback. I’m sorry for the confusing, inconsistent results - we’re trying to migrate all services to new wayback
kenji: oh btw, a tip: `to=2004` will be interpreted as `20041231235959` (if you’re not excluding day 30 and 31 on purpose :smile:)
Igor: kenji, Thank you!
mr0grog: Oh, I did not know about `/web/timemap/cdx` as opposed to just `/cdx/search/cdx`. Should I be using the former instead of the latter?
kenji: `/web/timemap/cdx` is better functionality-wise, but it’s slower than `/cdx/search`. So I’d suggest `/cdx/search` as long as it works ok for your purpose.
mr0grog: ah, ok. Will need to consider which is the right path. Is there anything that documents the functional differences? e.g. the robots.txt issue would be a hard one to discover
mr0grog: Do you have a rough sense of how much slower `/web/timemap/cdx` is?
kenji: I don’t have good benchmark result (it’s nice to have), but I find `/web/timemap/cdx` 10-20% slower for `matchType=exact` query. `matchType=domain` can be much slower.
We need to look into whether we should switch to /web/timemap/cdx.
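To make the comparison concrete, here is a minimal sketch of building the same query against both endpoints. The endpoint URLs and query parameters come from the conversation above; the helper name `cdx_query_url` is made up for illustration and is not part of any existing package.

```python
from urllib.parse import urlencode

# Endpoint URLs as given in the conversation above.
OLD_CDX = "http://web.archive.org/cdx/search/cdx"
NEW_CDX = "http://web.archive.org/web/timemap/cdx"


def cdx_query_url(endpoint, url, **params):
    """Build a CDX query URL (hypothetical helper, not an existing API).

    `params` are passed through unchanged as query parameters.
    """
    query = {"url": url, **params}
    return f"{endpoint}?{urlencode(query)}"


# Same parameters as the example query in the conversation, but using the
# shorter `to=2004` form from kenji's tip.
params = {
    "matchType": "domain",
    "gzip": "false",
    "filter": "statuscode:200",
    "to": "2004",
}
old_url = cdx_query_url(OLD_CDX, "www.spacex.com", **params)
new_url = cdx_query_url(NEW_CDX, "www.spacex.com", **params)
```

Only the path differs between the two URLs, which is why switching looks cheap at first glance; the behavioral differences (robots.txt handling, speed, paging) are all on the server side.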
Other notes I have discovered in edgi-govdata-archiving/web-monitoring-processing#174: this new API doesn’t support resumeKey; you have to use page and pageSize for iterating through results (which is not as straightforward as you might think).
Update: since the above conversation happened, Wayback folks have started gently pushing us to more actively use the newer services, like timemap CDX and SPN2. So I think the answer to this issue is probably “yes we should” now.
Since this is still beta-ish, we should probably implement this alongside the old /cdx/search API.
I’ve been holding off on this since @danielballan is in the middle of splitting off this code into http://github.com/edgi-govdata-archiving/wayback. It should be done, but in that new repo whenever it’s ready.
Note to selves: once this is closed, it might be kind to state in the release notes how to migrate wayback v0.1 code to whatever API we settle on for timemap, if doing so is not too much trouble.
FWIW, I think the API (from a user of this package’s perspective) would be the same. The Timemap CDX API (which, to be clear, is not the timemap API, which is a whole other thing!):
- Returns data in the same format as the CDX API, but has some extra fields on the end that aren’t generally useful unless you have access to internal archive.org services (supposedly these will be removed from the public API at some point).
- Does paging differently, but we don’t expose access to the paging in our Python API anyway, so this should mostly be an implementation detail that is largely invisible to a user. (In the current CDX API, you can paginate via `resumeKey` or via actual page size & number, but the latter will not give you recent data. In the new Timemap CDX API, there is no `resumeKey` and you must use page size & number, but it should include up-to-date data.)
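A rough sketch of how the two paging styles differ internally. Everything here is illustrative: `fetch` is a hypothetical callable that takes a dict of query parameters and returns the response body as text, the resume-key layout (records, a blank line, then the key on the last line) is my understanding of `showResumeKey=true` output, and zero-based page numbers plus a known `num_pages` are assumptions rather than confirmed behavior of the new API.

```python
def split_resume_key(body):
    """Split a response body into (records, resume_key).

    Assumes the showResumeKey=true layout: records, then a blank line,
    then the key on the final line. The key is None when absent.
    """
    lines = body.rstrip("\n").split("\n")
    if len(lines) >= 2 and lines[-2] == "":
        return lines[:-2], lines[-1]
    return [line for line in lines if line], None


def iter_with_resume_key(fetch, params):
    """Old-style paging: repeat the query, passing back each resumeKey."""
    params = {**params, "showResumeKey": "true"}
    while True:
        records, key = split_resume_key(fetch(params))
        yield from records
        if key is None:
            return
        params = {**params, "resumeKey": key}


def iter_with_pages(fetch, params, num_pages):
    """New-style paging: request each page by number.

    Assumes pages are numbered from 0 and that `num_pages` is already
    known. A page may legitimately be empty, so an empty page is not
    treated as the end of the results.
    """
    for page in range(num_pages):
        body = fetch({**params, "page": str(page)})
        yield from (line for line in body.split("\n") if line)
```

Either generator yields raw CDX lines, so the difference stays hidden behind whatever public method wraps them.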
Ah, I was conflating the Timemap CDX API with the timemap API. I have half-absorbed the fact that they are different things, but I got confused here. Which one did wayback v0.1 implement?
Wayback v0.1 implemented the Timemap API (not Timemap CDX, which isn’t really its name, but it doesn’t have one, and ¯\_(ツ)_/¯).
If helpful (since Wayback APIs are a half-documented, scattered situation):
The CDX API, which lets you search through a CDX-based index (and returns a subset of fields from each matching CDX record), is at http://web.archive.org/cdx/search/cdx
The “Timemap CDX” API is the same thing, but uses different code and (I think?) a separate CDX index, and is at http://web.archive.org/web/timemap/cdx
(I call it “Timemap CDX” because of the URL. I have also heard “new CDX,” “beta CDX,” “CDX v2,” etc.)
The Timemap API is part of the Memento protocol (guide, RFC, Wayback-specific “docs”) which is a semi-standard agreed to by lots of archives. It doesn’t allow searching (it just lists mementos for a given URL), and lists results in HTTP Link header format at http://web.archive.org/web/timemap/link/<url>, e.g. http://web.archive.org/web/timemap/link/https://www.epa.gov/
(There is supposed to be an official JSON format, but I don’t know how to get it from Wayback. http://web.archive.org/web/timemap/json/<url> returns timemap data in CDX-json format, which is 🤷‍♀️)
I kind of feel like Timemap may be redundant when you have CDX available (since you can always search CDX for an “exact” [really SURT, not exact] URL match). But it’s possible timemap may be more optimized.
Also, best documentation link I know of is here: https://archive.readme.io/docs
It’s mostly links to other docs, but at least it gets most of the APIs listed. (Not sure how much it’s kept up-to-date, though. 🙁)
Some updates here from recent conversations:
- The old CDX search (`/cdx/search/cdx`) has some real funky issues around `limit` and `showResumeKey` that were major drivers for this new CDX search (`/web/timemap/cdx`). (See #65)
- The new search supports `limit`, but not `showResumeKey`, and doesn’t do weird stuff with `limit`.
- The new search only paginates with `page` + `pageSize` (which are still about blocks; `size` is not referring to a number of results), and is reliable, and includes all the indexes (so it’s up-to-date).
- BUT if you use a non-exact search (i.e. `matchType=prefix|host|domain` or you use an `*` in the URL), it does not include the index for recent SavePageNow captures. It takes roughly 3 days for things in that index to make it into other indexes that do support those queries. So there are still caveats here, but they are simpler to explain and are actually pretty predictable (the out-of-date issue is only a few days, not a few months).
- archive.org is doing a slow transition to the new search, using it for some things under the hood to test it out.
- Eventually (no concrete timeline yet) the old search will be replaced with the new one.
- The new search includes extra fields (length, offset, WARC filename) that they expect to remove when replacing the old search, so we should not expect them to always be present.
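Since the trailing fields may disappear, one option is to parse by position for the core fields only and treat anything after them as optional. A minimal sketch, assuming space-separated lines and the usual CDX field order (`urlkey`, `timestamp`, `original`, `mimetype`, `statuscode`, `digest`); the `CdxRecord` name and the sample line are made up for illustration.

```python
from collections import namedtuple

# Core field names, assumed to match the default CDX output order.
CORE_FIELDS = (
    "urlkey", "timestamp", "original", "mimetype", "statuscode", "digest",
)
CdxRecord = namedtuple("CdxRecord", CORE_FIELDS + ("extra",))


def parse_cdx_line(line):
    """Parse one space-separated CDX line.

    Anything after the core fields (length, offset, WARC filename, ...)
    lands in `extra`, so the parser keeps working if those are removed.
    """
    parts = line.split(" ")
    if len(parts) < len(CORE_FIELDS):
        raise ValueError(f"Too few fields in CDX line: {line!r}")
    core = parts[:len(CORE_FIELDS)]
    return CdxRecord(*core, extra=tuple(parts[len(CORE_FIELDS):]))


# Made-up sample line, not real archive data.
sample = (
    "gov,epa)/ 20200101000000 https://www.epa.gov/ text/html 200 "
    "FAKEDIGEST 1234 5678 example.warc.gz"
)
record = parse_cdx_line(sample)
```

With this shape, `record.extra` is simply an empty tuple for responses that carry only the core fields.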
So I think we probably need to ultimately have 3 methods for CDX search (these names are strawman proposals, they probably aren’t great):
- `search_v1()` uses `/cdx/search/cdx` and paginates via `showResumeKey` (i.e. what is currently called `search()`).
- `search_v2()` uses `/web/timemap/cdx` and paginates via `page` + `pageSize` (i.e. the new search).
- `search()` just forwards to one of those implementations.
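A strawman sketch of that shape, to make the forwarding idea concrete. The class name, the `default` argument, and the injected `fetch` callable (taking a path and a dict of query parameters) are all hypothetical; real paging and parsing are left out.

```python
class CdxClientSketch:
    """Strawman layout for the three proposed methods (not a real API)."""

    def __init__(self, fetch, default="v1"):
        # `fetch(path, params)` is a hypothetical transport callable.
        self._fetch = fetch
        self._default = default

    def search_v1(self, url, **params):
        """Old search: /cdx/search/cdx, paginated via showResumeKey."""
        query = {"url": url, "showResumeKey": "true", **params}
        return self._fetch("/cdx/search/cdx", query)

    def search_v2(self, url, **params):
        """New search: /web/timemap/cdx, paginated via page + pageSize."""
        query = {"url": url, **params}
        return self._fetch("/web/timemap/cdx", query)

    def search(self, url, **params):
        """Forward to whichever implementation is the current default."""
        implementations = {"v1": self.search_v1, "v2": self.search_v2}
        return implementations[self._default](url, **params)
```

Keeping `search()` as a thin forwarder means the default can flip from the old endpoint to the new one later without changing calling code.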
I’m also thinking we might want to rename `search*()` methods to `listMementos()` or `listCaptures()` or something, since the Internet Archive has an actual free text search of wayback now (e.g. https://web.archive.org/web/*/environment which is powered by https://web.archive.org/__wb/search/anchor?q=<text>, but also some endpoints at https://be-api.us.archive.org/ia-pub-fts-api, /services/search/v1/scrape, and /advancedsearch.php; I don’t know enough about the differences or pros/cons of any of these).
That renaming might be out of scope here, though.
Circling back on the naming issue here, my current feeling is that the name should involve timemap rather than v2. The two have existed alongside each other for a long time now, and it’s no longer clear exactly what the migration or succession path is supposed to be (at one point I was told that the old CDX search at /cdx/search/cdx would call into the new implementation at /web/timemap/cdx under the hood, but trying the two confirms that they hit different backend servers and behave differently, and it’s been several years).