openwayback
The OpenWayback Development
JavaScript injection is used to display the existing OpenWayback header that allows navigation in time and provides information about the current resource. This only works for HTML pages and even...
We've just noticed a few `timeTrunc` errors in our crawl logs and the resulting `WARC-Truncated: time` headers in our WARC records. Does OpenWayback handle these specifically? At the moment it...
such as: - maintaining the form of URL (absolute, protocol-relative, server-relative, path-relative) - to make it less likely to break when manipulated by JavaScript. - changing protocol depending on `X-Forwarded-Proto`...
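To make the four URL forms named above concrete, here is a minimal sketch of how a reference could be classified before rewriting, so that a rewriter can emit the same form it received. The class name `UrlForm` and its `classify` method are hypothetical and are not part of OpenWayback; the check looks only at the leading characters of the reference and does not attempt full RFC 3986 parsing.

```java
public final class UrlForm {
    /** The four reference forms: absolute, protocol-relative, server-relative, path-relative. */
    public enum Form { ABSOLUTE, PROTOCOL_RELATIVE, SERVER_RELATIVE, PATH_RELATIVE }

    private UrlForm() {}

    /** Classifies a raw URL reference by its leading characters only (hypothetical helper). */
    public static Form classify(String ref) {
        // "//host/path" inherits the scheme of the enclosing page.
        if (ref.startsWith("//")) {
            return Form.PROTOCOL_RELATIVE;
        }
        // "/path" is resolved against the host of the enclosing page.
        if (ref.startsWith("/")) {
            return Form.SERVER_RELATIVE;
        }
        // A scheme is a letter followed by letters, digits, "+", "-" or ".", then ":".
        if (ref.matches("^[A-Za-z][A-Za-z0-9+.-]*:.*")) {
            return Form.ABSOLUTE;
        }
        // Everything else ("a/b.html", "../img.png") is resolved against the current path.
        return Form.PATH_RELATIVE;
    }
}
```

A rewriter built on this could branch on the returned `Form` and produce a replay URL of the same kind, which is the property the issue asks to maintain.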
Here is how we currently recommend that researchers reference a given web page in the Danish web archive, e.g. (URL + provenance): http://netarkivet.dk 197800-188-20140107085943-00000-sb-prod-har-005.statsbiblioteket.dk.warc/4773261 (9:01:06 jan 7, 2014 in UTC time)....
We're using the `FlexResourceStore` and `ZipNumBlockLoader` in our `CDXCollection.xml`. If the `numRetries` in `ZipNumBlockLoader` is set to anything > 1, then you can see errors like this in the log:...
The de facto standard for serving web archive indexes seems to be via CDX, not BDB. However, Wayback defaults to BDB out-of-the-box and our current [documentation](https://github.com/iipc/openwayback/wiki/How-to-configure) recommends commenting out/un-commenting the...
When crawling using Heritrix, if both `sendIfModifiedSince` and `writeRevisitForNotModified` are set to `true` (although the latter has been deprecated, presumably equivalent to always being `true`), a server may respond with...
Following on from the merging of #189, the [ResourceFactory](https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java) could use an overhaul. Currently the code is a series of `if` statements:

```
if(urlOrPath.startsWith("http://")) ... } else if(urlOrPath.startsWith("hdfs://") || urlOrPath.startsWith("s3://")...
```
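One possible direction for such an overhaul is to replace the chain of `startsWith` checks with a small registry of prefix handlers. The sketch below is only an illustration of that idea: `SchemeDispatcher`, `register`, and `open` are hypothetical names, and the generic type `T` stands in for whatever resource type the real factory returns. It does not reflect the actual `ResourceFactory` API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical prefix-based dispatcher; not part of OpenWayback. */
public final class SchemeDispatcher<T> {
    // LinkedHashMap keeps registration order, so earlier prefixes win on overlap.
    private final Map<String, Function<String, T>> handlers = new LinkedHashMap<>();
    private final Function<String, T> fallback;

    /** The fallback handles anything without a registered prefix, e.g. local file paths. */
    public SchemeDispatcher(Function<String, T> fallback) {
        this.fallback = fallback;
    }

    /** Registers a handler for every location that starts with the given prefix. */
    public SchemeDispatcher<T> register(String prefix, Function<String, T> handler) {
        handlers.put(prefix, handler);
        return this;
    }

    /** Dispatches to the first registered handler whose prefix matches. */
    public T open(String urlOrPath) {
        for (Map.Entry<String, Function<String, T>> e : handlers.entrySet()) {
            if (urlOrPath.startsWith(e.getKey())) {
                return e.getValue().apply(urlOrPath);
            }
        }
        return fallback.apply(urlOrPath);
    }
}
```

With this shape, supporting a new scheme means adding one `register` call rather than another `else if` branch, and the same handler can be registered under several prefixes (for example both `hdfs://` and `s3://`).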
Scale: # hits on host, aggregate and clustering results. **S'sheet line:** 2 **For whom?** BNF, BL, DN, IA **Notes:** New CDX server should enable this. **Est. Milestone:** 2.x.x
It would be interesting if the wayback machine provided ways to do visual diffs on webpages via [resemble.js](http://huddle.github.io/Resemble.js/) and perhaps [html diffs](https://github.com/christian-oudard/htmltreediff) as well.