web-monitoring-processing

Tools for accessing, "diff"-ing, and analyzing archived web pages

Results 9 web-monitoring-processing issues

Some pages have a `<link rel="canonical">` element in their markup, indicating a correct, “canonical” URL for the page (some more info here: https://en.wikipedia.org/wiki/Canonical_link_element). When importing data from the Wayback Machine, it...

enhancement
good-first-issue
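
For the canonical-URL issue above, here is a minimal sketch of how a canonical link could be read out of a page's markup using only the standard library. The names `CanonicalLinkParser` and `find_canonical_url` are hypothetical and not part of web-monitoring-processing; how the importer should actually use the result is left open, since the issue text is truncated.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element seen.

    Hypothetical helper for illustration; not part of the project.
    """

    def __init__(self):
        super().__init__()
        self.canonical_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first canonical link is kept.
        if tag != 'link' or self.canonical_url is not None:
            return
        attributes = dict(attrs)
        # `rel` may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attributes.get('rel') or '').lower().split()
        if 'canonical' in rel_tokens and attributes.get('href'):
            self.canonical_url = attributes['href'].strip()


def find_canonical_url(html: str) -> Optional[str]:
    """Return the page's canonical URL, or None if it does not declare one."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    parser.close()
    return parser.canonical_url
```

A relative `href` would still need to be resolved against the capture's URL (for example with `urllib.parse.urljoin`) before being compared with known page URLs.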

Long ago, we worked around an issue where we were getting lots of connection failures from Wayback with a dirty hack: if we ran out of retries but still had...

The Internet Archive import script(s) (`wm import ia` and `wm import ia-known-pages`) should have an option that causes them to upload Mementos to S3: ```sh $ wm import ia 'http://www.epa.gov/'...

never-stale

The import script has gotten pretty crazy and messy over time, and we could remove a lot of the complexity. Some is just because it’s taken us a while to...

never-stale

We will want to run tests against real lists of changes that were flagged for review. Some of the elements of these lists are already public because they are the...

never-stale

As a first test of all the things needed to automatically rate a change’s significance and priority, let’s start with something simple that looks for changes that we can pretty confidently...

enhancement
differs
never-stale

Some pages get captured a *lot* by the Internet Archive, and it’s not really necessary or valuable for us to import and track every one of those captures. Now you...
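
The issue text above is cut off, so the approach it proposes is not known. As one hypothetical way to avoid importing every capture, the sketch below keeps a capture only if it is at least some minimum interval after the previously kept one. The function name `thin_captures` and the one-hour default are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def thin_captures(timestamps: Iterable[datetime],
                  min_gap: timedelta = timedelta(hours=1)) -> List[datetime]:
    """Keep the earliest capture, then only captures at least `min_gap`
    after the most recently kept one.

    Hypothetical helper; the interval is an assumed value, not a project setting.
    """
    kept: List[datetime] = []
    for timestamp in sorted(timestamps):
        if not kept or timestamp - kept[-1] >= min_gap:
            kept.append(timestamp)
    return kept
```

The gap is measured from the last kept capture rather than the last seen one, so a steady stream of closely spaced captures still yields roughly one kept capture per interval.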

When importing new versions of HTML pages (either from Wayback’s Memento API or from WARCs), we look for the page’s `<title>` element or use the empty string: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/9c6a2cfed53c32e886ae16ce287878beffbf9622/web_monitoring/utils.py#L169-L180 There are...
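
The "element or empty string" fallback described above can be sketched as follows, assuming the element in question is `<title>`. This is a standard-library illustration, not the code at the linked lines of `web_monitoring/utils.py`; `TitleParser` and `extract_title_or_empty` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document.

    Hypothetical helper for illustration; not the project's implementation.
    """

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Ignore any <title> after the first one has been closed.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> str:
        # Collapse runs of whitespace (including newlines) to single spaces.
        return ' '.join(''.join(self._parts).split())


def extract_title_or_empty(html: str) -> str:
    """Return the page title, or the empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Returning `''` rather than `None` keeps the result a string in every case, which matches the fallback the issue describes.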

I really like how cloudpathlib gives us a [fairly transparent](https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/ef7881ae096a30bb69b32ea111d0f9e9dc1ea086/web_monitoring/cli/cli.py#L937-L949) way to interchangeably handle local and S3 paths for writing files. BUT there is a fancy new kid on...

enhancement
idea