ipwb Pretty listing of resources and title for HTML pages

Currently the bare URLs are listed for the archived resources (as shown in #177). We can make the listing better looking and more compact by:

Showing titles of HTML pages and hyperlinking corresponding mementos
For resources where title is not present or is not applicable, their file name can be extracted from the URL
As a fallback, URLs can be used if nothing else is feasible
Along with the title, if generated, thumbnails would also be a good way to present resources.
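
The fallback order above (title, then file name, then URL) could be sketched roughly as follows. This is only an illustration: `display_label` is a hypothetical helper, not an existing ipwb function, and it assumes the title (if any) has already been read from the index.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def display_label(url, title=None):
    """Pick the text to show for an archived resource (hypothetical helper)."""
    # 1. Prefer a non-empty title when one is available.
    if title and title.strip():
        return title.strip()
    # 2. Otherwise use the last path segment of the URL as a file name.
    filename = basename(urlsplit(url).path)
    if filename:
        return unquote(filename)
    # 3. Fall back to the bare URL when nothing else is feasible.
    return url
```

With this sketch, `http://matkelly.com/froggies/frog.png` would be listed as `frog.png`, while a URL ending in `/` has no file name and would still be shown in full.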

The title of each memento can be stored in the CDXJ file as an optional field. Title extraction would require HTML parsing at the time of indexing.
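
A minimal sketch of what title extraction at indexing time might look like, using only the standard library's `html.parser`. The names `TitleParser` and `extract_title` are made up for this example, and it assumes the HTML payload has already been decoded to text; the actual indexer may well use a different parser.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the raw title text, or None when the page has no <title>."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    if parser.done or parser.parts:
        return "".join(parser.parts)
    return None
```

The returned string is raw and may still contain surrounding whitespace or newlines, so it would need cleaning before being written to the index.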

Jun 06 '17 17:06 ibnesayeed

@ibnesayeed Do you believe title extraction should be the default functionality or only activated when a flag is passed to the indexing script?

My vote is the former: it makes the CDXJ more verbose, but also richer and more user-friendly when parsed and displayed. We may also offer the option to enrich CDXJ TimeMaps that do not have this information from within the replay interface, e.g.,

A sample CDXJ w/o title attributes

!context ["http://oduwsdl.github.io/contexts/cdxj"]
!meta {"created_at": "2017-07-14T09:02:23.458675", "generator": "InterPlanetary Wayback v.0.2017.07.10.1739"}
com,matkelly)/froggies/frog.png 20170301192639 {"locator": "urn:ipfs/QmUeko8zM7Xanwz6F9GtRH4rLAi4Poj3DMECGsci2BRQfs/QmPhMnX74cwqx2xgj9d3N3gTra8CzafXwSbUwU8xagMfqR", "mime_type": "image/png", "status_code": "200"}
com,matkelly)/robots.txt 20170301192639 {"locator": "urn:ipfs/Qmbk3Aju7u26Pzk356a43wY9eUCScAJiLPxhvwsMoVt7Pd/QmYNB85U2txRAAdLp6wvZSPvd8AQq8UcjZJ2azhv5h6NF7", "mime_type": "text/plain", "status_code": "200"}
edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 {"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", "mime_type": "text/html", "status_code": "200"}

... being passed to the replay system, then an "Enrich" button hit to change the CDXJ to:

A sample CDXJ with title attributes

!context ["http://oduwsdl.github.io/contexts/cdxj"]
!meta {"created_at": "2017-07-14T09:02:23.458675", "generator": "InterPlanetary Wayback v.0.2017.07.10.1739"}
com,matkelly)/froggies/frog.png 20170301192639 {"locator": "urn:ipfs/QmUeko8zM7Xanwz6F9GtRH4rLAi4Poj3DMECGsci2BRQfs/QmPhMnX74cwqx2xgj9d3N3gTra8CzafXwSbUwU8xagMfqR", "mime_type": "image/png", "status_code": "200"}
com,matkelly)/robots.txt 20170301192639 {"locator": "urn:ipfs/Qmbk3Aju7u26Pzk356a43wY9eUCScAJiLPxhvwsMoVt7Pd/QmYNB85U2txRAAdLp6wvZSPvd8AQq8UcjZJ2azhv5h6NF7", "mime_type": "text/plain", "status_code": "200"}
edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 {"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", "mime_type": "text/html", "status_code": "200", "title": "Lorem Ipsum"}

This will have ramifications of generating a different hash if the CDXJ is itself pushed into IPFS, a use case I anticipate for collaboration/sharing of a collection of captures. With the eventual IPNS integration and our indexless system (#61), the ramifications would be less severe.
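
To make the hash concern concrete, here is a rough sketch of enriching one CDXJ record and comparing digests before and after. `enrich_record` is a hypothetical helper, and plain SHA-256 is only a stand-in: IPFS derives its content identifiers differently, but any content-addressed identifier changes once the bytes change.

```python
import hashlib
import json


def enrich_record(cdxj_line, title):
    """Return the CDXJ record with a "title" field added to its JSON block."""
    # A record is "<SURT> <datetime> <JSON>"; split off the first two fields.
    surt, timestamp, payload = cdxj_line.split(" ", 2)
    fields = json.loads(payload)
    fields["title"] = title
    return f"{surt} {timestamp} {json.dumps(fields)}"


record = (
    'edu,odu,cs)/~mkelly/semester/2017_spring/remotefroggie.html 20170301192639 '
    '{"locator": "urn:ipfs/QmPdyY6Pm66iWtGpTc7PqK11hvsnYSKMVL57G69RiNjGcm/'
    'QmNZ6mKSSAXAmXEocQj5gT4y4kdcr5D2C173ubWJ6PSKEZ", '
    '"mime_type": "text/html", "status_code": "200"}'
)
enriched = enrich_record(record, "Lorem Ipsum")

# Stand-in digests (not real IPFS hashes) for the two versions of the record.
digest_before = hashlib.sha256(record.encode("utf-8")).hexdigest()
digest_after = hashlib.sha256(enriched.encode("utf-8")).hexdigest()
print(digest_before != digest_after)
```

Anyone holding the old hash of a shared CDXJ would therefore be pointing at the un-enriched version.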

Jul 14 '17 13:07 machawk1

Either one is fine for me. Indexing is a one-time job, so it is fine if title extraction takes a bit of extra time, but it could be skipped with a flag when a lot of data is to be indexed and more index annotations may follow later. Just make sure to sanitize the extracted title: strip any leading or trailing whitespace and convert newlines (if any) to spaces before storing it in the CDXJ.
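
Something along these lines would probably do for the sanitization. `sanitize_title` is just an illustrative name, and note that this sketch goes slightly further than described: it collapses every run of whitespace (tabs and repeated spaces included) into a single space, not only newlines.

```python
def sanitize_title(raw_title):
    """Trim the title and collapse newlines and other whitespace runs to single spaces."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so joining with " " normalizes it.
    return " ".join(raw_title.split())
```

Keeping the title on a single line matters because each CDXJ record occupies exactly one line.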

As long as we are not storing raw CDXJ files in IPFS, there is no harm in adding titles later. The newly proposed model can utilize IPLD for attaching such metadata.

Jul 14 '17 17:07 ibnesayeed

Now we extract titles from the HTML pages and store them in the index.

Aug 27 '18 15:08 ibnesayeed

@ibnesayeed Do you want to work on surfacing these values to replace the URI-R+datetime that is currently displayed? Also, thoughts on retaining the display of the URI-R to correspond with the title? Perhaps dimmed/gray, smaller, and adjacent to the title? I would like to continue to see the URI-R in some fashion without something like a hover.

Aug 27 '18 15:08 machawk1

On the landing page we want to surface just a handful of captures that meet certain criteria. For them we can make cards/chips that hold more information in a more appealing way. We can either use some minimal card formatting or go for MementoEmbed-style cards on the landing page for a few URI-Ms. /cc @shawnmjones.

Aug 27 '18 18:08 ibnesayeed

@ibnesayeed Any ideas on the heuristic we use for which mementos are displayed? Should we also give the user the option of an extended interface to see a comprehensive list?

There were times in developing WAIL that a list of all URI-Rs archived would have been handy from the replay system.

Aug 27 '18 18:08 machawk1

A comprehensive pretty listing with filtering and pagination, or raw CDXJ index downloading, should go in the admin interface.

Aug 27 '18 19:08 ibnesayeed

Any ideas on the heuristic we use for which mementos are displayed?

We do not use any heuristics and let the browser handle it if no content-type was recorded. Some web archives do have some logic in place to predict content-type when missing, but their accuracy is not perfect.

Aug 27 '18 19:08 ibnesayeed

@ibnesayeed "Any ideas on the heuristic we use for which mementos are displayed?" was not asking about content type but rather: if we have 100 mementos, which are displayed, even if they are all HTML? Random? Newest? Largest? Let the user decide? If so, what's the default?

Aug 27 '18 19:08 machawk1

The item must be an HTML page. From there we can either go for k number of random items, newest items, most archived items, or all of these under different sections.
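
As a rough sketch of those options, assuming each index record has already been parsed into a dict with hypothetical `surt`, `datetime`, and `mime_type` keys (this is not how ipwb represents records internally):

```python
import random
from collections import Counter


def html_only(records):
    """Keep only captures recorded as HTML pages."""
    return [r for r in records if r["mime_type"] == "text/html"]


def newest(records, k):
    """The k most recent HTML captures (14-digit datetimes sort as strings)."""
    return sorted(html_only(records), key=lambda r: r["datetime"], reverse=True)[:k]


def random_sample(records, k, seed=None):
    """Up to k HTML captures chosen at random."""
    pages = html_only(records)
    return random.Random(seed).sample(pages, min(k, len(pages)))


def most_archived(records, k):
    """SURT keys of the k HTML pages with the most captures."""
    counts = Counter(r["surt"] for r in html_only(records))
    return [surt for surt, _ in counts.most_common(k)]


records = [
    {"surt": "com,example)/a.html", "datetime": "20170301192639", "mime_type": "text/html"},
    {"surt": "com,example)/a.html", "datetime": "20180101000000", "mime_type": "text/html"},
    {"surt": "com,example)/b.html", "datetime": "20170501000000", "mime_type": "text/html"},
    {"surt": "com,example)/frog.png", "datetime": "20190101000000", "mime_type": "image/png"},
]
```

Each function could feed its own section of the landing page, with k as a small configurable number.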

Aug 27 '18 19:08 ibnesayeed