
Prototype webapp for browsing link structure

Open lintool opened this issue 11 years ago • 5 comments

Jinfeng: Once links are extracted from the ARC data, they should be in the form of (source, destination) pairs, where both are ids mapped using the Lucene FST. This can be stored in a table inside Derby.

(Remember, don't actually check the data into Git.)

We can embed Derby directly inside a Jetty webapp to avoid the need for a MySQL installation (the tables should be small enough to keep performance reasonable). Let's start with simple functionality, like being able to browse forward and backward links, e.g., given a page (URL or id), list both incoming and outgoing edges.
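A minimal in-memory sketch of the lookup the Derby-backed webapp would serve, assuming the table stores (source, destination) id pairs; the class and method names here are illustrative, not existing warcbase code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Derby links table: given a page id,
// list incoming and outgoing edges.
public class LinkBrowser {
    private final Map<Integer, List<Integer>> outgoing = new HashMap<>();
    private final Map<Integer, List<Integer>> incoming = new HashMap<>();

    // One (source, destination) row of the links table.
    public void addEdge(int source, int destination) {
        outgoing.computeIfAbsent(source, k -> new ArrayList<>()).add(destination);
        incoming.computeIfAbsent(destination, k -> new ArrayList<>()).add(source);
    }

    // Forward links: pages this id points to.
    public List<Integer> outlinks(int id) {
        return outgoing.getOrDefault(id, List.of());
    }

    // Backward links: pages pointing at this id.
    public List<Integer> inlinks(int id) {
        return incoming.getOrDefault(id, List.of());
    }
}
```

With Derby, `outlinks`/`inlinks` would become two indexed SELECTs over the pair table rather than map lookups.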

lintool avatar Mar 20 '14 02:03 lintool

Here's my revised thinking on this: we should probably just store the link structure into an HBase table:

row key = reversed URL (e.g., house.gov.www)
column family = "link"
column qualifier = timestamp of capture
value = list of ids, e.g., "353,546,503" (ids of outgoing links)
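A small sketch of the two encodings this schema implies. The exact row-key convention is an assumption here (simple reversal of the hostname labels; the real key may include path or port), and the class name is hypothetical:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class WebgraphKey {
    // Reverse the hostname labels for the row key,
    // e.g. "www.house.gov" -> "gov.house.www".
    // (Assumption: the actual reversal convention may differ.)
    public static String reverseHostname(String host) {
        List<String> parts = Arrays.asList(host.split("\\."));
        Collections.reverse(parts);
        return String.join(".", parts);
    }

    // Encode outgoing link ids as the comma-separated cell value,
    // e.g. "353,546,503".
    public static String encodeOutlinks(List<Integer> ids) {
        return ids.stream().map(String::valueOf).collect(Collectors.joining(","));
    }
}
```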

So we should have a program along these lines:

hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
  org.warcbase.analysis.graph.IngestWebgraph \
  -hdfs INPUT_PATH -output OUTPUT_HBASE_TABLE \
  -urlMapping sampleUrls.fst

We can probably modify ExtractLinks for this purpose, although instead of the reducer writing to HDFS we'd need it to write to an HBase table.

lintool avatar Aug 16 '14 23:08 lintool

What sort of queries will be made by the webapp when browsing this link structure table?

saintstack avatar Aug 18 '14 17:08 saintstack

Here's what I'm thinking the REST API will look like, modeled after the Wayback REST schema: http://host/table/URL, where the URL would be something like http://foo.bar.com/. The output would be JSON, something like:

[ { "capture": TIMESTAMP1, "outlinks": [253, 325, 573, ...] },
  { "capture": TIMESTAMP2, "outlinks": [253, 325, 573, ...] },
  ...
]
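A hypothetical sketch of assembling that JSON payload from (capture timestamp, outlink ids) rows; the field names follow the example above, everything else (class name, method signature) is assumed:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LinkJson {
    // captures: capture timestamp -> outgoing link ids, in capture order.
    // Emits one JSON object per capture, collected into a JSON array.
    public static String toJson(Map<Long, List<Integer>> captures) {
        return captures.entrySet().stream()
            .map(e -> String.format("{\"capture\": %d, \"outlinks\": %s}",
                    e.getKey(), e.getValue()))
            .collect(Collectors.joining(", ", "[", "]"));
    }
}
```

In the webapp this would sit behind the `http://host/table/URL` route, after the HBase scan over the row's "link" column family.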

lintool avatar Aug 18 '14 20:08 lintool

Will clients then have to make a new query against HBase to get the actual URLs, or will they have a bunch cached on their side?

saintstack avatar Aug 18 '14 20:08 saintstack

The id -> URL mapping is encoded in a Lucene FST, which is relatively compact. The client can load this into memory and perform the mapping itself. The idea (mostly in my head, not yet converted into open issues) is that there would be a "metadata" table holding collection-level metadata: the FST, number of docs, ingestion date, etc.
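A sketch of that client-side resolution step, with a plain map standing in for the Lucene FST (the real client would decode the FST fetched from the proposed "metadata" table); class and method names are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class UrlResolver {
    private final Map<Integer, String> idToUrl;

    // idToUrl: stand-in for the in-memory FST mapping.
    public UrlResolver(Map<Integer, String> idToUrl) {
        this.idToUrl = idToUrl;
    }

    // Resolve the outlink ids from a JSON response into URLs,
    // without another round trip to HBase.
    public List<String> resolve(List<Integer> outlinkIds) {
        return outlinkIds.stream()
            .map(id -> idToUrl.getOrDefault(id, "unknown:" + id))
            .collect(Collectors.toList());
    }
}
```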

lintool avatar Aug 18 '14 22:08 lintool