
[feature request] Way to list all URLs in pywb

Open andrew-d opened this issue 8 years ago • 12 comments

Hello there,

I'd like to make a feature request - a way to list all URLs in pywb. Web Archive Player has a similar feature, and it would be nice to have a version in this project, for use when I don't want to (or am unable to) run a desktop app.

Thanks, --Andrew

andrew-d avatar Jan 06 '16 05:01 andrew-d

Do you mean list all URLs or list all 'pages'? WebArchivePlayer lists what it detects as pages (usually HTML), but the detection is not perfect. It is also possible to search for URLs by prefix or host/domain, but there is no mechanism for listing all of them, though it might be possible to add one (with some sort of pagination support, as the list could be very large).
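As a rough illustration of the prefix and host/domain search mentioned above, here is a minimal sketch that only builds a CDX query URL. The `coll-cdx` endpoint is taken from the example URL later in this thread; the `matchType`, `output`, `limit` and `filter` parameter names are written as I understand the pywb CDX server API, and the helper name `build_cdx_query` is hypothetical, so check everything against the documentation for your pywb version.

```python
from urllib.parse import urlencode


def build_cdx_query(endpoint, url, match_type="prefix", limit=None, filters=()):
    """Build a CDX query URL (hypothetical helper, not part of pywb).

    match_type is assumed to accept values such as "exact", "prefix",
    "host" or "domain"; verify against your pywb version.
    """
    params = [("url", url), ("matchType", match_type), ("output", "json")]
    if limit is not None:
        params.append(("limit", str(limit)))
    # A filter may be given several times, once per condition.
    for condition in filters:
        params.append(("filter", condition))
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local endpoint, matching the example URL in this thread.
    print(build_cdx_query(
        "http://localhost:8080/coll-cdx",
        "example.com/",
        match_type="domain",
        limit=10,
        filters=["mime:text/html"],
    ))
```

This still needs a starting URL, host, or domain, which is exactly the limitation being discussed: there is no query that means "everything in the collection".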

ikreymer avatar Mar 09 '16 15:03 ikreymer

Perhaps this option would be applicable when running pywb.

http://localhost:8080/coll-cdx?urls=all&page=1&filter=mime:text/html&limit=10

Currently pywb requires only that the url parameter be set, so a "give me all URLs" option could possibly be added as another "optional" required parameter.
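To make the proposal concrete, here is a minimal sketch of how such a request could be interpreted. The `urls=all` and `page` parameters are only the suggestion made in this comment, not something pywb supports; the function names, the 1-based page numbering and the default values are all assumptions for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def parse_listing_request(request_url):
    """Read the proposed (hypothetical) listing parameters from a URL."""
    query = parse_qs(urlsplit(request_url).query)
    return {
        "list_all": query.get("urls", [""])[0] == "all",
        "page": int(query.get("page", ["1"])[0]),    # assumed 1-based
        "limit": int(query.get("limit", ["10"])[0]),  # assumed default of 10
        "filters": query.get("filter", []),
    }


def paginate(items, page, limit):
    """Return the slice of items for a 1-based page number."""
    start = (page - 1) * limit
    return items[start:start + limit]


if __name__ == "__main__":
    request = parse_listing_request(
        "http://localhost:8080/coll-cdx"
        "?urls=all&page=1&filter=mime:text/html&limit=10"
    )
    # Stand-in data; a real listing would come from the collection index.
    all_urls = ["http://example.com/page%d" % i for i in range(25)]
    print(request)
    print(paginate(all_urls, request["page"], request["limit"]))
```

Slicing an in-memory list is only for illustration; for a large index the paging would have to happen while reading the index rather than after loading it.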

N0taN3rd avatar Sep 07 '16 00:09 N0taN3rd

How come an index of archived URLs is available on the webrecorder.io frontend (screenshot: 2018-09-29-115237_1600x900_scrot) but not on pywb directly (screenshot: 2018-09-29-115816_1600x900_scrot)?

There's no list of all archived pages on pywb, but when one searches for a particular page, one gets a detailed result (screenshot: 2018-09-29-115708_1600x900_scrot). Can this UI be used to show all archived pages instead of just one?

?urls=all doesn't seem to work

Serkan-devel avatar Sep 29 '18 10:09 Serkan-devel

@Serkan-devel that level of curatorial specificity is currently a Webrecorder-only feature.

Although replay via pywb is collection-centric, pywb currently only provides the facilities to manage collections of web archives (create, add to, and index) and then replay the contents of a collection.

That is to say pywb is primarily concerned with the replay side of collections.

However, we have been thinking about how to provide some kind of page-level specificity in pywb, but that feature requires heuristic evaluation of a collection's index. If you or anyone else would like to attempt to implement this feature, we would be open to it :smiley:

N0taN3rd avatar Sep 30 '18 01:09 N0taN3rd

But I'm not that good at Python, and I'm afraid to commit directly because someone is watching my steps on GitHub who might ambush me, but I really need this feature.

Can anyone link me to the exact files responsible for showing indexes in the pywb web UI, and to the indexing scripts on Webrecorder, as a reference?

Serkan-devel avatar Nov 13 '18 16:11 Serkan-devel

Do all searches run through this python script?

I'm afraid to fork this project publicly on github. But could I send patches for better url-querying by email if I do succeed?

Serkan-devel avatar Nov 20 '18 17:11 Serkan-devel

To be clear, you are looking for a list of pages in the same way as they are listed in Webrecorder?

Are you using WARCs created in Webrecorder specifically or any web archive in general?

We would like to support this in the future, but there are a few issues to resolve as to how best to do this in pywb.

ikreymer avatar Nov 20 '18 20:11 ikreymer

Yes, I'd like to list URLs even if I haven't entered them completely, like on Webrecorder. Listing all pages at the same time would be great too.

I do have WARCs both created within a local Webrecorder instance and recorded directly with pywb.

While I'm getting closer to understanding how CDX works, what are the issues blocking the implementation of better listing?

Serkan-devel avatar Nov 20 '18 20:11 Serkan-devel

The primary issue is that we need to come up with a standard way for the search to be done. Are we persisting this list or computing it every time? (Computing it blind requires heuristics that are not always correct.)

Can this functionality from Webrecorder be used by pywb and/or live in pywb? Do we add a CDX query filter for this?

These are just a few of the questions we have about how to do this in pywb.

If you would like to contribute to this effort, pywb/warcserver/index would be a good place to start, along with following how pywb/apps/frontendapp interacts with the warcserver.
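For anyone exploring the "computing it every time" option, here is a minimal sketch of the kind of heuristic involved. It is not pywb or Webrecorder code. It assumes CDXJ index lines of the form `urlkey timestamp {json}` whose JSON block carries `url`, `mime` and `status` fields (with `status` stored as a string), and it treats any `text/html` capture with status 200 as a probable page. The function names are hypothetical.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into its url key, timestamp and JSON fields."""
    urlkey, timestamp, payload = line.strip().split(" ", 2)
    record = json.loads(payload)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record


def list_probable_pages(lines):
    """Return one URL per url key that looks like an HTML page.

    Heuristic only: embedded frames, error pages served with status 200,
    and mislabelled MIME types will all be classified incorrectly.
    """
    seen = set()
    pages = []
    for line in lines:
        if not line.strip():
            continue
        record = parse_cdxj_line(line)
        if record.get("mime") != "text/html" or record.get("status") != "200":
            continue
        if record["urlkey"] in seen:
            continue  # keep only the first capture of each URL
        seen.add(record["urlkey"])
        pages.append(record["url"])
    return pages


if __name__ == "__main__":
    # Made-up sample index lines for illustration.
    sample = [
        'com,example)/ 20160106050100 {"url": "http://example.com/", "mime": "text/html", "status": "200"}',
        'com,example)/ 20160309150300 {"url": "http://example.com/", "mime": "text/html", "status": "200"}',
        'com,example)/logo.png 20160106050101 {"url": "http://example.com/logo.png", "mime": "image/png", "status": "200"}',
    ]
    print(list_probable_pages(sample))
```

Even this small version shows why the result is only approximate, which is the reason persisting a curated page list is the other option on the table.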

N0taN3rd avatar Nov 21 '18 03:11 N0taN3rd

I think I'm unable to do it.

But could anyone open an issue here about managing cookies in archives? It doesn't seem to be documented here, and I don't want another GitHub issue to show up on my timeline. This might be useful when webpages require being logged in.

Serkan-devel avatar Dec 01 '18 17:12 Serkan-devel

Any more thoughts on this? It's a feature I could really use.

muramasatheninja avatar Jun 09 '21 06:06 muramasatheninja

+1

Jackster avatar Jan 22 '23 22:01 Jackster