
[Feature]: URL List: Output WACZ Files for Each URL Crawled

Open Shrinks99 opened this issue 1 year ago • 12 comments

Context

When archiving pages with a seeded crawl workflow we split the WACZ files in 10GB increments. While the UX of this could likely be improved, it is mostly okay as long as a user downloads all of the parts of the archived item.

With seeded crawls, users typically see the entire archived item as one object. With URL lists the intentions are either:

  1. Archive this website according to these exact links (archived item object expectations = file object expectations)
  2. Archive this list of assorted URLs at the same time (archive item object expectations ≠ file object expectations)

We want to continue to support WACZ as a portable format: files should ideally be self-contained for delivery and organization. Whereas all the files for a seeded crawl are required for replay, URL list crawl output files should be self-contained if a future "Single file per-URL" option is selected.

What change would you like to see?

As a user, I want more granular control over my URL list crawl outputs.

Questions

  • Is this just how URL list crawls should always output files or should this be a user-defined option?

Requirements

No response

Todo

No response

Shrinks99 avatar Nov 07 '23 03:11 Shrinks99

I am reminded of my own educational example about crawling search queries with URL list crawls! For situations like this, I can see a single-file output being preferable for social media type sites, as opposed to more generalized search queries on a site like Google, which might benefit from single-page-per-WACZ grouping for export.

Seems like it should be an option?

Shrinks99 avatar Nov 07 '23 06:11 Shrinks99

I was just looking at this and would also love the ability to split per URL. An option does look like the best way to go; I'd love to contribute if it would be helpful.

fservida avatar Nov 07 '23 16:11 fservida

@fservida Would you mind expanding on your use case a little? We have some in mind (as mentioned above) as well as a few specific clients that need this, but would like to know more about what you're looking for :)

Shrinks99 avatar Nov 07 '23 17:11 Shrinks99

Mainly batch collections of standalone URLs. Taking an OSINV example: I might have worked a day logging pages of interest, let's say 100, and want to archive them all, but also retain the ability to manage and file/forward every single page separately to other investigators in the future, depending on usage. Currently that would require creating 100 workflow tasks, while a single one creating 100 WACZ files would be much more practical.

I have a couple of other ideas related more to using browsertrix-crawler as an OSINV collection platform that might be filed as feature suggestions/pull requests, but I think this specific one would be quite helpful for this kind of batched workflow.

fservida avatar Nov 07 '23 20:11 fservida

Alternatively, or improving on the above: given this kind of batch collection, another idea could be to still have a single WACZ per workflow, but give the user the option to batch-create workflows. Given a set of X URLs and collection parameters, it would create X workflows with the same parameters, allowing for targeted reruns in case of failures. Right now I'm thinking of implementing something akin to this in the frontend by simply calling the API multiple times, as sketched below.
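
Whether done from the frontend or a quick script, the idea is just to create one workflow per URL. A rough Python sketch; the endpoint path, payload fields, and auth handling are assumptions on my part for illustration, not the documented Browsertrix API:

```python
import requests

# All of these values are placeholders for illustration.
API_BASE = "https://browsertrix.example.org/api"  # hypothetical deployment
ORG_ID = "my-org-id"                              # hypothetical org id
TOKEN = "..."                                     # API access token

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    # ... one entry per page of interest
]

headers = {"Authorization": f"Bearer {TOKEN}"}

for url in urls:
    # One page-scoped workflow per URL, all sharing the same parameters,
    # so each capture can be managed and re-run independently.
    payload = {
        "name": f"Capture of {url}",
        "config": {
            "seeds": [{"url": url}],
            "scopeType": "page",
        },
    }
    # Assumed workflow-creation endpoint; adjust to the real API.
    resp = requests.post(
        f"{API_BASE}/orgs/{ORG_ID}/crawlconfigs/",
        json=payload,
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Created workflow for {url}")
```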

fservida avatar Nov 07 '23 20:11 fservida

Sounds like this would work quite well for your above use case, and implementing this would indeed be a step forward for re-crawling in the future along with some other QA features. With targeted re-runs specifically, I have some further-out ideas for how to integrate that into a single workflow, so I wouldn't focus on this too much just yet!

We've thought a little bit about batch workflow creation but are trying to avoid it for now for the reasons you specify.

I have a couple of other ideas related more to using browsertrix-crawler as an OSINV collection platform that might be filed as feature suggestions/pull requests

Please file feature suggestions first! We're definitely open to PRs and outside contribution, but it's good to make sure we're on the same page to ensure your changes get accepted & to reduce duplicate work! :)

Shrinks99 avatar Nov 07 '23 21:11 Shrinks99

@Shrinks99 I'm a bit unfamiliar with how URL lists work & how they are related to viewing the files as a whole.

I would like to be able to specify a single archive (even if it is large) for the sake of being able to register and reference a single archive of a video file

walkerlj0 avatar Nov 13 '23 23:11 walkerlj0

@walkerlj0 Currently URL Lists and Seeded Crawls both save WACZ files in 10GB increments. This has been flagged as something that is blocking Starling, as the org requires each URL saved to be stored in a discrete WACZ file.

For non-Starling org folks who might need context: they currently work around this by using a Python script to spin up an individual crawl workflow for each URL, ensuring each one is saved to its own file. A brute force but effective solution.

With this proposed config option, those URLs could be specified in our UI instead of the Python script and run within a single crawl workflow. If the option is enabled, each URL (and its resources) crawled with that workflow would be saved to a discrete WACZ file (see the sketch below).
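
To illustrate the difference in workflow configuration, a quick sketch; the field names below, especially the proposed per-URL toggle, are placeholders rather than settled API:

```python
# Placeholder list of pages to capture.
urls = ["https://example.com/page-1", "https://example.com/page-2"]

# Today's workaround (simplified): one page-scoped workflow per URL,
# so each capture lands in its own WACZ file.
workaround_workflows = [
    {
        "name": f"Capture of {url}",
        "config": {"seeds": [{"url": url}], "scopeType": "page"},
    }
    for url in urls
]

# With the proposed option: a single URL List workflow that writes one
# WACZ file per URL crawled. "oneWaczPerUrl" is a hypothetical name for
# the setting discussed in this issue, not an existing config field.
proposed_workflow = {
    "name": "Batch capture",
    "config": {
        "seeds": [{"url": url} for url in urls],
        "scopeType": "page",
        "oneWaczPerUrl": True,
    },
}
```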

Shrinks99 avatar Nov 13 '23 23:11 Shrinks99

Can you describe it in terms of current functionality vs. new functionality, not using vocabulary only used with Browsertrix? I'm also not sure what you mean by "file object expectations". I'm also not clear on what a seeded list is (and how it is different from a list of URLs). Does seeded mean you pick a starting point and crawl a certain depth?

@Shrinks99

Example: Currently Browsertrix will create a web crawl of (a certain number? a parameter?) and split it into different WACZ files when the file size exceeds 10GB.

The current workaround, instead of crawling a site 1 link deep (which will randomly split into a different crawl at 10GB), is to make a list of all URLs on the site and force a WACZ crawl over 10GB.

  • This is just my best guess at interpreting what you described

walkerlj0 avatar Nov 14 '23 14:11 walkerlj0

Does seeded mean you pick a starting point and crawl a certain depth?

Yep! You may not run across this much as (AFAIK) you start all your crawling programmatically right now and it doesn't seem to fit Starling's use case. This feature is how most other folks use the app!

I'm also not clear on what a seeded list is (and how it is different than a list of URLs)

Seeded Crawl and URL List are types of crawl workflows.

I'm also not sure what you mean by 'file object expectations"

Probably could have worded this better. My intention was to describe how users think of the things they capture. Starling clearly thinks of each URL as a captured authenticated "object" or "record" whereas other digital preservationists may consider a whole website to be the "object" they are archiving. We want to support both use cases / approaches.

TL;DR

Current functionality: When crawling with a URL List crawl workflow, all of the data is saved in a single WACZ file unless it exceeds 10GB in size, in which case it is split into 10GB increments as new data is collected.

With this feature: When crawling with a URL List crawl workflow, users will be able to tick on the proposed setting to save each URL specified in the List of URLs to capture to a discrete WACZ file.

@walkerlj0 This will allow you to start crawls in the UI by pasting the URLs you intend to capture into a new crawl workflow and turning on this setting. Each URL captured will be saved to its own WACZ file, giving you the same output you have now but within a single crawl workflow instead of 1 crawl workflow for each URL (the current solution).

Shrinks99 avatar Nov 14 '23 16:11 Shrinks99

Aha, @Shrinks99 much more clear - it allows us to split a single crawl up into individual records so we can archive them individually. Thanks!

walkerlj0 avatar Nov 15 '23 22:11 walkerlj0

As I've been working on our upcoming rework of the collection editor — our UI for picking and choosing what archived items are included in a collection — I'm wondering where the best place to expose per-URL visibility for collections would be... Especially considering that it would only work for URL list crawls with this setting toggled on.

Gut feeling is the archived item's files tab, affecting all collections the archived item is in... Though perhaps that's not the most flexible option?

EDIT: After discussion with Ilya, the going plan is to keep archived items self-contained with no selection of files, meaning archived items themselves will continue to be the unit that users curate. This is a bit of a change of plans compared to the above, but it means that a workflow with this setting toggled on would create multiple archived items instead of multiple files within one archived item. This allows the user to curate URLs crawled with URL lists within a collection just as they do now.

Shrinks99 avatar Nov 16 '23 06:11 Shrinks99