[Bug]: it is not possible to reference a wacz file back to where it comes from e.g. using the GUI crawl_id

Open tuehlarsen opened this issue 1 year ago • 6 comments

Browsertrix Version

v1.11.3-12f994b

What did you expect to happen? What happened instead?

When you download WACZ files using the API, you get filenames like "20230225142507561-manual-20230225141525-7c09730b-c08.wacz", but the filename has no reference back to where it comes from (e.g. the crawl_id), and it is not possible to search for the WACZ filename, or part of it, in the Browsertrix GUI.

Reproduction instructions

see above

Screenshots / Video

No response

Environment

No response

Additional details

No response

tuehlarsen avatar Aug 27 '24 13:08 tuehlarsen

Hi @tuehlarsen, the manual-20230225141525-7c09730b-c08 part of the WACZ filename should be the crawl id in Browsertrix! You can check the crawl id field in the crawl's Overview tab to verify. A timestamp is added to the beginning in order to provide unique filenames when there are multiple WACZ files per crawl.

It's true that there's currently no way to search in the Archived Items table by this crawl id - that's an oversight that we should likely fix!

tw4l avatar Aug 27 '24 13:08 tw4l

It's worth noting that the same crawl id is part of the naming convention for the WARC files within the WACZ as well, but the WARC filenames have additional prefixes such as the first seed URL that the WACZ files don't have (in part I think to keep filenames reasonably small for portability, but we could reassess that).

tw4l avatar Aug 27 '24 13:08 tw4l

OK, if you search for the crawl name "dmi.dk" in Browsertrix you find the crawl_id manual-20240718145854-611eb86b-1c5, but all the WACZ files are named like this: 20240718154355358-611eb86b-1c5-0.wacz, 20240718154755196-611eb86b-1c5-1.wacz. You need the full crawl_id in the filename to be able to use it, e.g. in the API. Where can I find the missing part, manual-20240718145854-, in the WACZ files?

tuehlarsen avatar Aug 27 '24 14:08 tuehlarsen

Ah @tuehlarsen, I forgot that this is actually configurable in the Helm chart! Which explains why what I was seeing on our dev server differed. In chart/values.yaml, take a look at the following setting:

# default template for generate wacz files
# supports following interpolated vars:
# @ts - current timestamp
# @hostname - full hostname
# @hostsuffix - last 14-characters of hostname
# @id - full crawl id
default_crawl_filename_template: "@ts-@hostsuffix.wacz"

The default only includes the timestamp, but you can use the @id variable to include the full crawl id in the filenames :)
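To make the effect of the template concrete, here is a minimal Python sketch of how the variables listed in the comment above could be expanded. This is not Browsertrix's actual implementation: the function name `render_wacz_filename` and the example hostname are hypothetical, chosen only so that the output matches the filenames quoted earlier in this thread.

```python
def render_wacz_filename(template: str, ts: str, hostname: str, crawl_id: str) -> str:
    """Expand the @-variables described in the values.yaml comment.

    Illustrative only; the real substitution logic lives in Browsertrix
    and may differ.
    """
    values = {
        "@hostsuffix": hostname[-14:],  # last 14 characters of the hostname
        "@hostname": hostname,          # full hostname
        "@ts": ts,                      # current timestamp
        "@id": crawl_id,                # full crawl id
    }
    for name, value in values.items():
        template = template.replace(name, value)
    return template


# Hypothetical crawler hostname ending in the suffix seen in the thread.
hostname = "crawl-manual-20240718145854-611eb86b-1c5-0"
crawl_id = "manual-20240718145854-611eb86b-1c5"
ts = "20240718154355358"

default_name = render_wacz_filename("@ts-@hostsuffix.wacz", ts, hostname, crawl_id)
with_id_name = render_wacz_filename("@id-@ts.wacz", ts, hostname, crawl_id)
print(default_name)
print(with_id_name)
```

Under these assumptions, a template that uses @hostsuffix yields only the tail of the crawl id, while a template that uses @id carries the full crawl id into the filename.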

tw4l avatar Aug 27 '24 15:08 tw4l

The crawl_id is always part of the included warc.gz filenames. If you split the filename on '-', it is always at positions 4-7. We can now figure out how to use the API calls with the crawl_id. The only thing missing now is a way to search for the crawl_id in the GUI. Example:
GUI crawl_id: manual-20240718145854-611eb86b-1c5
20240718154355358-611eb86b-1c5-0]$ ls archive
kb-dmi-dk-manual-20240718145854-611eb86b-1c5-20240718145909636-0.warc.gz
...
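The split-on-dash approach described here can be sketched in Python as follows. The helper name `crawl_id_from_warc_name` is hypothetical, and the fixed positions 4-7 only hold when the WARC prefix (here "kb-dmi-dk") has exactly three dash-separated parts, as in the listing above; a different prefix would shift the positions.

```python
def crawl_id_from_warc_name(warc_name: str) -> str:
    """Return dash-separated fields 4-7 (1-indexed) of a WARC filename.

    Assumes a three-part prefix before the crawl id, as in the example
    filename from this thread.
    """
    parts = warc_name.split("-")
    # 1-indexed positions 4..7 correspond to the slice [3:7].
    return "-".join(parts[3:7])


warc_name = "kb-dmi-dk-manual-20240718145854-611eb86b-1c5-20240718145909636-0.warc.gz"
extracted = crawl_id_from_warc_name(warc_name)
print(extracted)
```

The extracted value can then be compared with the crawl id shown in the crawl's Overview tab.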

tuehlarsen avatar Aug 27 '24 17:08 tuehlarsen

Perhaps we can just change the default to @id-@ts on prod, or maybe make this configurable.

ikreymer avatar Jan 30 '25 22:01 ikreymer