[Bug]: it is not possible to trace a WACZ file back to the crawl it comes from, e.g. using the crawl_id shown in the GUI
Browsertrix Version
v1.11.3-12f994b
What did you expect to happen? What happened instead?
When you download WACZ files using the API, you get filenames like "20230225142507561-manual-20230225141525-7c09730b-c08.wacz", but the filename has no reference back to the crawl it comes from (e.g. the crawl_id), and it is not possible to search for the WACZ filename, or part of it, in the Browsertrix GUI.
Reproduction instructions
see above
Screenshots / Video
No response
Environment
No response
Additional details
No response
Hi @tuehlarsen, the manual-20230225141525-7c09730b-c08 part of the WACZ filename should be the crawl id in Browsertrix! You can check the crawl id field in the crawl's Overview tab to verify. A timestamp is added to the beginning in order to provide unique filenames when there are multiple WACZ files per crawl.
It's true that there's currently no way to search in the Archived Items table by this crawl id - that's an oversight that we should likely fix!
It's worth noting that the same crawl id is part of the naming convention for the WARC files within the WACZ as well, but the WARC filenames have additional prefixes such as the first seed URL that the WACZ files don't have (in part I think to keep filenames reasonably small for portability, but we could reassess that).
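For scripting against downloads, here is a minimal Python sketch of splitting such a filename into its timestamp prefix and the remainder. It assumes the `<timestamp>-<crawl id>.wacz` shape of the example in this thread, with a 17-digit timestamp; `split_wacz_name` is a hypothetical helper, not part of Browsertrix, and the remainder is only the crawl id if the deployment's filename template actually puts it there.

```python
import re

# Assumed shape, based on the example filename in this thread:
#   <17-digit timestamp>-<rest>.wacz
WACZ_NAME = re.compile(r"^(?P<ts>\d{17})-(?P<rest>.+)\.wacz$")


def split_wacz_name(filename):
    """Return (timestamp, rest) for a WACZ filename, or None if it does not match."""
    match = WACZ_NAME.match(filename)
    if match is None:
        return None
    return match.group("ts"), match.group("rest")


print(split_wacz_name("20230225142507561-manual-20230225141525-7c09730b-c08.wacz"))
```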
Ok, if you search for the crawl name "dmi.dk" in Browsertrix, you find the crawl_id manual-20240718145854-611eb86b-1c5, but all the WACZ files are named like this: 20240718154355358-611eb86b-1c5-0.wacz, 20240718154755196-611eb86b-1c5-1.wacz. You need the full crawl_id in the filename in order to use it, e.g. in API calls. Where can I find the missing part, manual-20240718145854-, in the WACZ files?
Ah @tuehlarsen, I forgot that this is actually configurable in the Helm chart! Which explains why what I was seeing on our dev server differed. In chart/values.yaml, take a look at the following setting:
# default template for generate wacz files
# supports following interpolated vars:
# @ts - current timestamp
# @hostname - full hostname
# @hostsuffix - last 14-characters of hostname
# @id - full crawl id
default_crawl_filename_template: "@ts-@hostsuffix.wacz"
The default only includes the timestamp and the hostname suffix, but you can use the @id variable to include the full crawl id in the filenames :)
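To make the effect of the variables concrete, here is a rough Python sketch of how such a template could be expanded. This is not the actual Browsertrix implementation: `expand_template` is a hypothetical helper, and the hostname used in the example is an assumed value chosen so that its last 14 characters match the filenames reported above.

```python
import re

# Variables listed in the values.yaml comment: @ts, @hostname, @hostsuffix, @id
TEMPLATE_VAR = re.compile(r"@(hostsuffix|hostname|ts|id)")


def expand_template(template, ts, hostname, crawl_id):
    """Substitute the @-variables in a filename template (illustrative only)."""
    values = {
        "ts": ts,
        "hostname": hostname,
        "hostsuffix": hostname[-14:],  # last 14 characters of the hostname
        "id": crawl_id,
    }
    return TEMPLATE_VAR.sub(lambda m: values[m.group(1)], template)


# Assumed example values; the hostname in particular is a guess.
ts = "20240718154355358"
hostname = "crawl-manual-20240718145854-611eb86b-1c5-0"
crawl_id = "manual-20240718145854-611eb86b-1c5"

print(expand_template("@ts-@hostsuffix.wacz", ts, hostname, crawl_id))
print(expand_template("@ts-@id.wacz", ts, hostname, crawl_id))
```

With a template containing @id, the full crawl id ends up in every WACZ filename, so it can be passed straight to the API.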
The crawl_id is always part of the names of the included warc.gz files. If you split the filename on '-', it is always fields 4 to 7 (counting from 1). With that we can now work out how to use the API calls with the crawl_id. The only thing missing now is a way to search for the crawl_id in the GUI.
e.g.
GUI crawl_id: manual-20240718145854-611eb86b-1c5
20240718154355358-611eb86b-1c5-0]$ ls archive
kb-dmi-dk-manual-20240718145854-611eb86b-1c5-20240718145909636-0.warc.gz
...
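A small Python sketch of that extraction, in case it helps others. Both helpers are hypothetical. The positional version only works when the prefix before the crawl id has exactly three '-'-separated parts, as in kb-dmi-dk above; the pattern version assumes crawl ids look like the two seen in this thread (manual-, a 14-digit timestamp, then 8 and 3 hex characters), so other id prefixes would have to be added to the pattern.

```python
import re

# Assumed crawl id shape, taken from the ids quoted in this thread.
CRAWL_ID = re.compile(r"manual-\d{14}-[0-9a-f]{8}-[0-9a-f]{3}")


def crawl_id_by_position(warc_name):
    """Join fields 4 to 7 (1-based) of the '-'-separated filename."""
    return "-".join(warc_name.split("-")[3:7])


def crawl_id_by_pattern(warc_name):
    """Find the first substring that looks like a crawl id, or None."""
    match = CRAWL_ID.search(warc_name)
    return match.group(0) if match else None


name = "kb-dmi-dk-manual-20240718145854-611eb86b-1c5-20240718145909636-0.warc.gz"
print(crawl_id_by_position(name))
print(crawl_id_by_pattern(name))
```

The pattern version does not depend on how many '-' characters the prefix contains, which makes it the safer choice if seed URLs differ between crawls.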
Perhaps we can just change the default to @id-@ts on prod, or maybe make this configurable.