Tessa Walsh
Yay! A bit of spitballing: * I think some sort of timeline/clustering visualization of the last modified dates could be interesting (although it's dependent on FITS, so won't be available...
Another idea that would also take some backend work: it could be interesting to try to visualize the relationships between the original files and their preservation derivatives - comparing formats...
I totally get that! Sounds good :)
Hi Ashley! I did a bit more layout work on this and think it's ready to go live as-is if you're keen! Of course you're always welcome to open new...
Thank you! I appreciate that you spent some of your rare free time on this, and the door will always be open if you wanna do any more! I'm going...
Hi @dbuenzli , thanks for these comments. In terms of the new fields, yes, perhaps we should create/propose an extension to the core WARC format with these new fields, and...
We are in the process of documenting these new headers and fields, tracking in https://github.com/webrecorder/browsertrix/issues/issue/1588
Improved logging merged in #195. Significant changes include:
- Logs are output as json-l with proper log levels and contexts to support filtering
- Page crawl graph data included
- ...
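As a rough illustration of the filtering this enables, here is a minimal Python sketch that reads json-l log lines and keeps only those matching a level and/or context. The field names `logLevel`, `context`, and `message`, and the sample entries, are assumptions made for the example and are not confirmed by the comment above; adjust them to whatever the crawler actually emits.

```python
import json


def filter_logs(lines, level=None, context=None):
    """Yield parsed log entries matching the given level and/or context.

    The keys "logLevel" and "context" are assumed field names,
    not confirmed by the source.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON output mixed into the log
        if not isinstance(entry, dict):
            continue
        if level is not None and entry.get("logLevel") != level:
            continue
        if context is not None and entry.get("context") != context:
            continue
        yield entry


# Hypothetical sample log lines, for illustration only.
sample_lines = [
    '{"logLevel": "info", "context": "general", "message": "Crawl started"}',
    '{"logLevel": "error", "context": "pageGraph", "message": "Page failed"}',
    'not json at all',
    '{"logLevel": "error", "context": "general", "message": "Timeout"}',
]

errors = list(filter_logs(sample_lines, level="error"))
general_errors = list(filter_logs(sample_lines, level="error", context="general"))
```

Because each line is an independent JSON object, the same loop works on a file handle or a piped stream without loading the whole log into memory.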
@despens it seems like the main outstanding issue from your comment is that getting TLDs from `pages.jsonl` can be difficult because of the presence of extracted full text, which seems...
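One way around the extracted text is to parse each line as JSON and read only the URL, ignoring every other field. The sketch below assumes each record in `pages.jsonl` has a `url` key and that any header line lacks one; both are assumptions for illustration. It also takes the last dot-separated label of the hostname as the TLD, which is a simplification that does not handle multi-part public suffixes such as `co.uk`.

```python
import json
from urllib.parse import urlsplit


def iter_hostnames(lines):
    """Yield the hostname of each page record, skipping lines without a URL.

    Assumes each record stores its address under a "url" key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        url = record.get("url")
        if not url:
            continue  # e.g. a header line with no page URL
        host = urlsplit(url).hostname
        if host:
            yield host


def last_label(host):
    """Return the final dot-separated label of a hostname (a naive TLD)."""
    return host.rsplit(".", 1)[-1]


# Hypothetical sample records, for illustration only.
sample_lines = [
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
    '{"id": "1", "url": "https://example.org/a", "text": "long extracted text"}',
    '{"id": "2", "url": "https://www.example.co.uk/b", "text": "more text"}',
]

hosts = list(iter_hostnames(sample_lines))
tlds = [last_label(h) for h in hosts]
```

Reading line by line keeps memory use flat even when the extracted text makes individual records large.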
Moved to Playwright in https://github.com/webrecorder/browsertrix-crawler/commit/82808d813321c6c5860a529414e20e2638887b31