Tessa Walsh
Ad blocking via request interception added in #173, via a new `--blockAds` flag
Include documentation on updating drivers from Puppeteer (crawler
Hi @gitreich - putting this on our sprint board to look into after IIPC WAC :)
I believe that I've hit this same issue attempting to stream the partial contents of a file with aiobotocore. My use case is that I am extracting a file from...
> regular warcs + combined warc: `type: combined` or `type: web`? +1 for `web` as best term we've come up with so far for general WARC records capturing web traffic
Hi @cmillet2127, based on [a discussion in the minio-js repo](https://github.com/minio/minio-js/issues/619#issuecomment-326158139) I think the crawler should work as-is and minio-js will autodiscover the bucket if you use `s3.amazonaws.com` as the STORE_ENDPOINT_URL....
Also noticing that `js-wacz` is logging strings to stdout, which breaks our logging format. Might want to see what we can do about that. I suppose if we call it...
TODO: - Add WACZ validation (not yet supported in js-wacz) - Make CDXJ handling more memory-efficient in js-wacz (currently keeps all pages in memory, may OOM with large crawls) -...
Currently migrating the CI from Travis to GitHub Actions. Steps necessary include: - [x] Remove Travis config file and add GitHub Actions workflow document to repo - [x] Update Python...
Ashley! This is so great! This is just the kind of thing I had in mind. A great addition! I'll follow up on the details in PR #22, but thank you...