Ilya Kreymer
Could there be a way to create WARCs of a certain size after one run (combinewarc / rolloversize...)
> What I basically meant was: the crawler writes WARCs of whatever size to the /archive/ folder, and then does /.. and creates WARCs of a certain size in that folder. Now we have archive/...
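To make the request above concrete, here is a minimal sketch of size-capped grouping: given the WARCs a crawl left in the archive folder, decide which ones would be combined into each output file. This is not how the crawler implements it. The function name `plan_rollover`, the file names, and the byte sizes are all hypothetical, and the sketch only plans the grouping rather than writing any files.

```python
from typing import List, Sequence, Tuple


def plan_rollover(
    files: Sequence[Tuple[str, int]], rollover_size: int
) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs into batches capped at rollover_size.

    Hypothetical helper for illustration only. A new batch is started
    whenever adding the next file would push the current batch over the cap.
    A single file larger than the cap gets a batch of its own, because this
    sketch never splits a file.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for name, size in files:
        # Close the current batch before it would exceed the cap.
        if current and current_size + size > rollover_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size

    # Flush whatever is left after the last file.
    if current:
        batches.append(current)
    return batches


# Made-up example input: four WARCs and a 1000-byte cap.
example_files = [
    ("a.warc.gz", 400),
    ("b.warc.gz", 500),
    ("c.warc.gz", 300),
    ("d.warc.gz", 1200),
]
example_plan = plan_rollover(example_files, rollover_size=1000)
```

Each inner list in `example_plan` names the files that would go into one combined output. Whether the cap is a strict maximum or only a threshold for starting a new file is a design choice this sketch leaves open.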
Yes, I think this is mostly a documentation issue. capture_http() intercepts and writes what is actually loaded over the HTTP connection in the background. > ``` > with capture_http('example.warc.gz'): >...
Thanks for this - I think it would be clearer if it was part of the existing `--screenshot` flag, perhaps called `fullPageAfterBehaviors` - do you mind refactoring it to use...
Thinking we should rename it to `fullPageFinal` instead of `fullPageAfterBehaviors`, since behaviors may not run in some circumstances, and this is more consistent with the `final-to-warc` setting we have for text
> Could we also add the page id to the data pushed to Redis, just to help with matching in Browsertrix? The QA data is now merged with the page...
Thanks for reporting, I think what may be happening is that the request is being saved as truncated: https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/recorder.ts#L1544 https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/recorder.ts#L1736 But perhaps that's not what we want, especially with async...
Hm, most of this can be solved with browser profiles, which offer a more user-friendly interface than tracking custom headers, especially for cookies. I suppose if there's a reason to...
> This function is a must. We're talking about many terabytes of data. Storage should be flexible and configurable. Still, the issue has gone from ready, back to todo. >...
@pirate thanks for sharing this, potentially exciting that there is a standalone JSON format that's *not* tied to puppeteer. I wonder if there is a spec for it. There's potentially...
Should also evaluate the general applicability of this, beyond a single page. In some ways, this is similar to what [Memento Tracer](http://tracer.mementoweb.org/) was trying to do. The overall behavior system...