headless-chrome-crawler
[Feature Request] Add support for WARC file format
WARC is a well-known format for storing crawled captures. It can store an arbitrary number of HTTP requests and responses, as well as other network interactions such as DNS lookups, together with their headers, payloads, and other metadata. It is mostly used by web archives, but there are other use cases as well. WARC is the default format in which the Heritrix crawler (originally developed by the Internet Archive) stores captures. Wget supports the WARC format too. Other tools include WARCreate (a Chrome extension), which saves web pages in WARC format along with all their page requisites while browsing, and Squidwarc (a Headless Chrome-based crawler), which is built specifically for archival purposes.
That said, adding support for the WARC format would immediately make this project more useful to the web archiving community.
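For readers unfamiliar with the format, here is a minimal sketch of what a single WARC/1.0 record looks like on disk, based on the record layout described above (named headers, then the payload). The helper name `buildWarcRecord` and the example URL are made up for illustration, and this is not how any of the tools mentioned above implement it.

```javascript
const { randomUUID } = require('crypto')

// Hypothetical helper: build one WARC/1.0 record as a Buffer.
// `block` is the record payload, e.g. a raw HTTP response
// (status line, headers, blank line, body).
function buildWarcRecord (type, targetUri, block, contentType) {
  const body = Buffer.isBuffer(block) ? block : Buffer.from(block, 'utf8')
  const header = [
    'WARC/1.0',
    `WARC-Type: ${type}`,
    `WARC-Record-ID: <urn:uuid:${randomUUID()}>`,
    // UTC timestamp with second precision (milliseconds stripped)
    `WARC-Date: ${new Date().toISOString().replace(/\.\d{3}Z$/, 'Z')}`,
    `WARC-Target-URI: ${targetUri}`,
    `Content-Type: ${contentType}`,
    // Content-Length counts the bytes of the block only
    `Content-Length: ${body.length}`
  ].join('\r\n')

  // Header lines end with CRLF, a blank line separates header and block,
  // and the record is terminated by two CRLFs.
  return Buffer.concat([
    Buffer.from(header + '\r\n\r\n', 'utf8'),
    body,
    Buffer.from('\r\n\r\n', 'utf8')
  ])
}

// Example payload: a tiny raw HTTP response
const httpResponse = 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello'
const record = buildWarcRecord(
  'response',
  'https://example.com/',
  httpResponse,
  'application/http; msgtype=response'
)
console.log(record.toString('utf8'))
```

A WARC file is simply a sequence of such records, so a crawler that can capture raw requests and responses mostly needs to wrap each one this way and append it to the file.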
Thanks for filing an issue, but I don't think we should invest in this in the near future. I understand that web archiving is a real need, but I also believe it is better served by libraries dedicated to that purpose, like the Heritrix crawler and Squidwarc you mentioned. I'd like to keep this crawler focused mainly on scraping. My assumption is that people who want to do web archiving do not need to scrape HTML elements at the same time, and vice versa.
I'll keep this issue open for a while to hear more opinions and use cases from others.
I'm interested in web archiving too and played a bit with the crawler. Since it exposes Puppeteer's Page, it's possible to intercept the traffic and save it as WARC. I tried doing this with node-warc, tampering a little with the library code to expose the WARCWriterBase class.
const HCCrawler = require('headless-chrome-crawler')
const WARC = require('node-warc')

const run = async () => {
  const crawler = await HCCrawler.launch({
    args: ['--disable-web-security', '--ignore-certificate-errors', '--allow-running-insecure-content'],
  })
  crawler.on('newpage', async page => {
    // Intercept requests so every request/response pair can be recorded
    await page.setRequestInterception(true)
    page.on('request', async request => {
      request.continue()
    })
    page.on('response', async response => {
      // Note: a new writer is created and initialized for every response
      const assetWriter = new WARC.WARCWriterBase()
      assetWriter.initWARC('./temp_2.warc', true)
      // Flatten the request headers into "name: value" lines
      let reqHeaders = ''
      for (const header of Object.entries(response.request().headers())) {
        reqHeaders = `${reqHeaders}${header.join(': ')}\n`
      }
      await assetWriter.writeRequestRecord(response.request().url(), reqHeaders, response.request().postData())
      const responseData = await response.buffer()
      // Flatten the response headers the same way
      let resHeaders = ''
      for (const header of Object.entries(response.headers())) {
        resHeaders = `${resHeaders}${header.join(': ')}\n`
      }
      await assetWriter.writeResponseRecord(response.url(), resHeaders, responseData)
    })
  })
  crawler.queue('https://google.com/')
  crawler.onIdle().then(() => crawler.close())
}

run()
But the WARC that comes out isn't well formed, and I'm not sure why. I'm not an expert in WARC files or crawling, and I couldn't work out why it doesn't work. The library has a great API, but I've found a few things that don't make sense according to the WARC specification.
It would be great to continue exploring this and also make a PR to close https://github.com/N0taN3rd/node-warc/issues/2
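One detail worth checking in the script above: it joins header lines with a bare `\n` and does not write a request line or status line. A WARC record of type `application/http` stores the raw HTTP message, which starts with a start line and uses CRLF line endings. Below is a minimal sketch of rebuilding that message head from the kind of values Puppeteer exposes (`request.method()`, `request.url()`, `request.headers()`, `response.status()`, `response.statusText()`, `response.headers()`). The helper names are hypothetical, `HTTP/1.1` is an assumption because the protocol version is not taken from the page, and whether node-warc expects the start line as part of its header argument is not confirmed anywhere in this thread.

```javascript
// Serialize a header object (shaped like the result of Puppeteer's
// request.headers() / response.headers()) with CRLF line endings.
function formatHeaders (headers) {
  return Object.entries(headers)
    .map(([name, value]) => `${name}: ${value}\r\n`)
    .join('')
}

// Hypothetical helper: rebuild the head of an HTTP request.
// Assumes HTTP/1.1; adds a host header unless `headers` already has one.
function buildRequestHead (method, url, headers) {
  const { pathname, search, host } = new URL(url)
  const withHost = { host, ...headers }
  return `${method} ${pathname}${search} HTTP/1.1\r\n${formatHeaders(withHost)}\r\n`
}

// Hypothetical helper: rebuild the head of an HTTP response (assumes HTTP/1.1).
function buildResponseHead (status, statusText, headers) {
  return `HTTP/1.1 ${status} ${statusText}\r\n${formatHeaders(headers)}\r\n`
}

const requestHead = buildRequestHead(
  'GET',
  'https://example.com/search?q=warc',
  { 'user-agent': 'demo' }
)
const responseHead = buildResponseHead(200, 'OK', { 'content-type': 'text/html' })

console.log(JSON.stringify(requestHead))
console.log(JSON.stringify(responseHead))
```

The trailing empty line (`\r\n`) marks the end of the headers; the request or response body would follow it directly inside the record block.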
Thanks @BubuAnabelas for your investigation. I will ping @N0taN3rd here to see why node-warc is misbehaving. However, if you have a more concrete description of the issue, you might want to create a ticket in its repository.
I'm sorry, I don't have a more concrete description of the issue. Still, the script above is the one I used, and (in theory) it should give you the same WARC file as mine. You could then parse that file with another tool, or open it in a text editor, and see which problems you can identify.
For instance, there are some records in which it seems that the request/response headers and the request data/response body are not written. It might be an asynchronous problem, but that's why those await calls are there, and it still happens.
I also tried disabling the page cache by adding await page.setCacheEnabled(false) to the code, but nothing changed.
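One possibility that has not been confirmed here: the `await` calls only order the writes inside a single 'response' handler. An event emitter does not wait for an async listener to finish before calling it again for the next event, so writes from different responses can still interleave, and the script also creates a new writer per response. A minimal sketch of funnelling all writes through one promise chain is shown below. `createSerialQueue` and `fakeWriter` are hypothetical stand-ins for illustration, not part of node-warc or headless-chrome-crawler.

```javascript
// Run async tasks strictly one after another, in the order they were queued.
function createSerialQueue () {
  let tail = Promise.resolve()
  return function enqueue (task) {
    const result = tail.then(task)
    // Keep the chain usable even if one task rejects
    tail = result.catch(() => {})
    return result
  }
}

// Stand-in writer (hypothetical): each write takes a different amount of time,
// mimicking records of different sizes being flushed to disk.
const written = []
const fakeWriter = {
  async write (label, delayMs) {
    await new Promise(resolve => setTimeout(resolve, delayMs))
    written.push(label)
  }
}

const enqueue = createSerialQueue()

// Queued back to back, the way concurrent 'response' handlers would queue them
const done = Promise.all([
  enqueue(() => fakeWriter.write('request-1', 30)),
  enqueue(() => fakeWriter.write('response-1', 5)),
  enqueue(() => fakeWriter.write('request-2', 1))
])

done.then(() => console.log(written))
```

In the crawler script, the idea would be to create one writer and one queue outside the 'response' handler and wrap each pair of record writes in a single `enqueue` call, so a request record is always followed by its own response record.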
Perhaps @N0taN3rd can help us, and I'll contribute by opening the necessary issues in node-warc's repo.