[Question] Where should I set the content obtained from an HTTP request?
I am extending the Heritrix module org.archive.modules.fetcher.FetchHTTP and overriding its innerProcess method so that a headless browser fetches the content instead of Heritrix's built-in HTTP request:
@Override
protected void innerProcess(CrawlURI curi) throws InterruptedException { }
I read through the source of the FetchHTTP module, but I was unable to figure out where it actually sets the content obtained from the request.
protected void addResponseContent(HttpResponse response, CrawlURI curi) {
    curi.setFetchStatus(response.getStatusLine().getStatusCode());
    Header ct = response.getLastHeader("content-type");
    curi.setContentType(ct == null ? null : ct.getValue());
    for (Header h: response.getAllHeaders()) {
        curi.putHttpResponseHeader(h.getName(), h.getValue());
    }
}
The method above is called when the HTTP request succeeds, but I couldn't find any setter in it for the content fetched from the URL (for example, an HTML page).
How can I set the HTML content so that Heritrix can proceed to extract the links from it?
Assuming your content is supplied by an InputStream called stream, something like this will probably work:
Recorder recorder = curi.getRecorder();
recorder.markContentBegin();
recorder.inputWrap(stream);
recorder.getRecordedInput().readFully();
recorder.closeRecorders();
handleCapturedRequest() in ExtractorChrome may be a relevant example of integrating Heritrix with a headless browser. Keep in mind, though, that it records subrequests on a background thread and so has to jump through a lot more hoops. Since you're writing a fetch processor, you don't have to set up your own recorder: you can use the one already supplied by the ToeThread, and you don't need to call the extractors yourself.
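If your headless browser hands you the rendered page as a String rather than as a stream (an assumption on my part, since it depends on which browser library you use), you need to turn it into the InputStream called stream that the Recorder snippet expects. A minimal sketch using only the JDK follows; the class and method names are made up for illustration and are not part of Heritrix:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Heritrix class.
final class RenderedHtmlStream {
    private RenderedHtmlStream() {}

    // Encodes the rendered HTML as UTF-8 and exposes it as an InputStream,
    // suitable for passing to something that records a byte stream.
    static InputStream toStream(String renderedHtml) {
        byte[] bytes = renderedHtml.getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(bytes);
    }
}
```

You would then pass RenderedHtmlStream.toStream(html) as stream. Because the bytes are encoded as UTF-8 here, the content type you set on the CrawlURI should probably declare the same charset so that later processing decodes the page correctly.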