[Question] Where should I set the content obtained from an HTTP request?

Open naveen17797 opened this issue 4 years ago • 1 comments

I am extending the Heritrix module org.archive.modules.fetcher.FetchHTTP and overriding its innerProcess method so that a headless browser fetches the content instead of the built-in Heritrix HTTP request.


    @Override
    protected void innerProcess(CrawlURI curi) throws InterruptedException { }

I read through the source of the FetchHTTP module but was unable to figure out where this method actually sets the content obtained from the request.

    protected void addResponseContent(HttpResponse response, CrawlURI curi) {
        curi.setFetchStatus(response.getStatusLine().getStatusCode());
        Header ct = response.getLastHeader("content-type");
        curi.setContentType(ct == null ? null : ct.getValue());
        
        for (Header h: response.getAllHeaders()) {
            curi.putHttpResponseHeader(h.getName(), h.getValue());
        }
    }

The above method is called when the HTTP request succeeds, but I couldn't find any setters in it for the content obtained from a URL (for example, an HTML page).

How can I set the HTML content so that Heritrix can proceed to extract the links from it?

Sep 23 '21 10:09 naveen17797

Assuming your content is supplied by an InputStream called stream, something like this will probably work:

    Recorder recorder = curi.getRecorder();
    recorder.markContentBegin();
    recorder.inputWrap(stream);
    recorder.getRecordedInput().readFully();
    recorder.closeRecorders();

handleCapturedRequest() in ExtractorChrome may be a relevant example of integrating Heritrix with a headless browser. Keep in mind, though, that it records subrequests on a background thread and so has to jump through a lot more hoops. Since you're writing a fetch processor, you don't have to set up your own recorder and can use the one already supplied by the ToeThread; similarly, you don't need to call the extractors yourself.
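
If the headless browser hands back the rendered page as a String rather than as a stream, one possible way to get the `stream` assumed in the snippet above is to wrap the string's bytes in a ByteArrayInputStream. This is only a sketch using the Java standard library: the class name `HtmlStreams` and the method `toStream` are made up for illustration and are not part of Heritrix, and UTF-8 is an assumed encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Heritrix.
public class HtmlStreams {

    /** Wraps an in-memory HTML string as an InputStream of UTF-8 bytes. */
    public static InputStream toStream(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for whatever markup the headless browser returned.
        String html = "<html><body><a href=\"/next\">next</a></body></html>";
        try (InputStream stream = toStream(html)) {
            // In a fetch processor, this stream would be passed to recorder.inputWrap(stream).
            System.out.println(stream.readAllBytes().length + " bytes ready to record");
        }
    }
}
```

Since the bytes are encoded as UTF-8 here, the content type set on the CrawlURI via curi.setContentType should presumably declare the same charset so that later processing decodes the page consistently.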

Sep 23 '21 10:09 ato