Web Crawler 3.0: CertificateException
Hi, I am now facing a further problem since migrating from Collector HTTP 2.9.0 to 3.0.0. My project is a hybrid XML/Java one.
In my hcp-config.xml I have the following fetcher configuration: ...
<httpFetchers maxRetries="3" retryDelay="3000">
  <fetcher class="${httpFetcherRef}">
    <userAgent>Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0</userAgent>
    <validStatusCodes>200,404,500</validStatusCodes>
    <!-- notFoundStatusCodes></notFoundStatusCodes -->
    <forceContentTypeDetection>true</forceContentTypeDetection>
    <connectionTimeout>300000</connectionTimeout>
    <connectionRequestTimeout>300000</connectionRequestTimeout>
    <socketTimeout>120000</socketTimeout>
    <!-- cookiesDisabled>false</cookiesDisabled -->
    <trustAllSSLCertificates>true</trustAllSSLCertificates>
    <disableSNI>true</disableSNI>
    <disableHSTS>true</disableHSTS>
    <expectContinueEnabled>true</expectContinueEnabled>
    <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
    <!-- if needed:
    <proxySettings>
      <host><name>proxy.regione.lazio.it</name><port>8080</port></host>
      <scheme>http</scheme>
    </proxySettings>
    -->
    <maxRedirects>5</maxRedirects>
    <headers>
      <!-- base64(admin) : md5(Leonardo.2019) -->
      <header name="Authorization">HCP YWRtaW4=:c0665914ae827af33a1d05dad99a0c4c</header>
      <header name="Connection">keep-alive</header>
      <header name="Content-Type">application/json</header>
      <header name="Accept">application/json</header>
    </headers>
...
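For clarity, the Authorization value above is composed exactly as the comment says: base64(user), a colon, then md5(password). A minimal sketch of how such a value can be built in plain Java (the class and method names are illustrative only, not part of my project):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public final class HcpAuth {

    // Builds "HCP base64(user):md5hex(password)" as used in the header above.
    public static String authorizationValue(String user, String password) throws Exception {
        String b64User = Base64.getEncoder()
                .encodeToString(user.getBytes(StandardCharsets.UTF_8));
        byte[] md5 = MessageDigest.getInstance("MD5")
                .digest(password.getBytes(StandardCharsets.UTF_8));
        StringBuilder md5Hex = new StringBuilder();
        for (byte b : md5) {
            md5Hex.append(String.format("%02x", b));
        }
        return "HCP " + b64User + ":" + md5Hex;
    }
}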
In my Java class, I then have the following code, which reads the configuration and prepares the HttpCollector, crawler, and fetcher objects: ...
HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
HttpCollector collector = null;
// load the configuration from the XML file
try {
    Tenant t = buildFrom(tid);
    @SuppressWarnings("unused")
    ClassLoader classLoader = this.getClass().getClassLoader();
    File myXMLFile = new File(cfg.getBaseDir() + "/hcp-config.xml");
    File myVariableFile = new File(cfg.getBaseDir() + "/hcp-config.variables");
    /* XML configuration for crawler 3.0.0
     * see: https://opensource.norconex.com/docs/crawlers/web/getting-started.html#java-usage
     */
    ConfigurationLoader loader = new ConfigurationLoader();
    // captures XML validation errors at loadFromXML time
    List<XMLValidationError> list = new ArrayList<>();
    ErrorHandlerCapturer capturer = new ErrorHandlerCapturer(list);
    loader.setVariablesFile(myVariableFile.toPath());
    collector = new HttpCollector(collectorConfig);
    loader.loadFromXML(myXMLFile.toPath(), collectorConfig, capturer);
    capturer.getErrors().forEach(error -> log.error("loadFromXML XML Validation Error: "
            + error.getSeverity().name() + ", " + error.getMessage()));

    List<CrawlerConfig> crawlers = collectorConfig.getCrawlerConfigs();
    HttpCrawlerConfig crawlerCfg = (HttpCrawlerConfig) crawlers.get(0);
    crawlerCfg.setOrphansStrategy(CrawlerConfig.OrphansStrategy.IGNORE);
    crawlerCfg.setStartURLsFiles(urlsFilesByTenant(t.getTid()));

    // set the per-tenant Authorization header on the fetcher
    GenericHttpFetcher fetcher = (GenericHttpFetcher) crawlerCfg.getHttpFetchers().get(0);
    GenericHttpFetcherConfig fetcherConfig = fetcher.getConfig();
    fetcherConfig.setRequestHeader("Authorization",
            "HCP " + t.getUsr().trim() + ":" + t.getPwd().trim()); // base64:md5

    // replace the placeholder tenant id in the work directory path
    String currentWorkDir = collectorConfig.getWorkDir().toString().trim();
    String replacedWorkDir = currentWorkDir.replaceAll("tenantId_DOCS", tid.trim() + "_DOCS");
    collectorConfig.setWorkDir(new File(replacedWorkDir.trim()).toPath());

    // also add the Authorization string as an Idol field
    ImporterConfig importerConfig = crawlerCfg.getImporterConfig();
    List<IImporterHandler> postParseHandlers = importerConfig.getPostParseHandlers();
    for (IImporterHandler tagger : postParseHandlers) {
        if (tagger.getClass().getName().equals(ConstantTagger.class.getName())) {
            ConstantTagger cTagger = (ConstantTagger) tagger;
            cTagger.addConstant("Authorization",
                    "HCP " + t.getUsr().trim() + ":" + t.getPwd().trim());
        }
    }
} catch (IOException e) {
    //e.printStackTrace();
    log.error(e); // log4j
    return "{\"errore\": \"" + e.getLocalizedMessage() + "\"}";
} catch (Exception e) {
    //e.printStackTrace();
    log.error(e); // log4j
    return e.getLocalizedMessage();
}
// collector already instantiated above
long started = System.currentTimeMillis();
try {
    if (!collector.isRunning()) {
        collector.start(); // <<------- here the CertificateException is thrown
    }
} catch (Exception e) {
...
When collector.start() runs, for every URL in the URLs file it fails with exceptions like these:
com.norconex.collector.http.fetch.HttpFetchException: Could not fetch document: https://data.t2.hcp.vm733.sicotetb.it/rest/(BAD%20BROTHERS)%20MALLARDO%20-%20sequestro%2006%202013_1653053569487.pdf ...
Caused by: javax.net.ssl.SSLHandshakeException: No name matching data.t2.hcp.vm733.sicotetb.it found ...
Caused by: java.security.cert.CertificateException: No name matching data.t2.hcp.vm733.sicotetb.it found
where an example URL is:
- https://data.t2.hcp.vm733.sicotetb.it/rest/(BAD%20BROTHERS)%20MALLARDO%20-%20sequestro%2006%202013_1653053569487.pdf
What is wrong? [Log attached]
This indicates a problem with the certificate of the site you are targeting. Typical certificate errors occur when the certificate does not match the host name it is served for, when it is not signed by a recognized certificate authority, etc.
I cannot access the URL you shared, so I can't help you there. Can you reproduce the error with a public URL you can share?
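In the meantime, to see which names the certificate actually carries (and whether the host name you crawl appears among them), a quick standalone check like the one below can help. This is a diagnostic sketch only; the class name is made up and the host is taken from your URLs:

import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class CertNameCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "data.t2.hcp.vm733.sicotetb.it";

        // Trust-all manager so the handshake completes even if the
        // certificate is untrusted or mismatched -- diagnostic use only.
        TrustManager[] trustAll = { new X509TrustManager() {
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        }};
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, null);

        try (SSLSocket socket = (SSLSocket) ctx.getSocketFactory().createSocket(host, 443)) {
            socket.startHandshake();
            X509Certificate cert =
                    (X509Certificate) socket.getSession().getPeerCertificates()[0];
            System.out.println("Subject DN: " + cert.getSubjectX500Principal());
            System.out.println("Subject Alternative Names: " + cert.getSubjectAlternativeNames());
        }
    }
}

If the host name you crawl does not appear in the subject or the subject alternative names, the "No name matching ... found" error is expected.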
I tried with a public document URL: https://raw.githubusercontent.com/papers-we-love/papers-we-love/master/artificial_intelligence/3-bayesian-network-inference-algorithm.pdf
and the problem (CertificateException) does not happen. It only appears with a private HCP web (HTTPS) file system, whose domain name is 'hcp.vm733.sicotetb.it' and whose 'namespace' prefixes can be data.t2., ind.t2., and so on.
Is there a way to configure the GenericHttpFetcher's SSLContext so that server name verification against the Java cacerts is skipped?
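For context, at the plain JSSE level the behaviour I am after looks like the sketch below (hypothetical JDK-only code, not the fetcher's configuration): the JDK only compares the certificate names against the host when the endpoint identification algorithm is set (typically to "HTTPS"), and that comparison is what produces "No name matching ... found". Leaving it unset skips the name check while the chain is still validated against cacerts:

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSocket;

public class EndpointIdentificationDemo {

    static void handshake(String host, String endpointIdAlgorithm) throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        try (SSLSocket socket =
                (SSLSocket) ctx.getSocketFactory().createSocket(host, 443)) {
            SSLParameters params = socket.getSSLParameters();
            params.setEndpointIdentificationAlgorithm(endpointIdAlgorithm);
            socket.setSSLParameters(params);
            socket.startHandshake();
            System.out.println(host + " [" + endpointIdAlgorithm + "]: handshake OK");
        }
    }

    public static void main(String[] args) throws Exception {
        String host = "data.t2.hcp.vm733.sicotetb.it";
        handshake(host, null);    // name check skipped; passes if the chain itself is trusted
        handshake(host, "HTTPS"); // name check enabled; reproduces "No name matching ... found"
    }
}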
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.