
Web Crawler 3.0: CertificateException

Open ciroppina opened this issue 2 years ago • 3 comments

Hi, I am facing a further problem since I migrated from Collector HTTP 2.9.0 to 3.0.0. My project is a hybrid XML/Java one.

In my hcp-config.xml I have the following fetcher configuration: ...

<httpFetchers maxRetries="3" retryDelay="3000">
	<fetcher class="${httpFetcherRef}" >
		<userAgent>Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0</userAgent>
		<validStatusCodes>200,404,500</validStatusCodes>
		<!-- notFoundStatusCodes></notFoundStatusCodes -->
		<forceContentTypeDetection>true</forceContentTypeDetection>
		<connectionTimeout>300000</connectionTimeout>
		<connectionRequestTimeout>300000</connectionRequestTimeout>
		<socketTimeout>120000</socketTimeout>
		<!-- cookiesDisabled>false</cookiesDisabled -->
		<trustAllSSLCertificates>true</trustAllSSLCertificates>
		<disableSNI>true</disableSNI>
		<disableHSTS>true</disableHSTS>
		<expectContinueEnabled>true</expectContinueEnabled>
		<sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
		<!-- if needed
		<proxySettings>
			<host><name>proxy.regione.lazio.it</name><port>8080</port></host>
			<scheme>http</scheme>
		</proxySettings>
		-->
		<maxRedirects>5</maxRedirects>
		
		<headers>
			<!-- base64(admin) : md5(Leonardo.2019) -->
			<header name="Authorization">HCP YWRtaW4=:c0665914ae827af33a1d05dad99a0c4c</header>
			<header name="Connection">keep-alive</header>
			<header name="Content-Type">application/json</header>
			<header name="Accept">application/json</header>
		</headers>

...

and in my Java class I have the following code, which reads the configuration and prepares the HttpCollector, crawler, and fetcher objects: ...

	HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
	HttpCollector collector = null;
	
	/** loading config From Xml file */
	try {
		Tenant t = buildFrom(tid);

		@SuppressWarnings("unused")
		ClassLoader classLoader = this.getClass().getClassLoader();
		
		File myXMLFile = new File(cfg.getBaseDir() + "/hcp-config.xml");
		File myVariableFile = new File(cfg.getBaseDir() + "/hcp-config.variables");
		
		/* XML configuration for crawler 3.0.0 
		 * see: https://opensource.norconex.com/docs/crawlers/web/getting-started.html#java-usage
		 */
		ConfigurationLoader loader = new ConfigurationLoader();
		// to capture XML Validation errors at loadFromXML-time
		List<XMLValidationError> list = new ArrayList<XMLValidationError>();
		ErrorHandlerCapturer capturer = new ErrorHandlerCapturer(list);
		loader.setVariablesFile(myVariableFile.toPath());
		collector = new HttpCollector(collectorConfig);
		loader.loadFromXML(myXMLFile.toPath(), collectorConfig, capturer);
		
		capturer.getErrors().forEach(error -> log.error("loadFromXML XML Validation Error: " 
                     + error.getSeverity().name() +", "+ error.getMessage()));

		List<CrawlerConfig> crawlers = collectorConfig.getCrawlerConfigs();
		HttpCrawlerConfig crawlerCfg = (HttpCrawlerConfig) crawlers.get(0);
		crawlerCfg.setOrphansStrategy(CrawlerConfig.OrphansStrategy.IGNORE);
		crawlerCfg.setStartURLsFiles(urlsFilesByTenant(t.getTid() ));
		
		GenericHttpFetcher fetcher = (GenericHttpFetcher) crawlerCfg.getHttpFetchers().get(0);
		GenericHttpFetcherConfig fetcherConfig = fetcher.getConfig();
		fetcherConfig.setRequestHeader("Authorization", 
			"HCP "+ t.getUsr().trim()+":" +t.getPwd().trim() ); //base64:md5
		
		String currentWorkDir = collectorConfig.getWorkDir().toString().trim();
		String replacedWorkDir = currentWorkDir.replaceAll("tenantId_DOCS", tid.trim() + "_DOCS");
		collectorConfig.setWorkDir(new File(replacedWorkDir.trim()).toPath());
		
		//also, add the Authorization string as a Idol field
		ImporterConfig importerConfig = crawlerCfg.getImporterConfig();
		List<IImporterHandler> erda = importerConfig.getPostParseHandlers();
		for (IImporterHandler tagger : erda) {
			if (tagger.getClass().getName()
				.equals(ConstantTagger.class.getName())) {
				ConstantTagger cTagger= (ConstantTagger)tagger;
				cTagger.addConstant("Authorization", 
					"HCP "+ t.getUsr().trim()+":" +t.getPwd().trim()
				);
			}
		}
	} catch (IOException e) {
		//e.printStackTrace();
		log.error(e); // log4j
		return "{\"errore\": \"" + e.getLocalizedMessage() + "\"}";
	} catch (Exception e) {
		//e.printStackTrace();
		log.error(e); // log4j
		return e.getLocalizedMessage();
	}
	
	// collector already instantiated above
	Long started = System.currentTimeMillis();
	try {
		if (!collector.isRunning())
			collector.start();  //<< ------- Here gets the CertificateException
	} catch(Exception e) {

...

When collector.start() runs, for every URL in the URLs file it throws exceptions like:

	com.norconex.collector.http.fetch.HttpFetchException: Could not fetch document: https://data.t2.hcp.vm733.sicotetb.it/rest/(BAD%20BROTHERS)%20MALLARDO%20-%20sequestro%2006%202013_1653053569487.pdf
	...
	Caused by: javax.net.ssl.SSLHandshakeException: No name matching data.t2.hcp.vm733.sicotetb.it found
	...
	Caused by: java.security.cert.CertificateException: No name matching data.t2.hcp.vm733.sicotetb.it found

where an example URL is:

  • https://data.t2.hcp.vm733.sicotetb.it/rest/(BAD%20BROTHERS)%20MALLARDO%20-%20sequestro%2006%202013_1653053569487.pdf

What is wrong? [Log attached]

ciroppina avatar Aug 18 '22 20:08 ciroppina

It indicates a problem with the certificate of the site you are targeting. Typical certificate errors include a certificate whose CN/SubjectAlternativeName entries do not match the host name it is served under, or a certificate that is not signed by a recognized certificate authority.
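Concretely, the check that fails with "No name matching ... found" compares the host name in the URL against the CN/SubjectAlternativeName entries of the server certificate. In plain Java, `javax.net.ssl.HostnameVerifier` is one standard hook where that comparison can be customized. A minimal stand-alone sketch (the domain suffix is only an illustration taken from the URLs in this thread, not a recommended rule):

```java
import javax.net.ssl.HostnameVerifier;

public class RelaxedVerifierDemo {
    public static void main(String[] args) {
        // Instead of requiring an exact match against the certificate's
        // CN/SAN entries, accept any host under the internal domain.
        HostnameVerifier relaxed = (hostname, session) ->
                hostname.endsWith(".sicotetb.it");

        // The SSLSession argument is ignored by this lambda, so null is fine here.
        System.out.println(relaxed.verify("data.t2.hcp.vm733.sicotetb.it", null)); // prints "true"
        System.out.println(relaxed.verify("other.example.com", null));             // prints "false"
    }
}
```

Such a verifier is typically installed with `HttpsURLConnection.setDefaultHostnameVerifier(relaxed)`; whether the Norconex 3.0.0 fetcher exposes a comparable hook would have to be checked in its source.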

I cannot access the URL you shared so I can't help you there. Can you reproduce the error with a public URL you can share?

essiembre avatar Aug 19 '22 04:08 essiembre

I tried with a public document URL: https://raw.githubusercontent.com/papers-we-love/papers-we-love/master/artificial_intelligence/3-bayesian-network-inference-algorithm.pdf

and the problem (CertificateException) does not happen. It only appears with a private HCP HTTPS file system, whose domain name is 'hcp.vm733.sicotetb.it' and whose 'namespace' prefixes can be data.t2., ind.t2., and so on.

Is there a way to configure the GenericHttpFetcher's SSLContext in order to avoid ServerName verification against Java cacerts?
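For a plain-JSSE illustration of that question (whether the 3.0.0 GenericHttpFetcher also relaxes host-name checks when trustAllSSLCertificates is set would need to be verified in its source): certificate-chain validation and host-name verification are two separate checks. The first is relaxed with an accept-all X509TrustManager, the second with a permissive HostnameVerifier. A stand-alone sketch using only the JDK, not the Norconex API:

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllSketch {
    public static void main(String[] args) throws Exception {
        // Trust manager that accepts every certificate chain (no validation at all).
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
            }
        };

        SSLContext sc = SSLContext.getInstance("TLS");
        sc.init(null, trustAll, new SecureRandom());

        // Apply globally to HttpsURLConnection: skips both chain validation
        // and host-name verification. NOT safe outside a controlled intranet.
        HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
        HttpsURLConnection.setDefaultHostnameVerifier((host, session) -> true);

        System.out.println(sc.getProtocol()); // prints "TLS"
    }
}
```

Both relaxations remove real TLS guarantees, so they only make sense against a controlled internal host. The clean long-term fix is to reissue the HCP server certificate with SubjectAlternativeName entries covering the namespace prefixes (for example a wildcard such as *.t2.hcp.vm733.sicotetb.it), after which no JVM-side workaround is needed.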

ciroppina avatar Aug 20 '22 21:08 ciroppina

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 19 '22 22:10 stale[bot]