
Run experiments with Grobid deployed as a server

Open marekhorst opened this issue 11 months ago • 2 comments

We could start by dividing the set of required changes into sub-topics:

  • [DONE] client code responsible for communicating with the Grobid server and sending PDF contents for parsing (this is already part of https://git.icm.edu.pl/mhorst/grobid-integration-experiments)
    • it must be possible to request retrieval of raw affiliations and raw citations
  • [DONE] TEI XML document to ExtractedDocumentMetadata avro record transformer code
    • we should refine it once we are able to run the transformer on a larger number of contents
  • [DONE] integrating with IIS workflows

We need to also cover:

  • [DONE] removing specific entries from the cache (e.g. the ones we need to parse with Grobid because CERMINE did not extract metadata properly) by identifying records with checksums - already addressed in #1529
  • [DONE] running the cache_builder workflow on a specific set of contents (identified by a set of OAids, DOIs, source repository, provenance, etc.)
    • it turns out the current cache_builder already provides this feature: it accepts an input_id parameter pointing at an input datastore with the set of identifiers to be approved. It is therefore enough to build an Avro datastore of eu.dnetlib.iis.common.schemas.Identifier records (which can be produced in various ways, relying on different filtering rules: particular documents to be processed, repositories, provenance, etc.). Along with input_id, input_id_mapping also needs to be provided; it points at the identifier mappings between content ids and graph ids and is produced by the InfoSpace importer module (which needs to be run on the same version of the graph the input_id identifiers come from).

marekhorst avatar Jan 02 '25 16:01 marekhorst

I managed to put everything together. After integrating Grobid with metadata extraction and setting up the whole workflow, the job fails due to the following issue:

2025-03-21 13:36:33,338 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:151)
	at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:977)
	at eu.dnetlib.iis.wf.importer.HttpClientUtils.buildHttpClient(HttpClientUtils.java:29)
	at eu.dnetlib.iis.wf.metadataextraction.grobid.GrobidClient.<init>(GrobidClient.java:50)
	at eu.dnetlib.iis.wf.metadataextraction.MetadataExtractorMapper.setup(MetadataExtractorMapper.java:171)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

which indicates a dependency clash between the jars provided by the user as part of the Oozie package and the ones from the environment (sharelibs, parcels, etc.). This was confirmed by adding the following debug code to the HttpClientUtils class:

System.out.println(HttpClientUtils.class.getClassLoader().getResource("org/apache/http/conn/ssl/SSLConnectionSocketFactory.class"));
System.out.println(HttpClientUtils.class.getClassLoader().getResource("org/apache/http/impl/client/HttpClientBuilder.class"));
System.out.println(HttpClientUtils.class.getClassLoader().getResource("org/apache/http/conn/ssl/AllowAllHostnameVerifier.class"));

which produced the following output:

jar:file:/data/1/yarn/nm/filecache/712301/httpclient-4.5.12.jar!/org/apache/http/conn/ssl/SSLConnectionSocketFactory.class
jar:file:/data/1/yarn/nm/filecache/712301/httpclient-4.5.12.jar!/org/apache/http/impl/client/HttpClientBuilder.class
jar:file:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/jars/httpclient-4.2.5.jar!/org/apache/http/conn/ssl/AllowAllHostnameVerifier.class

so even though AllowAllHostnameVerifier is part of httpclient-4.5.12.jar, this particular class was loaded from a different jar (the much older httpclient-4.2.5.jar shipped with the CDH parcel).
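A small helper along these lines (a sketch for illustration, not part of the IIS codebase; the class name is made up) makes this kind of diagnosis reusable: it resolves any class to the URL of the artifact it was actually loaded from.

```java
// Sketch of a reusable class-origin probe for diagnosing classpath clashes.
public class ClassOrigin {

    /** Returns the URL of the .class resource as seen by the class's own loader. */
    public static java.net.URL locate(Class<?> clazz) {
        String resource = clazz.getName().replace('.', '/') + ".class";
        ClassLoader loader = clazz.getClassLoader();
        // Bootstrap classes (e.g. java.lang.String) report a null classloader.
        return loader != null
                ? loader.getResource(resource)
                : ClassLoader.getSystemResource(resource);
    }

    public static void main(String[] args) {
        // For a class compiled on the local classpath this prints a file: URL;
        // for a class packed in a jar it prints a jar:file:...!/... URL.
        System.out.println(locate(ClassOrigin.class));
    }
}
```

Comparing the URLs printed for two classes that should come from the same artifact immediately reveals a split like the one above.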

Possible solutions:

  • loading user dependencies first (this should also be possible with MapReduce)
  • aligning on the httpclient dependency version by downgrading the one used in the IIS package (this might break other places, e.g. WebCrawlerJob, so it should be tested)
  • rely on dependency shading

marekhorst avatar Mar 21 '25 14:03 marekhorst

Possible solutions:

  • loading user dependencies first (it should be possible also with MapReduce)

In the end this turned out to be enough: simply defining the following property in the metadata extraction workflow definition resolved the issue:

            <property>
                <name>mapreduce.job.user.classpath.first</name>
                <value>true</value>
            </property>
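For context, mapreduce.job.user.classpath.first is a standard Hadoop MapReduce property; in an Oozie workflow definition it sits inside the action's <configuration> element. A minimal sketch (the action name is a placeholder, not the actual IIS workflow):

```xml
<action name="metadata-extraction">
    <map-reduce>
        <configuration>
            <!-- load jars shipped with the workflow before the cluster-provided ones -->
            <property>
                <name>mapreduce.job.user.classpath.first</name>
                <value>true</value>
            </property>
        </configuration>
    </map-reduce>
</action>
```

The related mapreduce.job.classloader=true property, which isolates user classes in a separate classloader, is another commonly used remedy for this class of problem.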

marekhorst avatar Mar 21 '25 15:03 marekhorst