Run experiments with Grobid deployed as a server
The set of required changes can be divided into the following sub-topics:
- [DONE] client code responsible for communicating with the Grobid server and sending PDF contents for parsing (already part of https://git.icm.edu.pl/mhorst/grobid-integration-experiments)
  - it must be possible to request retrieval of raw affiliations and raw citations; see the sketch after this list
- [DONE] TEI XML document to `ExtractedDocumentMetadata` avro record transformer code - to be refined once we are able to run the transformer on a larger number of contents
- [DONE] integrating with IIS workflows
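A minimal sketch of the raw-affiliations/raw-citations part of the client, assuming Grobid's standard REST API (the `/api/processFulltextDocument` endpoint with the `includeRawAffiliations` and `includeRawCitations` form parameters) and Apache HttpClient 4.x; the class and method names below are illustrative, not the actual GrobidClient code:

import java.io.File;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class GrobidClientSketch {

    // e.g. http://localhost:8070/api/processFulltextDocument (illustrative URL)
    private final String serviceUrl;

    public GrobidClientSketch(String serviceUrl) {
        this.serviceUrl = serviceUrl;
    }

    /** Sends a PDF to the Grobid server and returns the resulting TEI XML. */
    public String processFulltext(File pdf) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpEntity entity = MultipartEntityBuilder.create()
                    .addBinaryBody("input", pdf, ContentType.create("application/pdf"), pdf.getName())
                    // Grobid flags enabling retrieval of raw (unparsed) affiliation and citation strings
                    .addTextBody("includeRawAffiliations", "1")
                    .addTextBody("includeRawCitations", "1")
                    .build();
            HttpPost post = new HttpPost(serviceUrl);
            post.setEntity(entity);
            try (CloseableHttpResponse response = client.execute(post)) {
                return EntityUtils.toString(response.getEntity());
            }
        }
    }
}

Grobid omits the raw strings from the TEI output by default, hence the explicit flags.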
We also need to cover:
- [DONE] removing specific entries from cache (e.g. the ones we need to parse with Grobid because CERMINE did not extract metadata properly) by identifying records with checksums - already addressed in #1529
- [DONE] running cache_builder workflow on a specific set of contents (identified with the set of OAids, DOIs, source repository, provenance etc)
  - it turns out the current cache_builder already provides this feature by accepting an `input_id` parameter pointing at an input datastore with the set of identifiers to be approved. This means it is enough to build an avro datastore with `eu.dnetlib.iis.common.schemas.Identifier` records, which can be done in various ways and can rely on different filtering rules (particular documents to be processed, repositories, provenance etc.); see the sketch below. Along with `input_id`, an `input_id_mapping` also needs to be provided. The latter points at identifier mappings between contents ids and graph ids and is produced by the InfoSpace importer module (which needs to be run on the same version of the graph the `input_id` identifiers come from).
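A minimal sketch of building such a datastore with the generic avro API, assuming the `Identifier` schema consists of a single string `id` field (the builder class name and the hardcoded schema string are assumptions made for illustration; in IIS the schema would come from the generated `Identifier` class):

import java.io.File;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class IdentifierDatastoreBuilder {

    // Assumed shape of eu.dnetlib.iis.common.schemas.Identifier: a single string field "id".
    private static final Schema IDENTIFIER_SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Identifier\","
            + "\"namespace\":\"eu.dnetlib.iis.common.schemas\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

    /** Writes the given content identifiers as Identifier records into an avro file. */
    public static void write(List<String> ids, File output) throws Exception {
        GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(IDENTIFIER_SCHEMA);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(IDENTIFIER_SCHEMA, output);
            for (String id : ids) {
                GenericRecord record = new GenericData.Record(IDENTIFIER_SCHEMA);
                record.put("id", id);
                fileWriter.append(record);
            }
        }
    }
}

Any filtering logic (particular documents, repositories, provenance etc.) would run upstream and only determine the list of ids passed in.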
I managed to put everything together. After integrating Grobid with metadata extraction and setting up the whole workflow, the job failed with the following error:
2025-03-21 13:36:33,338 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:151)
at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:977)
at eu.dnetlib.iis.wf.importer.HttpClientUtils.buildHttpClient(HttpClientUtils.java:29)
at eu.dnetlib.iis.wf.metadataextraction.grobid.GrobidClient.<init>(GrobidClient.java:50)
at eu.dnetlib.iis.wf.metadataextraction.MetadataExtractorMapper.setup(MetadataExtractorMapper.java:171)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
which indicates a dependency clash between the jars provided by the user as part of the oozie package and the ones from the environment (sharelibs, parcels etc.). This was confirmed after adding the following debug code to the HttpClientUtils class:
System.out.println(HttpClientUtils.class.getClassLoader().getResource("org/apache/http/conn/ssl/SSLConnectionSocketFactory.class"));
System.out.println(HttpClientUtils.class.getClassLoader().getResource("org/apache/http/impl/client/HttpClientBuilder.class"));
System.out.println(HttpClientUtils.class.getClassLoader().getResource("org/apache/http/conn/ssl/AllowAllHostnameVerifier.class"));
which produced the following output:
jar:file:/data/1/yarn/nm/filecache/712301/httpclient-4.5.12.jar!/org/apache/http/conn/ssl/SSLConnectionSocketFactory.class
jar:file:/data/1/yarn/nm/filecache/712301/httpclient-4.5.12.jar!/org/apache/http/impl/client/HttpClientBuilder.class
jar:file:/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/jars/httpclient-4.2.5.jar!/org/apache/http/conn/ssl/AllowAllHostnameVerifier.class
so even though AllowAllHostnameVerifier is part of httpclient-4.5.12.jar, this particular class was loaded from a different jar: the 4.2.5 one shipped with the CDH parcel. That older version most likely lacks the INSTANCE field which SSLConnectionSocketFactory from 4.5.12 references in its static initializer, hence the NoSuchFieldError.
Possible solutions:
- loading user dependencies first (this should also be possible with plain MapReduce)
- aligning on the `httpclient` dependency version by downgrading the one used in the IIS package (this might break other places, e.g. `WebCrawlerJob`, so it would have to be tested)
- relying on dependency shading
In the end the first option turned out to be enough: simply defining the following property in the metadata extraction workflow definition fixed the job:
<property>
    <name>mapreduce.job.user.classpath.first</name>
    <value>true</value>
</property>