CERMINE Large portions of text body missing.

For example, processing an example article cuts over 250 of its ~320 pages. While the main text body of an article isn't CERMINE's main interest, extracting the full content would be a desirable default behavior.

Comparison of the original and the .cermxml file: page 19 (20 by PDF counting) has a paragraph starting with "Chapter 5, 6...", after which the next paragraph is "There is also a more practical, concrete response. Dahlberg and Moss (2005: 107-110)", which is on page 311 (312 by PDF counting) in the original PDF.

Tested with CERMINE 2.12 and via the web interface; same behavior in both.

Mar 06 '17 15:03 pafnucy

@pafnucy CERMINE indeed focuses on extracting metadata and references, and the current default behaviour is to analyze only the first few and the last few pages of the file. It would be possible to extract the entire text, but because the system currently keeps the structure of the entire file in memory while processing, large files might cause performance problems. And of course, rewriting it so that the entire file is not kept in memory is not trivial and would take time.

How are you using CERMINE: from your own code, or from the command line using the JAR file?

Mar 16 '17 12:03 dtkaczyk

Using JAR

Mar 16 '17 12:03 pafnucy

@pafnucy I too just ran into this issue, and after some digging, came up with this, which overrides the default "first 20 and last 20 page" configuration:

// Raise the front and back page limits well above the page count,
// so that in practice every page is processed.
ComponentConfiguration config = new ComponentConfiguration();
ITextCharacterExtractor charExtractor = new ITextCharacterExtractor();
charExtractor.setPagesLimits(1000, 1000);
config.setCharacterExtractor(charExtractor);
ContentExtractor extractor = new ContentExtractor();
extractor.setConf(config);
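
To illustrate why the default drops so much of a long document, and why limits of 1000 effectively turn the cut-off off, here is a rough sketch of a "first N and last M pages" filter. This is not CERMINE's code: the PageWindow class and selectedPages method are made up for illustration, and the exact boundary handling inside CERMINE may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified model of a front/back page window.
// PageWindow and selectedPages are hypothetical names, not part of CERMINE.
class PageWindow {

    /** Returns the 1-based page numbers that fall inside the front or back window. */
    static List<Integer> selectedPages(int totalPages, int frontLimit, int backLimit) {
        List<Integer> pages = new ArrayList<>();
        for (int page = 1; page <= totalPages; page++) {
            boolean inFront = page <= frontLimit;            // among the first frontLimit pages
            boolean inBack = page > totalPages - backLimit;  // among the last backLimit pages
            if (inFront || inBack) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        int total = 320;
        // Limits of 20 and 20: only 40 of 320 pages are kept.
        System.out.println(selectedPages(total, 20, 20).size());     // prints 40
        // Limits larger than the document: every page is kept.
        System.out.println(selectedPages(total, 1000, 1000).size()); // prints 320
    }
}
```

Under this model a ~320-page file with limits of 20 and 20 loses 280 pages, which is in line with the "over 250 pages" reported above. Keep in mind that processing every page also means the whole document structure is held in memory, as noted earlier in the thread.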

Apr 27 '20 17:04 eichmann