Tom Morris

Results 157 issues of Tom Morris

Fixes #136. This is a proof-of-concept port to Apache POI. It has only received VERY LIMITED testing and I'm not a picture uploader, so someone else who actually...

Fixes #6595. Also fixes #6527. Changes proposed in this pull request: - refactor InputStreamReader handling to DRY up handling of null encoding and our fake UTF-8 + BOM encoding (#6527)...

Type: Bug
Priority: Critical
encoding
import

We invented a private encoding, "UTF-8-BOM", to handle the wonky Microsoft format, but because it's not listed in the standard Java character sets, it's not available in the manual selection...

Type: Bug
Theme: UX/Usability
encoding
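The two encoding items above describe the same underlying chore: treating a null encoding and the private "UTF-8-BOM" name as UTF-8 and skipping the byte order mark. A minimal sketch of one way to do that in a single place follows. The class `BomAwareReaders` and its methods are hypothetical names chosen for illustration, not the project's actual code, and the rule "strip the BOM only when decoding as UTF-8" is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the project's actual implementation.
final class BomAwareReaders {
    private static final int BOM_LENGTH = 3; // UTF-8 BOM is EF BB BF

    // Opens a reader, mapping null and the private "UTF-8-BOM" name to UTF-8.
    static Reader open(InputStream in, String encoding) throws IOException {
        boolean utf8 = encoding == null
                || encoding.equalsIgnoreCase("UTF-8-BOM")
                || encoding.equalsIgnoreCase("UTF-8");
        if (!utf8) {
            // Any other name must be a charset the JVM knows about.
            return new InputStreamReader(in, Charset.forName(encoding));
        }
        PushbackInputStream pushback = new PushbackInputStream(in, BOM_LENGTH);
        byte[] head = new byte[BOM_LENGTH];
        int n = 0;
        // Read up to three bytes; short streams simply end early.
        while (n < BOM_LENGTH) {
            int r = pushback.read(head, n, BOM_LENGTH - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == BOM_LENGTH
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    // Drains a reader into a string; convenience for small inputs.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

Because the private name is resolved before `Charset.forName` is called, it never needs to appear in the JVM's list of standard charsets; how it would be surfaced in a selection UI is a separate question that this sketch does not address.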

This fixes the three issues mentioned above: - #27 - Allowing deeply nested documents to be processed as well as speeding up processing in general. Rather than continually backtracking to...

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it...

enhancement

The output from the boilerplate processor, e.g. /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt, appears to use a character encoding other than UTF-8. This causes strings such as Epogen® and “A-thal” to be corrupted.
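As an illustration of how this kind of corruption typically arises and how it is usually avoided, here is a minimal sketch. It assumes, without confirmation from the issue, that the output was written with a platform-default charset; the class `Utf8Output` and its methods are hypothetical and are not part of dkpro-c4corpus.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class Utf8Output {
    // Writes text with an explicit charset instead of the platform default.
    static void write(Path target, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Shows what a reader sees when UTF-8 bytes are decoded as ISO-8859-1:
    // every non-ASCII character turns into two or more Latin-1 characters.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }
}
```

With this sketch, `Utf8Output.misdecode` turns the registered sign in "Epogen®" into the two characters "Â®", which is the familiar symptom of a UTF-8/Latin-1 mismatch. Whether that is the exact mismatch in the file above is not established here.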

The conditional here is wrong: https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350, causing the algorithm to attempt to reclassify non-headings, not just headings. The inverted conditionals just to save a little indentation whitespace make my head...

The text normalization in [Utils.normalize()](https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/Utils.java#L117) seems pretty heavy-handed for something which is irreversible and non-optional. Additionally, it's not computationally expensive, so it can be done easily by downstream...

enhancement

Comparing these two files: - /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt - /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Python_Defaults_CleanEvalHTMLTestSubset/105.txt It appears that the Python program is dropping `&nbsp;` entities, but not decoding some others such as `&lt;`. The gold standard doesn't...

bug

Processing this segment: s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz stalls between 7k and 8k records when it encounters a deeply nested tag structure that triggers the O(n!) complexity in tree depth processing of Paragraph.getPath(Node). The...

enhancement
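For readers unfamiliar with the problem, one common way to keep path computation cheap on deeply nested trees is to walk the parent chain once, without recursion, and cache the path of every ancestor visited. The sketch below illustrates that general idea only. It is not the actual `Paragraph.getPath(Node)` code or the fix that was applied; `PathSketch`, `SimpleNode`, and the dot-separated path format are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch; the node type and path format are assumptions.
final class PathSketch {
    // Minimal stand-in for a DOM node: a tag name and a parent pointer.
    static final class SimpleNode {
        final String name;
        final SimpleNode parent;

        SimpleNode(String name, SimpleNode parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    // Paths already computed, keyed by node identity.
    private final Map<SimpleNode, String> cache = new IdentityHashMap<>();

    // Returns the dot-separated path from the root to the given node.
    String pathOf(SimpleNode node) {
        Deque<SimpleNode> pending = new ArrayDeque<>();
        SimpleNode current = node;
        // Climb until the root or an ancestor whose path is already cached.
        while (current != null && !cache.containsKey(current)) {
            pending.push(current);
            current = current.parent;
        }
        String prefix = current == null ? "" : cache.get(current);
        // Unwind from the topmost uncached ancestor down to the node itself.
        while (!pending.isEmpty()) {
            SimpleNode next = pending.pop();
            prefix = prefix.isEmpty() ? next.name : prefix + "." + next.name;
            cache.put(next, prefix);
        }
        return prefix;
    }
}
```

Each node is visited at most once across all calls, so the number of steps grows linearly with the number of nodes. The cached strings still cost memory proportional to depth per node, which is a trade-off a real implementation might handle differently.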