dkpro-c4corpus

O(n!) processing in tag name/path for Paragraph in dedupe code

Open tfmorris opened this issue 9 years ago • 2 comments

Processing this segment:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz

stalls somewhere between 7k and 8k records, when it reaches a document with a deeply nested tag structure. That document triggers the O(n!) behavior (in tree depth) of Paragraph.getPath(Node).

The document is pathological in that it is nested many thousands of levels deep, but it causes the entire segment to fail when the mapper gets killed.

tfmorris, Apr 03 '16 21:04
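
The thread does not show the body of Paragraph.getPath(Node), so the following is only a minimal sketch of one way to build a tag path in a single pass, with work proportional to tree depth. `PathSketch`, its `Node` class, and `pathOf` are hypothetical names made up for illustration; they are not the project's classes or the HTML parser's API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustration only: a hypothetical stand-in, not the project's Paragraph class. */
final class PathSketch {

    /** Hypothetical minimal DOM node: a tag name plus a link to its parent. */
    static final class Node {
        final String name;
        final Node parent;

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /**
     * Builds a root-first path such as "html/body/div" by walking the
     * parent links once. Iterating (rather than recursing) also avoids
     * stack overflow on documents nested thousands of levels deep.
     */
    static String pathOf(Node node) {
        Deque<String> names = new ArrayDeque<>();
        for (Node n = node; n != null; n = n.parent) {
            names.addFirst(n.name); // ancestors end up in front of descendants
        }
        return String.join("/", names);
    }
}
```

Each ancestor is visited once and the string is assembled once at the end, so a node at depth d costs O(d) time. The resulting string is still O(d) characters long, so this addresses processing time only, not memory.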

Many thanks, Tom!

Ideally, the fix should be tested on the benchmark data for boilerplate removal to make sure it delivers the same results.

habernal, Apr 04 '16 07:04

The fix needs improvement because, although it fixes the processing time issue, it can still exhaust the heap in a constrained environment like a Hadoop cluster. I'm testing a revised version that doesn't keep the entire string of tag names, since that string doesn't appear to be used anywhere.

I don't see any tests in the dkpro-c4corpus-boilerplate sub-project. How does one run the tests you are describing?

tfmorris, Apr 04 '16 14:04
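
The revised version mentioned above is not shown in the thread. As a rough sketch of the general idea (not storing the full string of tag names per node), the example below keeps only a node's own tag name and its depth. Which fields the deduplication code actually needs is an assumption here, and `PathSummarySketch`, `Node`, and `Summary` are hypothetical names used only for illustration.

```java
/** Illustration only: hypothetical classes, not the project's implementation. */
final class PathSummarySketch {

    /** Hypothetical minimal DOM node: a tag name plus a link to its parent. */
    static final class Node {
        final String name;
        final Node parent;

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Fixed-size record kept in place of a full path string. */
    static final class Summary {
        final String tagName;
        final int depth;

        Summary(String tagName, int depth) {
            this.tagName = tagName;
            this.depth = depth;
        }
    }

    /**
     * Counts ancestors instead of concatenating their names, so the
     * stored result does not grow with nesting depth. The root has depth 0.
     */
    static Summary summarize(Node node) {
        int depth = 0;
        for (Node n = node.parent; n != null; n = n.parent) {
            depth++;
        }
        return new Summary(node.name, depth);
    }
}
```

With a full path string per node, a chain of d nested elements stores on the order of d² characters in total; a fixed-size summary per node keeps that linear in d, which matters on a heap-constrained mapper.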