Performance issue with TokenCaseTransformer

Open cbretsch opened this issue 9 years ago • 0 comments

The TokenCaseTransformer seems to have a performance issue when a large number of token annotations have to be changed. My reproduction case: lowercasing two files, (1) 1,000 lines each containing 'hello' and (2) 1,000 lines each containing 'Hello', using a rather simple aggregate engine with two AEs: OpenNlpSegmenter, which produces 1,000 tokens in both cases, and TokenCaseTransformer/LOWERCASE, which only changes tokens for file #2. The pipeline for file #1 is processed without issues; file #2, however, takes at least 10 times longer than file #1, and the runtime is even longer for any file with more than 1,000 tokens.
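For anyone who wants to regenerate the two input files rather than download the attachments, a minimal sketch in plain Java follows. The class name `ReproInputs` and its helper methods are hypothetical and only serve this illustration. The mapping of file names to contents ('hello' in hello_lowercase.txt, 'Hello' in hello_uppercase.txt) is an assumption based on the attachment names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of DKPro Core): writes the two test inputs
// described above, one word per line.
class ReproInputs {

    // Returns an immutable list holding `count` copies of `word`.
    static List<String> repeatedLines(String word, int count) {
        return Collections.nCopies(count, word);
    }

    // Writes `count` lines, each containing `word`, to `target` as UTF-8.
    static Path writeInput(Path target, String word, int count) throws IOException {
        return Files.write(target, repeatedLines(word, count), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Default to 1,000 lines; pass a larger number to test bigger inputs.
        int lines = args.length > 0 ? Integer.parseInt(args[0]) : 1000;

        // Assumed mapping: lowercase file needs no changes, uppercase file does.
        writeInput(Paths.get("hello_lowercase.txt"), "hello", lines);
        writeInput(Paths.get("hello_uppercase.txt"), "Hello", lines);
    }
}
```

Passing a larger line count as the first argument (for example 10000) should make it easier to see how the runtime grows with the number of changed tokens.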

During debugging I observed that the lowercase change requests are processed smoothly in the process method. However, the method JCasTransformerChangeBase_Impl.afterProcess takes a disproportionate amount of time. It seems as if copying the annotations to the output CAS is the issue. As a consequence, documents with a rather large number of tokens cannot be processed by the annotator. I had to exclude it from my pipeline in order to finish processing 50,000 documents within an acceptable time.

Attachments: pipeline.txt hello_lowercase.txt hello_uppercase.txt

Jan 04 '17 12:01 cbretsch