
SpanLabelView.addConstituent inefficient

Open cowchipkid opened this issue 7 years ago • 1 comments

Each time we add a constituent to the SpanLabelView, we sort the results after appending the new constituent. This appears to be massively inefficient for very large files. When creating the text annotation from tokenized text, this sort gets done over and over and over, when in fact it would be much more efficient if it were only done once.

cowchipkid avatar Jun 20 '17 16:06 cowchipkid

I am reprocessing my corpus of documents, and I am seeing cases where the tokenizer takes hours. Huge documents (2M), with half a million tokens each, are just choking the system. The tokenizer itself runs in order N, but this sorting in the addConstituent method is a big problem. The fix is easy: take it out and make it a separate method. But what does that break?
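To make the idea concrete, here is a rough sketch of the two patterns side by side. It is not the actual SpanLabelView code: `SpanViewSketch`, `Span`, `addAndSort`, `add`, and `getSpans` are all hypothetical names made up for illustration. Sorting after every append costs at least a linear pass per call, so N appends cost on the order of N squared overall. Deferring the sort until the spans are first read means it runs once per batch of appends.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a span-labeled view; NOT the cogcomp-nlp API.
class SpanViewSketch {

    // Minimal span: token offsets plus a label (assumed shape, for illustration only).
    static final class Span {
        final int start;
        final int end;
        final String label;

        Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    // Order spans by start offset, then by end offset.
    static final Comparator<Span> BY_POSITION =
            Comparator.comparingInt((Span s) -> s.start).thenComparingInt(s -> s.end);

    private final List<Span> spans = new ArrayList<>();
    private boolean sorted = true; // an empty list is trivially sorted

    // The pattern described in this issue: re-sort the whole list on every append.
    void addAndSort(Span s) {
        spans.add(s);
        spans.sort(BY_POSITION);
    }

    // Alternative: append only, and remember that the list may be out of order.
    void add(Span s) {
        spans.add(s);
        sorted = false;
    }

    // Sort lazily, once, the first time the spans are read after any appends.
    List<Span> getSpans() {
        if (!sorted) {
            spans.sort(BY_POSITION);
            sorted = true;
        }
        return spans;
    }
}
```

With the lazy version, callers that go through `getSpans()` still see sorted output, which partly addresses the "what does that break?" question. It would still break any code that reads the underlying list directly between appends and expects it to be sorted; whether such code exists in the library is not something this sketch checks.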

cowchipkid avatar Jun 20 '17 18:06 cowchipkid