cogcomp-nlp SpanLabelView.addConstituent inefficient

Open cowchipkid opened this issue 7 years ago • 1 comments

Each time we add a constituent to the SpanLabelView, we sort the results after appending the new constituent. This appears to be massively inefficient for very large files. When creating the text annotation from tokenized text, this sort is repeated for every constituent added, when it would be much more efficient to sort only once.
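To make the cost concrete, here is a minimal sketch that counts comparator calls for the two strategies. It does not use cogcomp-nlp code: `SortCostDemo` and `comparisons` are hypothetical names, plain integers stand in for constituent start offsets, and it assumes the view's sort behaves like `List.sort` in the JDK.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; not part of cogcomp-nlp.
final class SortCostDemo {
    // Counts comparator calls made while building a sorted list of n integers.
    static long comparisons(int n, boolean sortOnEveryAdd) {
        long[] count = {0};
        Comparator<Integer> counting = (a, b) -> {
            count[0]++;
            return Integer.compare(a, b);
        };
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            starts.add(i);               // tokens arrive in document order
            if (sortOnEveryAdd) {
                starts.sort(counting);   // the pattern described in this issue
            }
        }
        if (!sortOnEveryAdd) {
            starts.sort(counting);       // a single sort after all additions
        }
        return count[0];
    }
}
```

Even when the input is already in order, every sort call has to scan the whole list to confirm that, so the work for sorting on every add grows roughly with the square of the number of constituents, while the single sort stays close to linear on in-order input.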

Jun 20 '17 16:06 cowchipkid

I am reprocessing my corpus of documents, and I am seeing cases where the tokenizer takes hours. Huge documents (2M), with half a million tokens each, are just choking the system. The tokenizer itself runs in O(N), but the sorting in the addConstituent method is a big problem. The fix is easy: take the sort out of addConstituent and make it a separate method. But what does that break?
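One possible shape for that fix is to defer the sort until the constituents are read, so that readers still see sorted data. This is a sketch under assumptions, not the cogcomp-nlp implementation: `Span`, `LazySortedSpanView`, `addSpan`, and `getSpans` are hypothetical stand-ins for the real classes and methods.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a labeled span; not the cogcomp-nlp Constituent class.
final class Span {
    final int start;
    final int end;
    final String label;

    Span(int start, int end, String label) {
        this.start = start;
        this.end = end;
        this.label = label;
    }
}

// Hypothetical view that appends without sorting and sorts lazily on read.
final class LazySortedSpanView {
    private static final Comparator<Span> BY_POSITION =
            Comparator.comparingInt((Span s) -> s.start).thenComparingInt(s -> s.end);

    private final List<Span> spans = new ArrayList<>();
    private boolean dirty = false;

    // Appends only. The view is marked dirty only if the new span is out of order.
    void addSpan(Span span) {
        if (!spans.isEmpty()
                && BY_POSITION.compare(spans.get(spans.size() - 1), span) > 0) {
            dirty = true;
        }
        spans.add(span);
    }

    // Sorts at most once per batch of out-of-order additions, then returns a read-only view.
    List<Span> getSpans() {
        if (dirty) {
            spans.sort(BY_POSITION);
            dirty = false;
        }
        return Collections.unmodifiableList(spans);
    }
}
```

In a design like this, what breaks is any code that reads the backing list directly instead of going through the accessor, since it could observe unsorted data between an add and the next read. Whether such code exists in cogcomp-nlp would need to be checked against the actual source.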

Jun 20 '17 18:06 cowchipkid
