
Sentence index when splitting long sentences into non-overlapping chunks

Open nikarjunagi opened this issue 3 years ago • 0 comments

Hi @mandarjoshi90, thanks much for this awesome library.

Quick question: I am attempting coreference resolution on a corpus where many tokenized sentences contain more tokens than max_segment_len (for example, spanbert_base with max_segment_len = 384). I am handling this by splitting each such sentence into multiple non-overlapping segments.

My questions:

  1. Is this a valid approach? (in line with your response to another question here: https://github.com/mandarjoshi90/coref/issues/33)
  2. Let’s say the sentence index of a sample long sentence is X. When the tokens of this sentence are split across two segments (S1 and S2), should the sentence index for the tokens in both S1 and S2 be X? Or does this need to be handled differently?
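
To make the setup concrete, here is a minimal sketch of the chunking described above. The function name `chunk_sentences` is hypothetical, and reusing the original sentence index X for every chunk of a split sentence is exactly the assumption being asked about in question 2, not confirmed library behavior. The sketch also ignores any special tokens the model may add to each segment.

```python
from typing import List, Tuple


def chunk_sentences(
    sentences: List[List[str]], max_segment_len: int
) -> Tuple[List[List[str]], List[List[int]]]:
    """Pack tokenized sentences into non-overlapping segments.

    Returns the segments plus a parallel list holding, for every token,
    the index of the sentence it came from.
    ASSUMPTION: a sentence split across segments keeps the same index.
    """
    segments: List[List[str]] = []
    sentence_map: List[List[int]] = []
    cur_tokens: List[str] = []
    cur_ids: List[int] = []

    def flush() -> None:
        # Close the current segment if it holds any tokens.
        if cur_tokens:
            segments.append(list(cur_tokens))
            sentence_map.append(list(cur_ids))
            cur_tokens.clear()
            cur_ids.clear()

    for sent_idx, sentence in enumerate(sentences):
        # Prefer starting a new segment over splitting a sentence mid-way.
        if len(cur_tokens) + len(sentence) > max_segment_len:
            flush()
        # Sentences longer than max_segment_len are cut into pieces.
        for start in range(0, len(sentence), max_segment_len):
            piece = sentence[start:start + max_segment_len]
            if len(cur_tokens) + len(piece) > max_segment_len:
                flush()
            cur_tokens.extend(piece)
            cur_ids.extend([sent_idx] * len(piece))

    flush()
    return segments, sentence_map


sentences = [["a", "b"], ["c", "d", "e", "f", "g"], ["h"]]
segments, sentence_map = chunk_sentences(sentences, max_segment_len=3)
print(segments)
print(sentence_map)
```

In this toy example the second sentence (index 1) is longer than the limit, so its tokens land in two segments and carry index 1 in both.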

Thank you.

nikarjunagi, Feb 23 '22 04:02