coref
Sentence index when splitting long sentences into non-overlapping chunks
Hi @mandarjoshi90, thanks much for this awesome library.
Quick question: I am attempting coreference resolution on a corpus where many tokenized sentences are longer than max_segment_len (for example, max_segment_len = 384 for spanbert_base). I am handling this by splitting each such sentence into multiple non-overlapping segments.
My questions:
- Is this a valid approach? (in line with your response to another question here: https://github.com/mandarjoshi90/coref/issues/33)
- Let’s say the sentence index of a sample long sentence is X. When the tokens of this sentence are split across two segments (S1 and S2), will the sentence index for the tokens in both S1 and S2 be X? Or does this need to be handled differently?
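
To make the second question concrete, here is a minimal sketch of the chunking described above. It assumes that every chunk keeps the original sentence index X, which is exactly the point in question, so it should not be read as confirmed behavior. `chunk_sentence` is a hypothetical helper and not part of the coref repository, and the sketch ignores any special tokens that may count toward max_segment_len.

```python
from typing import List, Tuple


def chunk_sentence(
    tokens: List[str], sentence_index: int, max_segment_len: int
) -> List[Tuple[List[str], List[int]]]:
    """Split one tokenized sentence into non-overlapping chunks.

    Hypothetical helper: each chunk is paired with a per-token list of
    sentence indices, all set to the original sentence index (assumption).
    """
    if max_segment_len <= 0:
        raise ValueError("max_segment_len must be positive")
    chunks = []
    # Step through the sentence in strides of max_segment_len (no overlap).
    for start in range(0, len(tokens), max_segment_len):
        piece = tokens[start:start + max_segment_len]
        chunks.append((piece, [sentence_index] * len(piece)))
    return chunks


# Toy example: a 10-token "sentence" with index X = 7 and a segment limit of 4.
tokens = [f"tok{i}" for i in range(10)]
segments = chunk_sentence(tokens, sentence_index=7, max_segment_len=4)
for seg_tokens, seg_sentence_ids in segments:
    print(len(seg_tokens), seg_sentence_ids)
```

With these toy values the sentence is split into chunks of 4, 4, and 2 tokens, and every token carries sentence index 7.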
Thank you.