
LingMessCoref cannot handle long texts

Open: teowz46 opened this issue 1 year ago • 1 comment

It seems that LingMessCoref comes with a max_doc_len of 4096. Is there any way to circumvent this so that it works on documents of any length?

Edit: I am mainly trying to get the character spans of clusters. I tried to work around this limitation by manually chunking the document and piecing together the detected clusters. My chunks are about 4000 tokens each, with an overlap of 2000 tokens. To build the overall clusters, I take the coreference pairs from each chunk and build a graph in which these pairs are edges; each connected component is then a cluster. However, this approach does not work well because:

  • the model seems to perform worse on document chunks (especially those later in a document), probably because information in the earlier sections is already lost?
  • all it takes is one mistake from the model (e.g. in Chunk 1, the model thinks a "he" is referring to "John", but in Chunk 2, the model thinks the same "he" is referring to "Peter") for mentions of separate entities to be lumped together.
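
For reference, here is a minimal sketch of the chunk-and-merge approach described above, written in plain Python. It does not use fastcoref: `chunk_spans` and `merge_clusters` are hypothetical helper names, and the sketch assumes every mention has already been mapped to document-level character offsets `(start, end)` before merging. The merge step uses union-find, which gives the same result as taking connected components of the pair graph.

```python
from collections import defaultdict


def chunk_spans(n_tokens, size=4000, overlap=2000):
    """Return (start, end) token windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper,
    not part of fastcoref.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    spans = []
    start = 0
    while True:
        end = min(start + size, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return spans


def merge_clusters(pairs):
    """Merge coreference pairs into clusters via union-find.

    `pairs` is an iterable of (mention_a, mention_b), where each mention
    is a (start, end) span in document-level character offsets (assumed).
    Returns a sorted list of sorted mention lists.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for mention in list(parent):
        groups[find(mention)].append(mention)
    return sorted(sorted(group) for group in groups.values())


# Windows for a 10000-token document with the settings from this post.
windows = chunk_spans(10000, size=4000, overlap=2000)

# Illustrative spans only: "John", "he", "Peter".
john, he, peter = (0, 4), (20, 22), (40, 45)
# Chunk 1 links "he" to "John"; chunk 2 links the same "he" to "Peter".
clusters = merge_clusters([(john, he), (peter, he)])
```

With the two conflicting pairs in the example, `clusters` contains a single cluster holding all three mentions, which is exactly the second failure mode listed above: one wrong link in any chunk joins two otherwise separate entities.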

teowz46, Dec 13 '23 02:12

Also having this issue; had to fall back to FCoref.

v5out, Apr 24 '24 18:04