Truncating long documents

Open juhoinkinen opened this issue 3 years ago • 5 comments

Hi, I found out that when using YAKE for long documents, it can be advantageous to truncate them in advance.

We have a test set of theses and dissertations (766 documents averaging 196k characters, or 22k words, each). When those documents are used as a gold standard for evaluating YAKE (or its integration in our application), the F1@5 score is 0.29. However, if the documents are first truncated to a fixed length of 15000 characters, the score improves to 0.33.
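For illustration, the truncation step described above could be sketched as follows. This is a minimal sketch, not part of YAKE: `truncate_text` and its `max_chars` parameter are hypothetical names, and cutting back to the last whitespace (so the text does not end on a partial word) is an added assumption rather than something the experiment above necessarily did.

```python
def truncate_text(text: str, max_chars: int = 15000) -> str:
    """Return at most max_chars characters, without ending on a partial word.

    Hypothetical helper for illustration; YAKE itself has no such option.
    """
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # The cut already falls on a word boundary: keep it as is.
    if text[max_chars].isspace() or cut[-1].isspace():
        return cut.rstrip()
    # Otherwise drop the trailing partial word, if there is an earlier boundary.
    parts = cut.rsplit(None, 1)
    return parts[0] if len(parts) == 2 else cut


# Possible usage with YAKE (requires the yake package to be installed):
#   import yake
#   extractor = yake.KeywordExtractor(lan="en", top=5)
#   keywords = extractor.extract_keywords(truncate_text(long_document))
```

The 15000-character default simply mirrors the value used in the experiment; the best cutoff probably depends on the corpus.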

Since this is such a simple way to possibly improve results, maybe a parameter/option for truncating the input text could be added directly to YAKE? Or, better yet, could the term position feature be tuned to suit long texts better, so that it gives even more importance to the beginning of the document?

juhoinkinen avatar May 17 '21 14:05 juhoinkinen

@juhoinkinen ,

I also think this is an issue, because YAKE has the T_position feature, which is based on the indices of the sentences a term was found in, under the hypothesis that the most important words appear near the top of the document.

So any term that appears mostly towards the end of the document will be ranked lower. In an ML-based research paper, for example, terms like "metrics", "accuracy", and "precision" mainly appear towards the end.
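To make the effect concrete, here is a small sketch of the position feature as I understand it from the YAKE paper: the double logarithm of 3 plus the median index of the sentences in which the term occurs. The function name `position_feature` is hypothetical, and the formula is reproduced from memory, so treat it as an assumption and check it against the YAKE source before relying on it.

```python
import math
from statistics import median


def position_feature(sentence_indices):
    """ln(ln(3 + median sentence index)), assumed from the YAKE paper.

    A larger value means the term tends to occur later in the document.
    """
    return math.log(math.log(3.0 + median(sentence_indices)))


# A term seen in the first few sentences versus one seen only near the end
# of a long document (sentence indices are made up for illustration).
early = position_feature([0, 1, 2])
late = position_feature([900, 950, 1000])
print(f"early term: {early:.3f}, late term: {late:.3f}")
```

Because the value only grows with the median sentence index, a term concentrated at the end of a very long document is penalized relative to one introduced at the start.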

But how do you plan to merge the lists of keywords we get from the segmented documents?

prateekkrjain avatar Jul 27 '21 11:07 prateekkrjain

Hi @juhoinkinen and @prateekkrjain. Interesting topic to be discussed @rncampos.

arianpasquali avatar Jan 13 '22 03:01 arianpasquali

Hi @juhoinkinen.

Providing the parameter for truncating is something to consider. Would you be willing to suggest a PR for that?

arianpasquali avatar Jan 13 '22 03:01 arianpasquali

@prateekkrjain

In this case I would probably break the document and manage the sections separately.
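One possible way to do that, sketched under stated assumptions: split the text into fixed-size chunks, extract keywords from each chunk, then merge the lists by keeping the best score seen for each keyword. The names `split_into_chunks` and `merge_keywords` are hypothetical, and the merge rule is just one option among several. It relies on the fact that YAKE returns `(keyword, score)` pairs where a lower score means a more relevant keyword.

```python
from typing import Dict, List, Tuple

Keyword = Tuple[str, float]  # (phrase, score); lower score = more relevant


def split_into_chunks(text: str, chunk_chars: int = 15000) -> List[str]:
    """Split text into consecutive chunks of at most chunk_chars characters.

    Naive fixed-width split: words at chunk borders may be cut in two.
    """
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def merge_keywords(per_chunk: List[List[Keyword]], top: int = 5) -> List[Keyword]:
    """Merge per-chunk keyword lists, keeping the lowest score per phrase."""
    best: Dict[str, float] = {}
    for keywords in per_chunk:
        for phrase, score in keywords:
            key = phrase.lower()  # treat case variants as the same keyword
            if key not in best or score < best[key]:
                best[key] = score
    # Sort ascending because lower scores are better.
    return sorted(best.items(), key=lambda item: item[1])[:top]


# Possible usage with YAKE (requires the yake package to be installed):
#   import yake
#   extractor = yake.KeywordExtractor(lan="en", top=5)
#   lists = [extractor.extract_keywords(c) for c in split_into_chunks(doc)]
#   keywords = merge_keywords(lists, top=5)
```

Note that scores from different chunks are not strictly comparable, since each chunk is scored on its own statistics, so taking the minimum is a heuristic rather than a principled combination.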

arianpasquali avatar Jan 13 '22 03:01 arianpasquali

Providing the parameter for truncating is something to consider. Would you be willing to suggest a PR for that?

At the moment I can't, but if I have more time at some point I could take a look at this.

juhoinkinen avatar Jan 14 '22 08:01 juhoinkinen