Implement Flesch reading-ease test

Open kristiank opened this issue 6 years ago • 14 comments

There exist many reading-ease tests that estimate how easy a text is to read and understand. Flesch is one of the most popular.

I have used EstNLTK to calculate the variables needed for the Flesch reading-ease score. I would now like to implement this as a feature of EstNLTK. What are my options: should I concentrate on version 1.4 or 1.6? Is there documentation for EstNLTK module implementers somewhere?
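
For reference, the standard Flesch reading-ease formula combines the average sentence length with the average number of syllables per word. The sketch below assumes the three counts are already available; the function name is illustrative and not part of EstNLTK, and the constants are the ones calibrated for English, so they may need adjusting for Estonian.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Classic Flesch reading-ease score (constants calibrated for English)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Counts chosen purely for illustration.
print(flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150))
```

Higher scores indicate text that is easier to read.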

kristiank avatar Dec 19 '18 13:12 kristiank

Since the reading-ease score applies to either sentences or whole texts, one could also imagine tagging the score on the sentence and text layers. Which would be preferred: storing the scores on the layers, or simply having a separate function that calculates and returns the score for a given Text object?

kristiank avatar Dec 19 '18 14:12 kristiank

For development, the EstNLTK version devel_1.6 is the best choice. There are some tutorials, and since this branch is in active development, support is available if issues arise. Alas, the installation is more complicated.

For a starting point you can use dummy SentenceFleschScoreRetagger example.

paultammo avatar Dec 20 '18 08:12 paultammo

I agree with Paul that it would be great if you could use the version that is most actively developed. However, this means that you have to download the devel_1.6 source and build and install estnltk from it. There is also some extra learning to do, because the API in version 1.6 differs from that in version 1.4. But if you choose this path, I'll be happy to help with any issues related to building the library or using the new API.

Using version 1.4 is also OK. Version 1.4 does not have a common interface for Text taggers, but you can examine the code of the Text class and some existing taggers to get an idea of how dependency management works and how output layers can be added. NounPhraseChunker is one example, as it works as a stand-alone tagger and is not integrated as a callable from the Text object.

As for whether to calculate the score for sentences or for whole texts: can I use sentence scores to easily calculate the score for the whole text? If not, then I would probably prefer a method for scoring a Text object, because I can use the split_by method to split a Text object into sentence Text objects and then calculate their scores one by one. Still, it would also be nice to have scores for both sentences and the text, as in Paul's example.
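
On combining sentence scores: with the standard Flesch formula, a plain average of per-sentence scores generally does not equal the whole-text score, because the syllables-per-word term is effectively weighted by sentence length. If per-sentence word and syllable counts are kept, the text score can be recomputed exactly. A minimal sketch with made-up counts; the function names are illustrative and not EstNLTK API:

```python
from typing import List, Tuple


def flesch(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch reading-ease formula (English-calibrated constants)."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def text_score_from_sentence_counts(counts: List[Tuple[int, int]]) -> float:
    """Compute the text-level score from per-sentence (words, syllables) pairs."""
    total_words = sum(words for words, _ in counts)
    total_syllables = sum(syllables for _, syllables in counts)
    return flesch(total_words, len(counts), total_syllables)


# Hypothetical (words, syllables) counts for two sentences.
counts = [(2, 4), (10, 12)]

sentence_scores = [flesch(words, 1, syllables) for words, syllables in counts]
naive_mean = sum(sentence_scores) / len(sentence_scores)
exact = text_score_from_sentence_counts(counts)

print("mean of sentence scores:", naive_mean)
print("score from summed counts:", exact)
```

This suggests that storing the raw counts alongside the sentence scores would make the text-level score cheap to derive.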

soras avatar Dec 20 '18 09:12 soras

@paultammo

For a starting point you can use dummy SentenceFleschScoreRetagger example.

I have the latest EstNLTK installed via conda. I am unable to reproduce your example code. I get the following error:

sentence_flech_score_retagger.retag(text)

~/miniconda3/envs/flesch16/lib/python3.6/site-packages/estnltk/taggers/retagger.py in retag(self, text, status)
     44             This can be used to store metadata on layer creation.
     45         """
---> 46         self._change_layer(text, status)
     47         return text
     48 

TypeError: _change_layer() missing 1 required positional argument: 'status'

When looking at the code for retagger.py I understand I can simply give None as status. Is this correct?

Changing the code to sentence_flech_score_retagger.retag(text, None) doesn't change anything and the same error message is shown.

kristiank avatar Jan 11 '19 11:01 kristiank

Sorry for rushing. I now understand that retag in my EstNLTK 1.6.2beta installed from conda uses a signature without layers (i.e. retag(self, text, status) instead of retag(self, text, layers, status)), but the code you posted seems to depend on retag(self, text, layers, status).

kristiank avatar Jan 11 '19 11:01 kristiank

My example code works with the devel_1.6 branch. That branch has changed significantly since the latest conda release, so the best way to make the example work is to git pull the devel_1.6 branch.

paultammo avatar Jan 11 '19 11:01 paultammo

But both branches have the same retag function. Branch devel_1.6 has:

def retag(self, text: Text, status: dict = None, 
                check_output_consistency: bool=True ) -> Text:

and branch version_1.6 has:

def retag(self, text: Text, status: dict = None) -> Text:
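
One way to check which parameters an installed version actually expects is the standard library's inspect.signature. The sketch below uses two stand-in classes (hypothetical, not EstNLTK code) that only mimic the two retag signatures quoted above; with EstNLTK installed, the same helper could presumably be pointed at a real retagger's retag or _change_layer method.

```python
import inspect


class ReleaseStyleRetagger:
    """Stand-in mimicking the version_1.6 signature quoted above."""

    def retag(self, text, status: dict = None):
        return text


class DevelStyleRetagger:
    """Stand-in mimicking the devel_1.6 signature quoted above."""

    def retag(self, text, status: dict = None, check_output_consistency: bool = True):
        return text


def parameter_names(method):
    """Return the parameter names of a method, leaving out `self`."""
    return [name for name in inspect.signature(method).parameters if name != "self"]


for cls in (ReleaseStyleRetagger, DevelStyleRetagger):
    print(cls.__name__, parameter_names(cls.retag))
```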

kristiank avatar Jan 11 '19 12:01 kristiank

@paultammo I now got it working. Your naming of the argument raw_text was a bit misleading; once I interpreted it as text (i.e. being of type Text) and did a bit of rewriting, I got it running, and I think I can implement the rest by myself now. Thank you.

kristiank avatar Jan 11 '19 12:01 kristiank

@soras I can't find this simple thing in the tutorials: how to loop through the nested layers?

I am currently trying this:

from estnltk import Text

text = Text("esimene lause. teine lause. kolmas pikem lause.")
text.tag_analysis()
for sentence in text.sentences.layer:
    for token in sentence.layer:
        analysises = token.get_attributes(['morph_analysis'])
        for analysis in analysises:
            print(type(analysis))

The type of analysis is a Python list. What should I do to access the analyses of a token?

kristiank avatar Jan 15 '19 08:01 kristiank

The most direct way of accessing the morphological analyses is via attributes:

from estnltk import Text

text = Text("esimene lause. teine lause. kolmas pikem lause.")
text.tag_layer()
for sentence in text.sentences:
    for word in sentence:
        # Full analysis of the word:
        print(word.morph_analysis)

        # Lists of specific elements of morph:
        print(word.morph_analysis.text)
        print(word.morph_analysis.lemma)
        print(word.morph_analysis.partofspeech)
        print(word.morph_analysis.form)

        # First lemma in the word's analyses:
        print(word.morph_analysis.lemma[0])
        # is the same as the lemma of the first analysis:
        print(word.morph_analysis[0].lemma)

        # Blank line between words:
        print()

This tutorial gives some additional examples.

soras avatar Jan 15 '19 09:01 soras

Here's a first go at this: https://gist.github.com/kristiank/98202214e016448979fcefdb3d745598#file-flesch_retagger-ipynb

kristiank avatar Jan 15 '19 10:01 kristiank

@soras and @paultammo, what is the standard procedure to check whether a layer has been tagged/analyzed? I would like the Flesch retagger's retag(text) to simply take a Text object as input and, if the sentences and morph_analysis layers haven't been tagged yet, to do this by itself.

kristiank avatar Jan 15 '19 11:01 kristiank

if 'sentences' not in text.layers:
    text.tag_layer(['sentences'])
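
The same check could be wrapped in a small helper that covers several layers at once. This is only a sketch: ensure_layers is a hypothetical name, and the FakeText class is a stand-in so the example runs without EstNLTK. It assumes, as in the snippet above, that text.layers supports membership tests on layer names and that tag_layer accepts a list of names.

```python
def ensure_layers(text, layer_names):
    """Tag every layer in layer_names that the text does not have yet."""
    missing = [name for name in layer_names if name not in text.layers]
    if missing:
        text.tag_layer(missing)
    return text


class FakeText:
    """Stand-in for an EstNLTK Text object (hypothetical, for illustration only)."""

    def __init__(self):
        self.layers = set()
        self.calls = []

    def tag_layer(self, layer_names):
        # Record the request and pretend the layers were created.
        self.calls.append(list(layer_names))
        self.layers.update(layer_names)


text = FakeText()
ensure_layers(text, ['sentences', 'morph_analysis'])
ensure_layers(text, ['sentences'])  # already present, so nothing is tagged again
print(text.calls)
```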

paultammo avatar Jan 15 '19 11:01 paultammo

I have now created an initial pull request #106 for the code. I wasn't really sure where to place the code files, but maybe we can discuss that in the PR?

kristiank avatar Jan 23 '19 08:01 kristiank