sanskrit_parser
Metrics for evaluating performance of lexical/morphological analyzer
We need to develop metrics for evaluating the performance of the analyzers. This would be useful when trying to choose between databases for looking up tags, or between different approaches to lexical/morphological analysis.
From https://github.com/kmadathil/sanskrit_parser/issues/82#issuecomment-356168883
Perhaps precision can be defined as the percentage of passes in the UoHD test suite. Recall could perhaps be measured by checking whether each reported split, when joined back, reproduces the input sentence.
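As a minimal sketch of the two proposed measures (the pass/fail results and the join function here are toy placeholders, not the real test suite or sandhi joiner):

```python
def suite_precision(results):
    """Precision as the fraction of UoHD test cases that pass."""
    return sum(1 for r in results if r == "pass") / len(results)

def split_recall(input_sentence, splits, join):
    """Recall as the fraction of reported splits that re-join to the input."""
    return sum(1 for s in splits if join(s) == input_sentence) / len(splits)

# Toy illustration: plain concatenation stands in for real sandhi joining
print(suite_precision(["pass", "fail", "pass", "pass"]))   # 0.75
print(split_recall("ab", [["a", "b"], ["ab"]], "".join))   # 1.0
```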
This would be a good start. Currently we do not pay much attention to the number of passes/failures in the test suite. My concern is that the UoHD dataset entries are not broken down into simple roots, and we are using the splitter to split them until we get words that are in the db (as discussed before - https://github.com/kmadathil/sanskrit_parser/issues/19#issuecomment-315433433). I am not sure that this will give us an accurate representation of the performance.
We should start looking into the DCS database to see if it is more appropriate. E.g. for the Level 1 database/tag lookups, we could perhaps just start with the roots provided in the DCS database and see how many are identifiable using the level 1/tag lookup db. We can then start building the tests up to the lexical/morphological levels.
First results from a "quick and dirty" script I wrote to evaluate word lookup accuracy (recall, if you will): The script goes through the DCS database, and for every word tagged as a single word (i.e. no samAsa/sandhi), it checks if the word is recognized as a valid word by the two level 1 lookup options.
Inria lookup recognized 1,447,362 out of 2,333,485 words (~62.0%)
Sanskrit data recognized 1,735,547 out of 2,333,485 words (~74.4%)
At first pass, it looks like the sanskrit data based lookup recognized about 290k more words. I think it is definitely worthwhile to move to it. As we incorporate more and more of the Inria db into it, it will always be the better choice from a recall perspective.
It may look like the overall accuracy is quite low, but there are two mitigating factors:
- Due to this issue, some words in the DCS which are samastapadas are currently seen as akhandapadas, which reduces the overall accuracy.
- kriyApadas with upasargas are stored in the DCS as one word, e.g. vyAkhyAsyAmaH. These cannot be recognized by the L1 lookup in our setting, so the actual accuracy may be somewhat higher.
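The word-level recall measurement described above can be sketched as follows; the toy stand-in lexicon and word list replace the real Inria/sanskrit_data lookups and the DCS corpus:

```python
def word_recall(dcs_words, lexicon):
    """Fraction of DCS single words recognized by a level-1 lookup.

    `lexicon` stands in for the Inria or sanskrit_data lookup;
    `dcs_words` stands in for single-word (no samAsa/sandhi) DCS entries.
    """
    recognized = sum(1 for w in dcs_words if w in lexicon)
    return recognized, len(dcs_words), recognized / len(dcs_words)

# Toy SLP1-encoded example: the upasarga verb form goes unrecognized
lexicon = {"naraH", "nArI", "vA", "kaH", "cit"}
dcs_words = ["naraH", "nArI", "vyAKyAsyAmaH"]
print(word_recall(dcs_words, lexicon))  # (2, 3, 0.666...)
```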
Next steps:
- Need to ensure that the tags from the lookup contain the annotation in the DCS.
- We could look at a measure of average precision (= 1 / the number of candidates retrieved for each lookup). In the sanskrit data based approach, false retrievals are possible, because we try to predict whether a form can be arrived at using the anta. Alternatively, high recall with low precision may be acceptable for the form lookup, since higher layers will filter the incorrect forms out.
- Repeat the above for the lexical and morphological analyzer. (Will need to handle the upasarga problem at this stage).
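The average-precision measure from the second bullet above (the reciprocal of the candidate count, averaged over lookups) could be computed as follows; the candidate counts here are made up for illustration:

```python
def average_precision(candidate_counts):
    """Mean of 1 / (number of candidates retrieved) over all lookups.

    A lookup returning exactly one candidate contributes 1.0;
    noisier lookups drag the average down.
    """
    return sum(1.0 / n for n in candidate_counts) / len(candidate_counts)

# Three lookups returning 1, 2, and 4 candidates respectively
print(average_precision([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 = 0.5833...
```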
I will clean up my "quick and dirty" script to make it more amenable for the next steps and check it in by the weekend.
I have added some metrics for word level accuracy on the sanskrit_util branch here - https://github.com/kmadathil/sanskrit_parser/tree/sanskrit_util/metrics
I have also started working on evaluating lexical split accuracy using the dataset, as part of the project referred to in #85. Currently I am planning to use the BLEU score or chrF score (from the machine translation literature) to evaluate the accuracy of these splits. Please let me know if there are any other ideas for evaluating accuracy.
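As a sketch of what chrF-style scoring of a split might look like (a simplified re-implementation of the character n-gram F-score, not the exact metric from the MT literature; the gold split and candidates here are illustrative):

```python
from collections import Counter

def char_ngrams(s, n):
    """Multiset of character n-grams of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(reference, hypothesis, max_n=6, beta=3.0):
    """Simplified chrF: character n-gram F-beta, averaged over n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        ref, hyp = char_ngrams(reference, n), char_ngrams(hypothesis, n)
        if not ref or not hyp:
            continue
        overlap = sum((ref & hyp).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0

# Compare candidate splits (joined with spaces) against the gold split
gold = "kaH cit naraH vA nArI"
print(chrf(gold, "kaH cit naraH vA nArI"))    # identical -> 1.0
print(chrf(gold, "kaH cit naraH vAna arI"))   # wrong split -> below 1.0
```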
I concur
Scripts for evaluating lexical split accuracy added to scoring branch here - https://github.com/kmadathil/sanskrit_parser/blob/scoring/metrics/lexical_split_scores.py
Adding a use case below where scoring may help resolve the best split. Can the tool choose [kaH, cit, naraH, vA, nArI] as the best output?
```
> python -m sanskrit_parser.lexical_analyzer.sanskrit_lexical_analyzer "kaScit naraH vA nArI" --debug --split
Input String: kaScit naraH vA nArI
Input String in SLP1: kaScit naraH vA nArI
Start Split
End DAG generation
End pathfinding 1527393212.680358
Splits:
[kaH, cit, naraH, vAna, arI]
[kaH, cit, naraH, vAH, nArI]
[kaH, cit, naraH, vA, nArI]
[kaH, cit, na, raH, vAna, arI]
[kaH, cit, naraH, vAH, na, arI]
[kaH, cit, naraH, vA, na, arI]
[kaH, cit, na, raH, vAH, nArI]
[kaH, cit, naraH, vA, AnA, arI]
[kaH, cit, na, raH, vA, nArI]
[kaH, cit, naraH, vA, A, nArI]
-----------
Performance
Time for graph generation = 0.024774s
Total time for graph generation + find paths = 0.032885s
```
I worked a lot on this problem, and can vouch that the approach in https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words/11642687 is the best solution around.
All we need is a frequency count for lexemes. Some ideas about where to obtain frequencies are at https://github.com/drdhaval2785/samasasplitter/issues/3#issuecomment-312500848
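The frequency-based approach from that Stack Overflow answer can be sketched as follows: dynamic programming over the unspaced input, with each word costed by a Zipf-style log-inverse-frequency of its rank. The toy lexeme list and its frequency ordering here are invented for illustration:

```python
from math import log

def make_cost(words_by_frequency):
    """Zipf-style cost: more frequent (lower-rank) words are cheaper."""
    n = len(words_by_frequency)
    return {w: log((rank + 1) * log(n)) for rank, w in enumerate(words_by_frequency)}

def best_split(text, cost, max_word_len=20):
    """Minimum-cost segmentation of `text` into known words (DP over prefixes)."""
    INF = float("inf")
    best = [(0.0, 0)]  # for each prefix: (total cost, length of its last word)
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(max(0, i - max_word_len), i):
            word = text[k:i]
            candidates.append((best[k][0] + cost.get(word, INF), i - k))
        best.append(min(candidates))
    # Walk back through the stored last-word lengths to recover the split
    out, i = [], len(text)
    while i > 0:
        _, k = best[i]
        out.append(text[i - k:i])
        i -= k
    return list(reversed(out))

# Hypothetical frequency ordering, most frequent first
cost = make_cost(["naraH", "vA", "nArI", "kaH", "cit", "na", "raH", "arI"])
print(best_split("kaHcitnaraHvAnArI", cost))  # [kaH, cit, naraH, vA, nArI]
```

With real frequency counts from a corpus such as the DCS, the same recurrence would prefer [kaH, cit, naraH, vA, nArI] over splits built from rarer fragments.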
@codito - Not sure how the whitespace problem and this issue are related - this one is about evaluating accuracy, is it not? Your issue is about picking one split over another.
I thought this issue also tracks using a score to ensure the most likely split gets higher priority in the output. Please ignore if I confused two different things.