sanskrit_parser
Metrics for evaluating performance of lexical/morphological analyzer
We need to develop metrics for evaluating the performance of the analyzers. This would be useful when trying to choose between databases for looking up tags, or between different approaches to lexical/morphological analysis.
From https://github.com/kmadathil/sanskrit_parser/issues/82#issuecomment-356168883
Perhaps precision can be defined as the percentage of passes in the UoHD test suite. Recall could perhaps be measured by checking whether each reported split, when joined back, reproduces the input sentence.
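As a minimal sketch of the two proposed measures (the pass/fail results and the join function here are toy placeholders, not the real test suite or sandhi joiner):

```python
def suite_precision(results):
    """Precision as the fraction of UoHD test cases that pass."""
    return sum(1 for r in results if r == "pass") / len(results)

def split_recall(input_sentence, splits, join):
    """Recall as the fraction of reported splits that re-join to the input."""
    return sum(1 for s in splits if join(s) == input_sentence) / len(splits)

# Toy illustration: plain concatenation stands in for real sandhi joining
print(suite_precision(["pass", "fail", "pass", "pass"]))   # 0.75
print(split_recall("ab", [["a", "b"], ["ab"]], "".join))   # 1.0
```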
This would be a good start. Currently we do not pay much attention to the number of passes/failures in the test suite. My concern is that the UoHD dataset entries are not broken down into simple roots, and we are using the splitter to split them until we get words that are in the db (as discussed before - https://github.com/kmadathil/sanskrit_parser/issues/19#issuecomment-315433433). I am not sure that this will give us an accurate representation of the performance.
We should start looking into the DCS database to see if it is more appropriate. E.g. for the Level 1 database/tag lookups, we could perhaps just start with the roots provided in the DCS database and see how many are identifiable using the level 1/tag lookup db. We can then start building the tests up to the lexical/morphological levels.
First results from a "quick and dirty" script I wrote to evaluate word lookup accuracy (recall, if you will): The script goes through the DCS database, and for every word tagged as a single word (i.e. no samAsa/sandhi), it checks if the word is recognized as a valid word by the two level 1 lookup options.
Inria lookup recognized 1,447,362 out of 2,333,485 words (~62.0%)
Sanskrit data recognized 1,735,547 out of 2,333,485 words (~74.4%)
At first pass, it looks like the sanskrit data based lookup recognized about 290k more words. I think it is definitely worthwhile to move to it. As we incorporate more and more of the Inria db into it, it will always be the better choice from a recall perspective.
It may look like the overall accuracy is quite low, but there are two mitigating factors:
- Due to this issue, some words in the DCS which are samastapadas are currently seen as akhandapadas, which reduces the overall accuracy.
- kriyApadas with upasargas are stored in the DCS as one word, e.g. vyAkhyAsyAmaH. These cannot be recognized by the L1 lookup in our setting, so the actual accuracy may be somewhat higher.
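The word-level recall measurement described above can be sketched as follows; the toy stand-in lexicon and word list replace the real Inria/sanskrit_data lookups and the DCS corpus:

```python
def word_recall(dcs_words, lexicon):
    """Fraction of DCS single words recognized by a level-1 lookup.

    `lexicon` stands in for the Inria or sanskrit_data lookup;
    `dcs_words` stands in for single-word (no samAsa/sandhi) DCS entries.
    """
    recognized = sum(1 for w in dcs_words if w in lexicon)
    return recognized, len(dcs_words), recognized / len(dcs_words)

# Toy SLP1-encoded example: the upasarga verb form goes unrecognized
lexicon = {"naraH", "nArI", "vA", "kaH", "cit"}
dcs_words = ["naraH", "nArI", "vyAKyAsyAmaH"]
print(word_recall(dcs_words, lexicon))  # (2, 3, 0.666...)
```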
Next steps:
- Need to ensure that the tags from the lookup contain the annotation in the DCS.
- We could look at a measure of average precision (= 1 / the number of candidates retrieved for each lookup). In the sanskrit data based approach, false retrievals are possible, because we try to predict whether a form can be arrived at using the anta. Alternatively, high recall with low precision may be acceptable for the form lookup, since higher layers will filter the incorrect forms out.
- Repeat the above for the lexical and morphological analyzer. (Will need to handle the upasarga problem at this stage).
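The average-precision measure from the second bullet above (the reciprocal of the candidate count, averaged over lookups) could be computed as follows; the candidate counts here are made up for illustration:

```python
def average_precision(candidate_counts):
    """Mean of 1 / (number of candidates retrieved) over all lookups.

    A lookup returning exactly one candidate contributes 1.0;
    noisier lookups drag the average down.
    """
    return sum(1.0 / n for n in candidate_counts) / len(candidate_counts)

# Three lookups returning 1, 2, and 4 candidates respectively
print(average_precision([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 = 0.5833...
```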
I will clean up my "quick and dirty" script to make it more amenable for the next steps and check it in by the weekend.
I have added some metrics for word level accuracy on the sanskrit_util branch here - https://github.com/kmadathil/sanskrit_parser/tree/sanskrit_util/metrics
I have also started working on evaluating lexical split accuracy using the dataset, as part of the project referred to in #85. Currently I am planning to use the BLEU score or chrF score (from the machine translation literature) to evaluate the accuracy of these splits. Please let me know if there are any other ideas for evaluating accuracy.
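As a sketch of what chrF-style scoring of a split might look like (a simplified re-implementation of the character n-gram F-score, not the exact metric from the MT literature; the gold split and candidates here are illustrative):

```python
from collections import Counter

def char_ngrams(s, n):
    """Multiset of character n-grams of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(reference, hypothesis, max_n=6, beta=3.0):
    """Simplified chrF: character n-gram F-beta, averaged over n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        ref, hyp = char_ngrams(reference, n), char_ngrams(hypothesis, n)
        if not ref or not hyp:
            continue
        overlap = sum((ref & hyp).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0

# Compare candidate splits (joined with spaces) against the gold split
gold = "kaH cit naraH vA nArI"
print(chrf(gold, "kaH cit naraH vA nArI"))    # identical -> 1.0
print(chrf(gold, "kaH cit naraH vAna arI"))   # wrong split -> below 1.0
```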
I concur
Scripts for evaluating lexical split accuracy added to scoring branch here - https://github.com/kmadathil/sanskrit_parser/blob/scoring/metrics/lexical_split_scores.py
Adding a use case below where scoring may help resolve the best split. Can the tool choose [kaH, cit, naraH, vA, nArI] as the best output?
```
> python -m sanskrit_parser.lexical_analyzer.sanskrit_lexical_analyzer "kaScit naraH vA nArI" --debug --split
Input String: kaScit naraH vA nArI
Input String in SLP1: kaScit naraH vA nArI
Start Split
End DAG generation
End pathfinding 1527393212.680358
Splits:
[kaH, cit, naraH, vAna, arI]
[kaH, cit, naraH, vAH, nArI]
[kaH, cit, naraH, vA, nArI]
[kaH, cit, na, raH, vAna, arI]
[kaH, cit, naraH, vAH, na, arI]
[kaH, cit, naraH, vA, na, arI]
[kaH, cit, na, raH, vAH, nArI]
[kaH, cit, naraH, vA, AnA, arI]
[kaH, cit, na, raH, vA, nArI]
[kaH, cit, naraH, vA, A, nArI]
-----------
Performance
Time for graph generation = 0.024774s
Total time for graph generation + find paths = 0.032885s
```
I worked a lot on this problem, and can vouch that the approach in https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words/11642687 is the best solution around.
All we need is a frequency count for lexemes. Some ideas about where to obtain frequencies are at https://github.com/drdhaval2785/samasasplitter/issues/3#issuecomment-312500848
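The frequency-based approach from that Stack Overflow answer can be sketched as follows: dynamic programming over the unspaced input, with each word costed by a Zipf-style log-inverse-frequency of its rank. The toy lexeme list and its frequency ordering here are invented for illustration:

```python
from math import log

def make_cost(words_by_frequency):
    """Zipf-style cost: more frequent (lower-rank) words are cheaper."""
    n = len(words_by_frequency)
    return {w: log((rank + 1) * log(n)) for rank, w in enumerate(words_by_frequency)}

def best_split(text, cost, max_word_len=20):
    """Minimum-cost segmentation of `text` into known words (DP over prefixes)."""
    INF = float("inf")
    best = [(0.0, 0)]  # for each prefix: (total cost, length of its last word)
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(max(0, i - max_word_len), i):
            word = text[k:i]
            candidates.append((best[k][0] + cost.get(word, INF), i - k))
        best.append(min(candidates))
    # Walk back through the stored last-word lengths to recover the split
    out, i = [], len(text)
    while i > 0:
        _, k = best[i]
        out.append(text[i - k:i])
        i -= k
    return list(reversed(out))

# Hypothetical frequency ordering, most frequent first
cost = make_cost(["naraH", "vA", "nArI", "kaH", "cit", "na", "raH", "arI"])
print(best_split("kaHcitnaraHvAnArI", cost))  # [kaH, cit, naraH, vA, nArI]
```

With real frequency counts from a corpus such as the DCS, the same recurrence would prefer [kaH, cit, naraH, vA, nArI] over splits built from rarer fragments.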
@codito - Not sure how the whitespace problem and this issue are related - this one is about evaluating accuracy, is it not? Your issue is about picking one split over another.
I thought this issue also tracks using a score to ensure the most likely split gets higher priority in the output. Please ignore if I confused two different things.