sanskrit_parser
Add Sandhikosh to testing
Recent publications use the sandhikosh described in this paper as a benchmark. Let's add it to our testing and see where we stand.
(Related to #84)
Since this subsumes UoHD, I think we can make this our primary test corpus for sandhi.
We need to find a corpus for parsing.
I can see some erroneous spaces (which we can remove programmatically) and clear bad splits in sandhikosh.
They don't split some samAsas (which we do) and usually do not split upasargas (which we also do).
On the BhagavadGitA corpus I counted 881 passes and 549 fails with no edits, and 1002 passes and 428 fails with automated edits to remove the stray spaces.
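The automated edit is simple; here is a minimal sketch of the kind of cleanup I mean (hypothetical helper, not the actual test code), which strips stray leading/trailing spaces from a reference split:

```python
def clean_split(words):
    """Strip stray leading/trailing spaces from a reference split
    and drop entries that were pure whitespace (illustrative helper)."""
    cleaned = [w.strip() for w in words]
    return [w for w in cleaned if w]

# A SandhiKosh entry with a trailing space in the last word:
print(clean_split(['cittam', 'nirudDam', 'yogasevayA ']))
# -> ['cittam', 'nirudDam', 'yogasevayA']
```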
Some samples showing issues in the sandhikosh:
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry949] - AssertionError: assert ['cittam', 'nirudDam', 'yogasevayA '] in [['cittam', 'nirudDam', 'yoga', 'sevayA'], ['cit', 'tat...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry950] - AssertionError: assert ['ca', 'eva', 'AtmanA', 'AtmAnam paSyan', 'Atmani'] in [['ca', 'eva', 'AtmanA', 'AtmAnam', 'paSy...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry952] - AssertionError: assert ['budDigrAhyam', 'ati', 'indriyam '] in [['budDi', 'grAhyam', 'atIndriyam'], ['budDi', 'grAhyam'...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry954] - AssertionError: assert ['sTitaH', 'calatitattvataH'] in [['sTitaH', 'calati', 'tat', 'tu', 'ataH'], ['sTitaH', 'calati'...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry959] - AssertionError: assert ['guruRA', 'api '] in [['guruRA', 'api'], ['guruRA', 'pi'], ['guruRA', 'Api'], ['guru', 'Ra', 'a...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry960] - AssertionError: assert ['tam', 'vidyAt', 'duHKasaMyogaviyogam', 'yogasaYjYitam '] in [['tam', 'vidyAt', 'duHKa', 'saMyo...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry962] - AssertionError: assert ['yoktavyaH', 'yogaH', 'anirviRRacetasA'] in [['yoktavyaH', 'yogaH', 'asni', 'ru', 'iw', 'Ra', ....
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry963] - AssertionError: assert ['sam', 'kalpapraBavAn', 'kAmAn', 'tyaktvA '] in [['sam', 'kalpa', 'praBavAn', 'kAmAn', 'tyaktvA...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry966] - AssertionError: assert ['samam', 'tataH '] in [['samam', 'tataH'], ['samantataH'], ['samam', 'tat', 'aH'], ['samam', 't...
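For reference, the assertion pattern in these failures is just membership of the (cleaned) expected split in our candidate splits. A simplified sketch of what the check amounts to (names are assumed, this is not the actual `test_file_splits` code):

```python
def check_entry(expected_split, candidate_splits):
    """Return True if the reference split, after stripping stray
    spaces, matches any split produced by the parser (sketch only)."""
    expected = [w.strip() for w in expected_split]
    return expected in candidate_splits

# kosh_entry959 above fails only because of a trailing space:
candidates = [['guruRA', 'api'], ['guruRA', 'pi'], ['guruRA', 'Api']]
print(check_entry(['guruRA', 'api '], candidates))  # -> True
```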
Take a look at the multigraph branch, tests/SandhiKosh. manual_test.py runs the tests and outputs to Results.xls. I've run it for 1000 tests, with 622 passes. I will run the full dataset next.
Updated - 11080 Tests: 8413 Passed, 1232 Failed, 1430 No_Split, 5 Bad tests
Going by the SandhiKosh paper, we are already better than the best result they report (INRIA) for the subset that I ran (BG, Literature, External, UoH).
That's quite impressive! Thanks for adding this. We can look at the failed ones to understand what's happening. I will try to spend some time on it this weekend.
Two big sources of discrepancy: SandhiKosh doesn't split some samAsas (which we do), and usually does not split upasargas (which we also do, and IMO should). Both of these are proper pada boundaries.
This is where we stand on passes:
| Corpus | Total | JNU | UoH | INRIA | sanskrit_parser |
|----------------------|-------|-----|------|-------|-----------------|
| Rule based- Internal | 150 | 10 | 27 | 3 | 14 |
| Rule based- External | 132 | 22 | 48 | 38 | 41 |
| Literature | 150 | 13 | 98 | 101 | 66 |
| Bhagavad-gita | 1430 | 67 | 650 | 962 | 1002 |
| UoH | 9368 | 934 | 6393 | 6490 | 7304 |
| Ashtadhyayi | 2700 | 18 | 263 | 510 | 616 |
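To make the INRIA comparison concrete, the overall pass rates on the subset I ran (External, Literature, BG, UoH) can be tallied from the table above; the numbers below are copied from it (variable names are just illustrative), and our total matches the 8413 passes reported earlier:

```python
# Pass counts copied from the table above, for the corpora that were run
totals = {'External': 132, 'Literature': 150, 'BG': 1430, 'UoH': 9368}
inria  = {'External': 38,  'Literature': 101, 'BG': 962,  'UoH': 6490}
ours   = {'External': 41,  'Literature': 66,  'BG': 1002, 'UoH': 7304}

total = sum(totals.values())  # 11080
for name, counts in [('INRIA', inria), ('sanskrit_parser', ours)]:
    passed = sum(counts.values())
    print(f'{name}: {passed}/{total} = {100 * passed / total:.1f}%')
# INRIA: 7591/11080 = 68.5%
# sanskrit_parser: 8413/11080 = 75.9%
```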
One more issue noticed with the "Internal" set: for the same underlying sound they sometimes use a visarga and sometimes a स्:
कोऽसिचत् | कस्+असिचत्
वृक्षश्शेते | वृक्षः+शेते
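One way to make the tests tolerant of this would be to normalize a word-final स् to visarga before comparing. A rough sketch in SLP1 transliteration (the normalization rule here is my assumption, not something SandhiKosh prescribes):

```python
def normalize_final_s(word):
    """Normalize a word-final 's' to visarga 'H' in SLP1, so that
    e.g. 'kas' and 'kaH' compare equal (illustrative assumption)."""
    return word[:-1] + 'H' if word.endswith('s') else word

print(normalize_final_s('kas'))     # -> kaH
print(normalize_final_s('vfkzaH'))  # already has visarga, unchanged
```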
I am not sure if internal sandhi was a targeted use case. Ditto for AshtadhyayI. After all, the pratyayas and various terms in the sutras wouldn't be in any of the standard dictionaries. This probably explains the somewhat poor performance on those.
Is the lower performance on the literature category attributable to the two differences you mentioned before (splitting samasas and upasargas)?
Internal sandhi includes upasargas, which we do fine at (barring special cases).
The literature case seems to be mostly test problems. On a casual look, it seems the input is often incompletely split in the test.
The test is in now, but we still need to scrub the failures; adding this comment to state the remaining task.
The task is
- Look at tests/SandhiKosh/Results.xls, which is generated by tests/SandhiKosh/manual_test.py
- Triage the failures
| Corpus | Total | JNU | UoH | INRIA | sanskrit_parser |
|----------------------|-------|-----|------|-------|-----------------|
| Rule based- Internal | 150 | 10 | 27 | 3 | 14 |
| Rule based- External | 132 | 22 | 48 | 38 | 41 |
| Literature | 150 | 13 | 98 | 101 | 66 |
| Bhagavad-gita | 1430 | 67 | 650 | 962 | 1002 |
| UoH | 9368 | 934 | 6393 | 6490 | 7304 |
| Ashtadhyayi | 2700 | 18 | 263 | 510 | 616 |
- Look for possible causes - we know of many
- Test data has incomplete splits (i.e., the reference output is not fully split)
- Upasargas are not split in test data
- Samasas are not split in test data
- Word not in our lexicons (INRIA or sanskrit_data)
- Genuine failure - we need to fix something
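For the triage itself, tagging each failed row with one of the causes above would let us count them. A rough sketch of what that bookkeeping could look like (the category names and row fields are my assumptions, not the spreadsheet's actual columns):

```python
# Known failure causes, mirroring the list above (names assumed)
KNOWN_CAUSES = [
    'incomplete_split_in_test',
    'upasarga_not_split_in_test',
    'samasa_not_split_in_test',
    'word_not_in_lexicon',
    'genuine_failure',
]

def triage(entry, cause):
    """Attach a triage category to a failed entry (sketch only)."""
    if cause not in KNOWN_CAUSES:
        raise ValueError(f'unknown cause: {cause}')
    return {**entry, 'cause': cause}

row = {'id': 949, 'expected': ['cittam', 'nirudDam', 'yogasevayA']}
print(triage(row, 'samasa_not_split_in_test')['cause'])
# -> samasa_not_split_in_test
```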