sanskrit_parser
Add Sandhikosh to testing
Recent publications use the sandhikosh described in this paper as a benchmark. Let's add it to our testing and see where we stand.
(Related to #84)
Since this subsumes UoHD, I think we can make this our primary test corpus for sandhi.
We need to find a corpus for parsing.
I can see some erroneous spaces (which we can remove programmatically) and clear bad splits in sandhikosh.
They don't split some samAsas (which we do) and usually do not split upasargas (which we also do).
On the BhagavadGitA corpus I counted 881 passes and 549 fails with no edits, and 1002 passes and 428 fails with automated edits to remove the stray spaces.
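The automated edit is simple; here is a minimal sketch of the kind of cleanup I mean (hypothetical helper, not the actual test code), which strips stray leading/trailing spaces from a reference split:

```python
def clean_split(words):
    """Strip stray leading/trailing spaces from a reference split
    and drop entries that were pure whitespace (illustrative helper)."""
    cleaned = [w.strip() for w in words]
    return [w for w in cleaned if w]

# A SandhiKosh entry with a trailing space in the last word:
print(clean_split(['cittam', 'nirudDam', 'yogasevayA ']))
# -> ['cittam', 'nirudDam', 'yogasevayA']
```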
Some samples showing issues in the sandhikosh:
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry949] - AssertionError: assert ['cittam', 'nirudDam', 'yogasevayA '] in [['cittam', 'nirudDam', 'yoga', 'sevayA'], ['cit', 'tat...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry950] - AssertionError: assert ['ca', 'eva', 'AtmanA', 'AtmAnam paSyan', 'Atmani'] in [['ca', 'eva', 'AtmanA', 'AtmAnam', 'paSy...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry952] - AssertionError: assert ['budDigrAhyam', 'ati', 'indriyam '] in [['budDi', 'grAhyam', 'atIndriyam'], ['budDi', 'grAhyam'...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry954] - AssertionError: assert ['sTitaH', 'calatitattvataH'] in [['sTitaH', 'calati', 'tat', 'tu', 'ataH'], ['sTitaH', 'calati'...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry959] - AssertionError: assert ['guruRA', 'api '] in [['guruRA', 'api'], ['guruRA', 'pi'], ['guruRA', 'Api'], ['guru', 'Ra', 'a...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry960] - AssertionError: assert ['tam', 'vidyAt', 'duHKasaMyogaviyogam', 'yogasaYjYitam '] in [['tam', 'vidyAt', 'duHKa', 'saMyo...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry962] - AssertionError: assert ['yoktavyaH', 'yogaH', 'anirviRRacetasA'] in [['yoktavyaH', 'yogaH', 'asni', 'ru', 'iw', 'Ra', ....
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry963] - AssertionError: assert ['sam', 'kalpapraBavAn', 'kAmAn', 'tyaktvA '] in [['sam', 'kalpa', 'praBavAn', 'kAmAn', 'tyaktvA...
FAILED test_SandhiKosh.py::test_file_splits[kosh_entry966] - AssertionError: assert ['samam', 'tataH '] in [['samam', 'tataH'], ['samantataH'], ['samam', 'tat', 'aH'], ['samam', 't...
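For reference, the assertion pattern in these failures is just membership of the (cleaned) expected split in our candidate splits. A simplified sketch of what the check amounts to (names are assumed, this is not the actual `test_file_splits` code):

```python
def check_entry(expected_split, candidate_splits):
    """Return True if the reference split, after stripping stray
    spaces, matches any split produced by the parser (sketch only)."""
    expected = [w.strip() for w in expected_split]
    return expected in candidate_splits

# kosh_entry959 above fails only because of a trailing space:
candidates = [['guruRA', 'api'], ['guruRA', 'pi'], ['guruRA', 'Api']]
print(check_entry(['guruRA', 'api '], candidates))  # -> True
```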
Take a look at the multigraph branch, tests/SandhiKosh. manual_test.py runs the tests and outputs to Results.xls. I've run it for 1000 tests, with 622 passes. I will run the full dataset next.
Updated - 11080 Tests: 8413 Passed, 1232 Failed, 1430 No_Split, 5 Bad tests
Going by the SandhiKosh paper, we are already better than the best result they report (INRIA) for the subset that I ran (BG, Literature, External, UoH).
That's quite impressive! Thanks for adding this. We can look at the failed ones to understand what's happening. I will try to spend some time on it this weekend.
Two big sources of discrepancy: SandhiKosh doesn't split some samAsas (which we do), and usually does not split upasargas (which we also do, and IMO should). Both of these are proper pada boundaries.
This is where we stand on passes:
| Corpus | Total | JNU | UoH | INRIA | sanskrit_parser |
|----------------------|-------|-----|------|-------|-----------------|
| Rule based- Internal | 150 | 10 | 27 | 3 | 14 |
| Rule based- External | 132 | 22 | 48 | 38 | 41 |
| Literature | 150 | 13 | 98 | 101 | 66 |
| Bhagavad-gita | 1430 | 67 | 650 | 962 | 1002 |
| UoH | 9368 | 934 | 6393 | 6490 | 7304 |
| Ashtadhyayi | 2700 | 18 | 263 | 510 | 616 |
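To make the INRIA comparison concrete, the overall pass rates on the subset I ran (External, Literature, BG, UoH) can be tallied from the table above; the numbers below are copied from it (variable names are just illustrative), and our total matches the 8413 passes reported earlier:

```python
# Pass counts copied from the table above, for the corpora that were run
totals = {'External': 132, 'Literature': 150, 'BG': 1430, 'UoH': 9368}
inria  = {'External': 38,  'Literature': 101, 'BG': 962,  'UoH': 6490}
ours   = {'External': 41,  'Literature': 66,  'BG': 1002, 'UoH': 7304}

total = sum(totals.values())  # 11080
for name, counts in [('INRIA', inria), ('sanskrit_parser', ours)]:
    passed = sum(counts.values())
    print(f'{name}: {passed}/{total} = {100 * passed / total:.1f}%')
# INRIA: 7591/11080 = 68.5%
# sanskrit_parser: 8413/11080 = 75.9%
```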
One more issue noticed with the "Internal" set: for the same underlying sound they sometimes use a visarga and sometimes a स्:
कोऽसिचत् | कस्+असिचत्
वृक्षश्शेते | वृक्षः+शेते
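One way to make the tests tolerant of this would be to normalize a word-final स् to visarga before comparing. A rough sketch in SLP1 transliteration (the normalization rule here is my assumption, not something SandhiKosh prescribes):

```python
def normalize_final_s(word):
    """Normalize a word-final 's' to visarga 'H' in SLP1, so that
    e.g. 'kas' and 'kaH' compare equal (illustrative assumption)."""
    return word[:-1] + 'H' if word.endswith('s') else word

print(normalize_final_s('kas'))     # -> kaH
print(normalize_final_s('vfkzaH'))  # already has visarga, unchanged
```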
I am not sure if internal sandhi was a targeted use case. Ditto for AshtadhyayI. After all, the pratyayas and various terms in the sutras wouldn't be in any of the standard dictionaries. This probably explains the somewhat poor performance on those.
Is the lower performance on the literature category attributable to the two differences you mentioned before (splitting samasas and upasargas)?
Internal sandhi includes upasargas, which we do fine at (barring special cases).
The literature case seems to be mostly test problems. On a casual look, it seems the input is often incompletely split in the test.
The test is in now, but we still need to scrub the failures; adding this comment to state the remaining task.
The task is
- Look at tests/SandhiKosh/Results.xls, which is generated by tests/SandhiKosh/manual_test.py
- Triage the failures
| Corpus | Total | JNU | UoH | INRIA | sanskrit_parser |
|----------------------|-------|-----|------|-------|-----------------|
| Rule based- Internal | 150 | 10 | 27 | 3 | 14 |
| Rule based- External | 132 | 22 | 48 | 38 | 41 |
| Literature | 150 | 13 | 98 | 101 | 66 |
| Bhagavad-gita | 1430 | 67 | 650 | 962 | 1002 |
| UoH | 9368 | 934 | 6393 | 6490 | 7304 |
| Ashtadhyayi | 2700 | 18 | 263 | 510 | 616 |
- Look for possible causes - we know of many
- Test data has incomplete splits (i.e., the reference output is not fully split)
- Upasargas are not split in test data
- Samasas are not split in test data
- Word not in our lexicons (INRIA or sanskrit_data)
- Genuine failure - we need to fix something
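For the triage itself, tagging each failed row with one of the causes above would let us count them. A rough sketch of what that bookkeeping could look like (the category names and row fields are my assumptions, not the spreadsheet's actual columns):

```python
# Known failure causes, mirroring the list above (names assumed)
KNOWN_CAUSES = [
    'incomplete_split_in_test',
    'upasarga_not_split_in_test',
    'samasa_not_split_in_test',
    'word_not_in_lexicon',
    'genuine_failure',
]

def triage(entry, cause):
    """Attach a triage category to a failed entry (sketch only)."""
    if cause not in KNOWN_CAUSES:
        raise ValueError(f'unknown cause: {cause}')
    return {**entry, 'cause': cause}

row = {'id': 949, 'expected': ['cittam', 'nirudDam', 'yogasevayA']}
print(triage(row, 'samasa_not_split_in_test')['cause'])
# -> samasa_not_split_in_test
```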