
कविका तु खलीनोऽस्त्री कविकं कर्षणीत्यपि

kmadathil opened this issue on Feb 03 '20 · 9 comments

कविका तु खलीनोऽस्त्री कविकं कर्षणीत्यपि ।

Not even a single correct sandhi split is obtained for this line.

Message forwarded by @vvasuki:

The word khalIna does not appear in our dictionaries at all, so the split of this sentence also comes out wrong. The variant form khalina is found, and a sentence using that form is split as expected.

kavikA, kavikam, and karShaNI are in our dictionary, but khalInaH is not. (khalinaH is, though)

scripts/sanskrit_parser sandhi kavikAtuKalinostrikavikaMkarzaRItyapi --inp SLP1 
Interpreting input loosely (strict_io set to false)
Input String: kavikAtuKalinostrikavikaMkarzaRItyapi
Input String in SLP1: kavikAtuKalinostrikavikaNkarzaRItyapi
Start Split
End DAG generation
End pathfinding 1580756661.3418202
Splits:
[kavi, kA, tu, KalinoH, tri, kavikam, karzaRI, iti, api]
[kavi, kA, tu, KalinaH, astri, kavikam, karzaRI, iti, api]
[kavikA, tu, KalinaH, astri, kavikam, karzaRI, iti, api]  **
[kavi, kA, tu, KalinoH, trika, vikam, karzaRI, iti, api]
[kavi, kAtu, KalinaH, astri, kavikam, karzaRI, iti, api]
[kavi, kA, tu, Kali, noH, tri, kavikam, karzaRI, iti, api]
[kavi, kA, tu, KalinaH, aH, tri, kavikam, karzaRI, iti, api]
[kavi, kA, tu, KalinoH, tri, kavi, kam, karzaRI, iti, api]
[kavi, kA, tu, KalinaH, aH, trika, vikam, karzaRI, iti, api]
[ka, vi, kA, tu, KalinaH, astri, kavikam, karzaRI, iti, api]
-----------
Performance
Time for graph generation = 1.391314s
Total time for graph generation + find paths = 1.477661s
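
For reference, the same split can also be attempted from Python rather than the CLI. This is only a sketch assuming the Parser class and split() method documented in later sanskrit_parser releases; the API available at the time of this issue was different, so the exact names should be checked against the installed version.

```python
# Sketch only: drive the sandhi splitter programmatically.
# Assumes the Parser/split() interface from later sanskrit_parser releases.
from sanskrit_parser import Parser

parser = Parser(output_encoding="SLP1")
# Same SLP1 input string as in the CLI run above.
splits = parser.split("kavikAtuKalinostrikavikaMkarzaRItyapi", limit=10)
for split in splits:
    print(split)
```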

kmadathil · Feb 03 '20 19:02

Currently we use data from MW and INRIA, and neither contains this word. Should we maintain a "user-contributed list" of missing words, or add other dictionaries? If we add other dictionaries, we have to handle the same word appearing in multiple dictionaries. That is already a problem with the two sources we use, and it sometimes leads to the same split showing up twice.

avinashvarna · Apr 18 '20 17:04

I'm leaning towards a contrib dictionary - we could maintain a list of words not in INRIA/skt_data, in a specific format, and optionally enable or disable that list of words (orthogonal to the other dictionaries). This could simply be a JSON of words/tags. What process should we follow for adding/deleting words?
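
A minimal sketch of what such a contrib list and its loader might look like; the file name, JSON shape, and tag values below are purely illustrative, not a proposed standard:

```python
# contrib_words.json (hypothetical shape): one entry per missing word,
# carrying the minimal grammatical tags the analyzer would need, e.g.
# [
#   {"word": "KalIna", "linga": ["puM", "napuMsaka"], "tags": ["nAma"]},
#   {"word": "kavika", "linga": ["napuMsaka"], "tags": ["nAma"]}
# ]
import json

def load_contrib_words(path="contrib_words.json", enabled=True):
    """Load the user-contributed word list; return an empty dict when disabled."""
    if not enabled:
        return {}
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    # Index by word so it can be consulted alongside the regular dictionaries.
    return {entry["word"]: entry for entry in entries}
```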

kmadathil · Apr 18 '20 22:04

I strongly recommend against a user-contributed dictionary. If the largest dictionary does not contain this word, let us leave it.

drdhaval2785 · Apr 19 '20 03:04

How about VCP and SKD? Also Apte? These are standard dictionaries that cover a wide range of words. One could easily dedupe them and keep a single instance of each word in one corpus.

Actually, when I wrote a simple samAsa splitter based on Dr. Dhaval's code, I created a dictionary "AM", which had the Apte, MW, SKD and VCP words deduped (a rough sketch of such a merge appears below).

I'm also against the user-contributed dictionary, as it will get out of control IMO.
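
A rough sketch of the merge/dedupe step mentioned above; the file names are placeholders, and each input is assumed to hold one headword per line:

```python
# Merge headword lists from several dictionaries and drop duplicates,
# producing one combined corpus (an "AM"-style merged list).
def merge_headwords(paths=("apte.txt", "mw.txt", "skd.txt", "vcp.txt")):
    merged = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            merged.update(line.strip() for line in f if line.strip())
    return sorted(merged)
```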

poojapi · Apr 19 '20 03:04

If we want to cover all the Sanskrit dictionaries, there is already a ready-made resource; no need to start from scratch.

https://github.com/sanskrit-lexicon/hwnorm1/blob/master/sanhw1/hwnorm1c.txt

Roughly 400000 headwords, deduplicated.
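
A minimal sketch of using that list as a membership check. It assumes each non-empty line carries the headword as its first colon- or whitespace-delimited field; the actual layout of hwnorm1c.txt should be verified before relying on this:

```python
# Load the deduplicated headword list and test whether a given word is covered.
import re

def load_headwords(path="hwnorm1c.txt"):
    headwords = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Take the first field only; the delimiter is an assumption (see note above).
            headwords.add(re.split(r"[:\s]", line, maxsplit=1)[0])
    return headwords

words = load_headwords()
print("KalIna" in words)  # would tell us whether this headword is already covered
```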

drdhaval2785 · Apr 19 '20 04:04

The input from @drdhaval2785 and @poojapi makes sense. Maintaining a user-contrib dictionary could lead to inaccuracies that would be a headache to deal with.

I will take a look at hwnorm1c.txt to see if we can switch to using it.

avinashvarna · Apr 19 '20 04:04

Unfortunately, hwnorm1c.txt doesn't appear to have linga/avyaya/gaNa information. @drdhaval2785/@poojapi - are you aware of any similar deduped resource that includes this information?

avinashvarna · Apr 19 '20 05:04

Just to understand: is there actually a need for the linga, avyaya, and gaNa information?

Do we generate the forms on the fly based on this information?

drdhaval2785 · Apr 19 '20 05:04

Yes, we use the sanskrit_util library, which uses the avyaya/linga information to analyze forms on the fly. Otherwise, we would have to store all possible vibhakti/vachana combinations for every nAmapada (which is the INRIA approach).
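
To make the trade-off concrete, here is a toy illustration; the names, endings, and "analysis" are invented purely for illustration and are not sanskrit_util's API or real Sanskrit morphology:

```python
# Toy contrast between the two storage strategies.

# The lexicon stores only the stem plus its linga (what the on-the-fly
# approach needs); the endings stand in for vibhakti/vachana terminations.
LEXICON = {"Kalina": {"linga": "puM"}}
ENDINGS = ("H", "O", "AH")  # invented ending set, purely illustrative

def precompute_forms(stem):
    """INRIA-style: enumerate and store every 'inflected form' up front."""
    return {stem + e: (stem, e) for e in ENDINGS}

def analyze_on_the_fly(form):
    """On-the-fly idea: derive the analysis at query time instead of storing it."""
    for ending in ENDINGS:
        if form.endswith(ending) and form[: -len(ending)] in LEXICON:
            return form[: -len(ending)], ending
    return None

print(precompute_forms("Kalina"))      # full table stored per headword
print(analyze_on_the_fly("KalinaH"))   # ('Kalina', 'H') derived on demand
```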

avinashvarna · Apr 19 '20 07:04