sanskrit_parser
कविका तु खलीनोऽस्त्री कविकं कर्षणीत्यपि
For the line कविका तु खलीनोऽस्त्री कविकं कर्षणीत्यपि, not even one correct sandhi split is obtained.
Message from @vvasuki:
The word "khalIna" does not appear in our lexicons at all, so the sandhi split of this sentence is also done incorrectly. The variant form "khalina" is attested, and the sandhi split of a sentence using that form comes out as desired.
kavikA, kavikam, and karShaNI are in our dictionary, but khalInaH is not. (khalinaH is, though)
scripts/sanskrit_parser sandhi kavikAtuKalinostrikavikaMkarzaRItyapi --inp SLP1
Interpreting input loosely (strict_io set to false)
Input String: kavikAtuKalinostrikavikaMkarzaRItyapi
Input String in SLP1: kavikAtuKalinostrikavikaNkarzaRItyapi
Start Split
End DAG generation
End pathfinding 1580756661.3418202
Splits:
[kavi, kA, tu, KalinoH, tri, kavikam, karzaRI, iti, api]
[kavi, kA, tu, KalinaH, astri, kavikam, karzaRI, iti, api]
[kavikA, tu, KalinaH, astri, kavikam, karzaRI, iti, api] **
[kavi, kA, tu, KalinoH, trika, vikam, karzaRI, iti, api]
[kavi, kAtu, KalinaH, astri, kavikam, karzaRI, iti, api]
[kavi, kA, tu, Kali, noH, tri, kavikam, karzaRI, iti, api]
[kavi, kA, tu, KalinaH, aH, tri, kavikam, karzaRI, iti, api]
[kavi, kA, tu, KalinoH, tri, kavi, kam, karzaRI, iti, api]
[kavi, kA, tu, KalinaH, aH, trika, vikam, karzaRI, iti, api]
[ka, vi, kA, tu, KalinaH, astri, kavikam, karzaRI, iti, api]
-----------
Performance
Time for graph generation = 1.391314s
Total time for graph generation + find paths = 1.477661s
Currently we use data from MW and INRIA, and neither contains this word. Should we have a "user-contributed list" of missing words? Or add other dictionaries? If we add other dictionaries, we have to handle the problem of the same word appearing in multiple dictionaries. This is already a problem with the two sources we use, leading to the same split sometimes showing up twice.
I'm leaning towards a contrib dictionary - we could maintain a list of words not in INRIA/skt_data, in a specific format, and optionally enable or disable that list of words (orthogonal to the other dictionaries). This could simply be a JSON of words/tags. What process should we follow for adding/deleting words?
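To make the "JSON of words/tags" idea concrete, here is a minimal sketch of what such a contrib list and its loader could look like. The schema and names here are purely illustrative assumptions, not part of sanskrit_parser:

```python
import json

# Hypothetical contrib-list schema (illustrative only): each entry
# pairs an SLP1 headword with the tags the lookup layer would need.
CONTRIB_JSON = """
[
  {"word": "KalIna", "linga": ["puM", "napuM"]},
  {"word": "kavika", "linga": ["napuM"]}
]
"""

def load_contrib_words(text):
    """Parse the contrib list into a dict keyed by headword."""
    return {entry["word"]: entry for entry in json.loads(text)}

contrib = load_contrib_words(CONTRIB_JSON)
print("KalIna" in contrib)  # True
```

A flat list like this would be easy to review in pull requests, which could itself serve as the add/delete process.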
I strongly recommend against a user contributed dictionary. If the largest dictionary does not contain this word, let us leave it.
How about VCP and SKD? Also Apte? These are the standard dictionaries that cover a wide range of words. One could easily dedup them and keep a single instance of each word in one corpus.
Actually, when I wrote a simple samasa splitter based on Dr. Dhaval's code, I created a dictionary AM, which had the Apte, MW, SKD, and VCP words deduped.
I'm also against the user-contributed dictionary as it will get out of control IMO
If we want to cover all Sanskrit dictionaries, there is already a ready-made resource; no need to reinvent it.
https://github.com/sanskrit-lexicon/hwnorm1/blob/master/sanhw1/hwnorm1c.txt
Roughly 400000 headwords, deduplicated.
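Consuming that file for membership lookups could be as simple as the sketch below. The exact line layout is an assumption here: I treat the first colon-separated field on each non-empty line as the SLP1 headword, which should be verified against the actual hwnorm1c.txt format.

```python
def load_headwords(lines):
    """Build a set of headwords, assuming 'headword:metadata' lines."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        words.add(line.split(":", 1)[0])
    return words

# Toy sample standing in for the real 400k-line file.
sample = ["kavikA:mw,skd", "KalIna:vcp", "", "# comment"]
hw = load_headwords(sample)
print("KalIna" in hw)  # True
```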
Input from @drdhaval2785 and @poojapi makes sense. Maintaining a user-contrib dictionary can lead to inaccuracies which can be a headache to deal with.
I will take a look at the hwnorm1c.txt to see if we can switch to using that.
Unfortunately, hwnorm1c.txt doesn't appear to have linga/avyaya/gaNa information. @drdhaval2785/@poojapi - are you aware of any similar deduped resource that includes this information?
Just to understand: is there actually a need for linga/avyaya/gana information?
Do we generate the forms on the fly based on this information?
Yes, we use the sanskrit_util library which uses the avyaya/linga information to analyze forms on the fly. Otherwise, we would have to store all possible vibhakti/vachana combinations for all nAmapadas (which is the INRIA approach).
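The contrast between the two approaches can be sketched as follows. The endings and stems here are simplified illustrations, not the real sanskrit_util or INRIA machinery:

```python
# 1. On-the-fly (sanskrit_util-style): store only stem + linga and
#    derive each form when needed. Toy ending table, puM-linga only.
def inflect(stem, linga, vibhakti, vacana):
    endings = {("puM", 1, "eka"): "H",   # prathamA ekavacana
               ("puM", 2, "eka"): "m"}   # dvitIyA ekavacana
    return stem + endings[(linga, vibhakti, vacana)]

# 2. Precomputed (INRIA-style): store every inflected form up front,
#    mapped back to its stem and analysis.
full_forms = {"KalinaH": ("Kalina", 1, "eka"),
              "Kalinam": ("Kalina", 2, "eka")}

print(inflect("Kalina", "puM", 1, "eka"))  # KalinaH
```

The on-the-fly approach trades lookup-time computation for a much smaller lexicon, which is why the stem-level linga/avyaya tags matter.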