sanskrit_parser
Use word frequencies for trimming split graphs?
DCS word frequencies have been publicly available for a while now, here and also on couchdb. You might find them useful for paring down possible sandhi splits, etc.
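A minimal sketch of how such frequencies could be used to order candidate splits, assuming the counts are loaded into a plain dict. The names `split_score` and `rank_splits`, the add-one smoothing, and the toy counts are all assumptions for illustration; none of them come from DCS or from sanskrit_parser.

```python
import math

def split_score(split, freq, total, vocab_size):
    """Sum of add-one-smoothed log probabilities of the words in a split."""
    return sum(
        math.log((freq.get(word, 0) + 1) / (total + vocab_size))
        for word in split
    )

def rank_splits(splits, freq):
    """Return the candidate splits ordered from most to least likely."""
    total = sum(freq.values())
    vocab_size = len(freq)
    return sorted(
        splits,
        key=lambda s: split_score(s, freq, total, vocab_size),
        reverse=True,
    )

# Toy counts, invented for illustration (not real DCS numbers).
toy_freq = {"aham": 50, "gacchAmi": 20, "gacchA": 1, "mi": 1}
candidates = [["gacchA", "mi", "aham"], ["gacchAmi", "aham"]]
ranked = rank_splits(candidates, toy_freq)
print(ranked[0])
```

Note that summing log probabilities favours splits with fewer words, which may or may not be what is wanted here.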
The frequencies here appear to be out of date. E.g. this query returns 951 rows, whereas the frequency in the csv for "a" is 380. Should I file an issue at stardict-sanskrit to update the frequencies?
Those numbers are indeed old. Yes, please do file an issue.
Is it about words only or word-combinations? I mean does it make sense extracting word combination frequency from DCS for this task?
In general, n-gram models (n>1) tend to give better results, so you are right that extracting word combinations would make more sense. However, given that word order is irrelevant in many cases in Sanskrit, the model might have to be extended to use "co-occurrence within the same sentence" probabilities rather than just sequential word combinations. I am starting to do some experiments on DCS data to get a better understanding of what is possible. A first exploration, posted here, might be of interest to some of you: https://github.com/avinashvarna/dcs_experiments/blob/master/word2vec_experiments.ipynb
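To make the "co-occurrence within the same sentence" idea concrete, here is a rough sketch that counts unordered word pairs per sentence. The function name `cooccurrence_counts` and the toy sentences are assumptions for illustration; real DCS data would need its own loading and lemmatisation step.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how many sentences contain each unordered pair of words."""
    counts = Counter()
    for sentence in sentences:
        # Use the sorted set of distinct words so that neither word order
        # nor repetition within a sentence affects the count.
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return counts

# Toy sentences, invented for illustration.
toy_sentences = [
    ["aham", "gacchAmi"],
    ["gacchAmi", "aham"],
    ["aham", "vanam", "gacchAmi"],
]
counts = cooccurrence_counts(toy_sentences)
print(counts[("aham", "gacchAmi")])
```

Because each pair is stored in sorted order, the first two toy sentences contribute to the same key even though their word order differs.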
I think we should use this (DCS co-occurrence) to rank splits that have been marked morphologically valid. The morpho branch is progressing, albeit slowly. See #28
Step 1: Prune the lexical DAG, removing whatever we can remove (nothing done yet)
Step 2: Generate paths, and prune paths using path constraints
Step 3: Rank remaining paths using DCS co-occurrence
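The three steps above could be sketched roughly as follows, on a toy DAG stored as a dict of successor lists. This is not sanskrit_parser's actual code or API: `prune_dag`, `all_paths`, `analyse`, the `<s>`/`</s>` markers, and the three predicates passed in are hypothetical stand-ins for the lexical check, the path constraints, and the DCS-based score.

```python
def prune_dag(dag, keep):
    """Step 1: drop nodes (and edges into them) that fail the keep predicate."""
    return {
        node: [n for n in succs if keep(n)]
        for node, succs in dag.items()
        if keep(node)
    }

def all_paths(dag, node, end):
    """Step 2a: enumerate every path from node to end by depth-first search."""
    if node == end:
        yield [node]
        return
    for succ in dag.get(node, []):
        for rest in all_paths(dag, succ, end):
            yield [node] + rest

def analyse(dag, start, end, keep, is_valid_path, score):
    pruned = prune_dag(dag, keep)
    # Strip the start and end markers from each path.
    paths = [p[1:-1] for p in all_paths(pruned, start, end)]
    # Step 2b: keep only paths that satisfy the path constraints.
    valid = [p for p in paths if is_valid_path(p)]
    # Step 3: order the surviving paths, best first.
    return sorted(valid, key=score, reverse=True)

# Toy DAG and toy counts, invented for illustration.
toy_dag = {
    "<s>": ["gacchAmi", "gacchA", "gaccha", "xyz"],
    "gacchAmi": ["aham"],
    "gacchA": ["mi"],
    "mi": ["aham"],
    "gaccha": ["ami"],
    "ami": ["aham"],
    "xyz": ["aham"],
    "aham": ["</s>"],
}
toy_freq = {"aham": 50, "gacchAmi": 20, "gacchA": 1, "mi": 1}

ranked = analyse(
    toy_dag, "<s>", "</s>",
    keep=lambda n: n != "xyz",                 # stand-in lexical check
    is_valid_path=lambda p: "ami" not in p,    # stand-in path constraint
    score=lambda p: min(toy_freq.get(w, 0) for w in p),  # stand-in ranking
)
print(ranked)
```

The score here is just the count of the rarest word in the path; a co-occurrence based score would slot into the same place.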
word order is irrelevant in many cases
Do you have stats to prove the hypothesis?
Stats? गच्छाम्यहम् and अहङ्गच्छामि (both "I go") are equivalent. That a particular order is favoured does not mean we've got to enforce it, IMO. I'd rather allow morphologically valid splits even if they don't follow the order preferred by classical authors.
Stats
Yes. Since Whitney, Sanskrit has been studied with statistics, and Oliver continues along Whitney's path. I recommend that you read http://www.springer.com/in/book/9789027705495
गच्छाम्यहम् and अहङ्गच्छामि are equivalent
Usage is what matters. The fact that it's understandable does not mean they occur equally often.
Thanks for the link, I will read that.
However, within the context of this project, we're worried about legitimacy rather than usage. Something that's used less, but still legitimate, must be flagged as valid. Unless we have traditional grammarians that prohibit certain usages, we must consider them legitimate.
Usage is what matters. The fact that it's understandable does not mean they occur equally often.
However, I think you have a point on usage. We should rely on usage for ordering splits once we've determined the valid ones.
Is there a summary of this monograph online? It's quite pricey to buy it, and google books preview only shows the introduction (in the US at least). Maybe I'll look for Dr. Hellwig's publications that cite this work and try to infer what the original might have.
I will look at some stats from the DCS data and see what we can infer from them regarding word order. E.g. look at relative position of where the kriyApada occurs in a sentence.
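One way such a statistic might be computed, assuming each sentence is available as a list of (word, tag) pairs with verbs tagged "V". That input format, the tag names, and the function `verb_relative_positions` are assumptions for illustration, not the actual DCS schema.

```python
def verb_relative_positions(tagged_sentences, verb_tag="V"):
    """Return the relative position (0.0 = first word, 1.0 = last) of each verb."""
    positions = []
    for sentence in tagged_sentences:
        if len(sentence) < 2:
            continue  # relative position is undefined for one-word sentences
        for index, (_, tag) in enumerate(sentence):
            if tag == verb_tag:
                positions.append(index / (len(sentence) - 1))
    return positions

# Toy tagged sentences, invented for illustration.
toy = [
    [("aham", "PRON"), ("vanam", "N"), ("gacchAmi", "V")],
    [("gacchAmi", "V"), ("aham", "PRON")],
]
positions = verb_relative_positions(toy)
print(sum(positions) / len(positions))
```

A histogram of these values over a whole corpus would show whether the kriyApada clusters at the end of the sentence or is spread more evenly.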
@avinashvarna some russkies I like have us covered
@vvasuki I am obliged to you for the pointer to this book! @gasyoun Thanks for introducing us to this. Our approach should end up less naive as a result.
@gasyoun I am still going through the details, but the summary and conclusions section at least does not contradict the assertion that word order is largely irrelevant, as considered by Indian grammarians. Is there a specific section where this assertion is contradicted?
Usage is what matters. The fact that it's understandable does not mean they occur equally often.
Depends on what the goal is. If it is to rank sentences in a statistical model based on the likelihood of occurrence, then you are correct. If the goal is to identify grammatically valid sentences, as @kmadathil mentioned, then statistics don't matter. To quote from the monograph you referred to:
"Adapting these criticisms to the interpretation of the Western approach, we may state that the latter deals with usage and utterances, rather than with the grammaticality of sentences."
It also cautions against being completely "corpus-driven":
"In this connection it may be noted that the Indian theorists (including men like Apte) were Sanskrit speakers, whereas Western Sanskritists have generally studied Sanskrit from the corpus of Sanskrit texts, in the same manner in which scholars of Latin have studied Latin."
Even in my original statement, I said that we may have to adapt existing metrics to be more suitable for Sanskrit. Do you see any issue with that?
I gather the following from reading the document mentioned above:
- Indian Grammarians are unanimous that word order is irrelevant (with few exceptions, such as upasargas before dhAtus)
- Western Sanskritists prefer a statistical approach, and are more keen to adopt word-order constraints.
From the conclusion of the document: "It is easy, but wrong, to equate 'statistically normal' with 'natural' and 'statistically abnormal' with 'distorted', 'inverted', etc."
Given the above, I would go with the Indian approach, while retaining the possibility of using frequency to order the valid alternatives (which are determined without recourse to word order where it is not grammatically relevant).
For sandhi, frequency is a must.