sanskrit_parser

Use word frequencies for trimming split graphs?

Open vvasuki opened this issue 7 years ago • 15 comments

DCS word frequencies have been publicly available for a while now, here and also on CouchDB. You might find them useful for paring down possible sandhi splits, etc.
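
A minimal sketch of how word frequencies could be used to order candidate sandhi splits. The counts, the SLP1 tokens and the `score_split` helper are all hypothetical and are not taken from DCS or from sanskrit_parser; the score is a plain sum of add-one-smoothed log relative frequencies.

```python
import math
from collections import Counter

def score_split(words, freq, total, vocab_size):
    """Sum of add-one-smoothed log relative frequencies (higher = more plausible)."""
    return sum(
        math.log((freq.get(w, 0) + 1) / (total + vocab_size))
        for w in words
    )

# Hypothetical counts for illustration only, NOT real DCS figures.
freq = Counter({"aham": 500, "gacCAmi": 40, "gacCa": 25, "ami": 1})
total = sum(freq.values())

# Two hypothetical candidate splits of the same input string.
candidates = [["gacCa", "ami", "aham"], ["gacCAmi", "aham"]]

ranked = sorted(
    candidates,
    key=lambda split: score_split(split, freq, total, len(freq)),
    reverse=True,
)
print(ranked[0])
```

Note that a sum of log frequencies favours splits with fewer words, since every extra word adds a negative term; whether that bias is desirable here is an open question.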

vvasuki avatar Jul 31 '17 20:07 vvasuki

The frequencies here appear to be out of date. E.g., this query returns 951 rows, whereas the frequency in the CSV for "a" is 380. Should I file an issue at stardict-sanskrit to update the frequencies?

avinashvarna avatar Aug 02 '17 18:08 avinashvarna

Those numbers are indeed old. Yes, please do file a reminder.

vvasuki avatar Aug 02 '17 19:08 vvasuki

Is it about words only or word-combinations? I mean does it make sense extracting word combination frequency from DCS for this task?

gasyoun avatar Aug 08 '17 21:08 gasyoun

In general, n-gram models (n>1) tend to give better results, so you are right that extracting word combinations would make more sense. However, given that word order is irrelevant in many cases in Sanskrit, the model might have to be extended to use "co-occurrence within the same sentence" probabilities rather than just sequential word combinations. I am starting to do some experiments on DCS data to get a better understanding of what is possible. A first exploration, which might be of interest to some of you, is posted here: https://github.com/avinashvarna/dcs_experiments/blob/master/word2vec_experiments.ipynb
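
As an illustration of "co-occurrence within the same sentence", here is a minimal sketch that counts unordered word pairs per sentence, so that both orderings of a pair contribute to the same count. The sentences and the `cooccurrence_counts` helper are hypothetical; this is not the method used in the linked notebook.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how many sentences contain each unordered pair of distinct words."""
    counts = Counter()
    for sentence in sentences:
        # set() ignores repeats within a sentence; sorted() makes
        # (a, b) and (b, a) map to the same key, so order is irrelevant.
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return counts

# Hypothetical tokenised sentences (SLP1), for illustration only.
sentences = [
    ["aham", "gacCAmi"],
    ["gacCAmi", "aham"],
    ["aham", "vanam", "gacCAmi"],
]

counts = cooccurrence_counts(sentences)
print(counts[("aham", "gacCAmi")])
```

A sequential bigram model would treat the first two sentences as different events; here they are counted as the same pair.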

avinashvarna avatar Aug 09 '17 06:08 avinashvarna

I think we should use this (DCS co-occurrence) to rank splits that have been marked morphologically valid. The morpho branch is progressing, albeit slowly. See #28

Step 1: Prune the lexical DAG, removing whatever we can remove (nothing done yet)
Step 2: Generate paths, and prune paths using path constraints
Step 3: Rank remaining paths using DCS co-occurrence
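
The three steps can be sketched on a toy graph as follows. Everything here is an assumption made for illustration: the graph, the lexicon, the "exactly one finite verb" constraint, the tags and the pair counts are hypothetical, and none of the function names come from sanskrit_parser.

```python
from itertools import combinations

START, END = "<s>", "</s>"

def prune_dag(dag, lexicon):
    """Step 1: drop nodes that fail a lexical check, and edges into them."""
    def keep(node):
        return node in (START, END) or node in lexicon
    return {n: [m for m in succ if keep(m)] for n, succ in dag.items() if keep(n)}

def all_paths(dag, node=START):
    """Step 2a: enumerate every START -> END path as a list of words."""
    for nxt in dag.get(node, []):
        if nxt == END:
            yield []
        else:
            for rest in all_paths(dag, nxt):
                yield [nxt] + rest

def exactly_one_finite_verb(path, finite_verbs):
    """Step 2b: a hypothetical path constraint."""
    return sum(word in finite_verbs for word in path) == 1

def cooccurrence_score(path, pair_counts):
    """Step 3: sum sentence-level co-occurrence counts over unordered pairs."""
    return sum(pair_counts.get(tuple(sorted(p)), 0) for p in combinations(path, 2))

# Toy lexical DAG; node names double as words (hypothetical data).
dag = {
    START: ["gacCAmi", "gacCa", "gacCAmya"],
    "gacCAmi": ["aham", "a"],
    "gacCa": ["ami"],
    "ami": ["aham", "a"],
    "gacCAmya": ["ham"],
    "a": ["ham"],
    "aham": [END],
    "ham": [END],
}
lexicon = {"gacCAmi", "gacCa", "ami", "aham", "a", "ham"}  # hypothetical
finite_verbs = {"gacCAmi", "gacCa", "ami"}                 # hypothetical tags
pair_counts = {("aham", "gacCAmi"): 12, ("a", "gacCAmi"): 1}  # hypothetical

pruned = prune_dag(dag, lexicon)
valid = [p for p in all_paths(pruned) if exactly_one_finite_verb(p, finite_verbs)]
ranked = sorted(valid, key=lambda p: cooccurrence_score(p, pair_counts), reverse=True)
print(ranked)
```

Frequency data only enters in the final sort, so it reorders the surviving paths but never removes one.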

kmadathil avatar Aug 09 '17 06:08 kmadathil

"word order is irrelevant in many cases"

Do you have stats to prove the hypothesis?

gasyoun avatar Aug 09 '17 06:08 gasyoun

Stats? गच्छाम्यहम् and अहङ्गच्छामि are equivalent. That a particular order is favoured does not mean we've got to enforce that, IMO. I'd rather allow morphologically valid splits even if they don't follow the order preferred by classical authors.

kmadathil avatar Aug 09 '17 07:08 kmadathil

"Stats"

Yes. Since Whitney, Sanskrit has been treated statistically, and Oliver continues along Whitney's path. I recommend reading http://www.springer.com/in/book/9789027705495

"गच्छाम्यहम् and अहङ्गच्छामि are equivalent"

Usage is what matters. The fact that it's understandable does not mean they occur equally often.

gasyoun avatar Aug 09 '17 10:08 gasyoun

Thanks for the link, I will read that.

However, within the context of this project, we're concerned with legitimacy rather than usage. Something that's used less, but is still legitimate, must be flagged as valid. Unless traditional grammarians prohibit a certain usage, we must consider it legitimate.

"Usage is what matters. The fact that it's understandable does not mean they occur equally often."

However, I think you have a point on usage. We should rely on usage for ordering splits once we've determined the valid ones.

kmadathil avatar Aug 09 '17 10:08 kmadathil

Is there a summary of this monograph online? It's quite pricey to buy, and the Google Books preview only shows the introduction (in the US at least). Maybe I'll look for Dr. Hellwig's publications that cite this work and try to infer what the original might contain.

I will look at some stats from the DCS data and see what we can infer from them regarding word order. E.g., look at the relative position of the kriyApada (verb) in a sentence.
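
A rough sketch of such a statistic, assuming tokenised sentences and some way of recognising verbs. Both the sample sentences and the `is_verb` predicate are hypothetical stand-ins, not the DCS data format.

```python
from statistics import mean

def verb_relative_positions(sentences, is_verb):
    """Return the relative position (0.0 = first word, 1.0 = last word)
    of every verb token in sentences of at least two words."""
    positions = []
    for sentence in sentences:
        if len(sentence) < 2:
            continue  # relative position is undefined for one-word sentences
        for index, token in enumerate(sentence):
            if is_verb(token):
                positions.append(index / (len(sentence) - 1))
    return positions

# Hypothetical tokenised sentences (SLP1), for illustration only.
sentences = [
    ["aham", "gacCAmi"],
    ["gacCAmi", "aham"],
    ["aham", "vanam", "gacCAmi"],
]

positions = verb_relative_positions(sentences, lambda token: token == "gacCAmi")
print(positions, mean(positions))
```

A histogram of these values over a real corpus would show whether verbs cluster at the end of the sentence or are spread more evenly.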

avinashvarna avatar Aug 09 '17 17:08 avinashvarna

@avinashvarna some russkies I like have us covered

vvasuki avatar Aug 09 '17 17:08 vvasuki

@vvasuki Much obliged for the pointer to this book! @gasyoun Thanks for introducing us to this. Our approach should end up less naive as a result.

kmadathil avatar Aug 10 '17 03:08 kmadathil

@gasyoun I am still going through the details, but the summary and conclusions section at least does not contradict the assertion that word order is largely irrelevant, as considered by Indian grammarians. Is there a specific section where this assertion is contradicted?

"Usage is what matters. The fact that it's understandable does not mean they occur equally often."

Depends on what the goal is. If it is to rank sentences in a statistical model based on the likelihood of occurrence, then you are correct. If the goal is to identify grammatically valid sentences, as @kmadathil mentioned, then statistics don't matter. To quote from the monograph you referred to:

"Adapting these criticisms to the interpretation of the Western approach, we may state that the latter deals with usage and utterances, rather than with the grammaticality of sentences."

It also cautions against being completely "corpus-driven":

"In this connection it may be noted that the Indian theorists (including men like Apte) were Sanskrit speakers, whereas Western Sanskritists have generally studied Sanskrit from the corpus of Sanskrit texts, in the same manner in which scholars of Latin have studied Latin."

Even in my original statement, I said that we may have to adapt existing metrics to be more suitable for Sanskrit. Do you see any issue with that?

avinashvarna avatar Aug 10 '17 17:08 avinashvarna

I gather the following from reading the document mentioned above:

  1. Indian grammarians are unanimous that word order is irrelevant (with a few exceptions, such as upasargas before dhAtus).
  2. Western Sanskritists prefer a statistical approach, and are more keen to adopt word-order constraints.

From the conclusion of the document: "It is easy, but wrong, to equate 'statistically normal' with 'natural' and 'statistically abnormal' with 'distorted', 'inverted', etc."

Given the above, I would go with the Indian approach, while retaining the possibility of using frequency to order valid alternatives (which are determined without recourse to word order where it is not grammatically relevant).

kmadathil avatar Oct 02 '17 15:10 kmadathil

For sandhi, frequency is a must.

gasyoun avatar Apr 21 '20 09:04 gasyoun