sanskrit_parser

Document and Publish

Open kmadathil opened this issue 4 years ago • 13 comments

In 2021, let's consider a) Creating a good document explaining the software design b) Publishing on arxiv.org and in a relevant conference/publication if feasible.

There are gaps, such as a good test suite and a better way to measure metrics (see #84).

kmadathil avatar Jan 11 '21 20:01 kmadathil

Thoughts, @avinashvarna @vvasuki ?

kmadathil avatar Jan 11 '21 20:01 kmadathil

Chuck the "good document". Use a website instead (e.g. https://jyotisham.github.io/jyotisha/software/ hosted from https://github.com/jyotisham/jyotisha/tree/master/hugo-source ) for design, and a readthedocs-type service for API docs.

For propaganda via conference/publication, you need a separate doc anyway (with some content copied from the site).

vvasuki avatar Jan 12 '21 05:01 vvasuki

Good idea! More documentation is always better. I am not very familiar with the Sanskrit NLP conferences, but if there is a novel component in our implementation, then we can submit it.

  1. I don't think @kmadathil meant a word doc, and I am fine with wiki/website as appropriate.
  2. Regarding metrics, I read a few papers published recently on this topic. Several use the sandhikosh described in this paper as a benchmark. I will open an issue to add this to our testing and use the same metrics.

avinashvarna avatar Jan 12 '21 15:01 avinashvarna

An online document/web-page is fine, but we're at a sufficiently advanced stage to document our software design. Once it's documented, we can explore exactly how novel it is. I suspect at least part of it must be, since there aren't many competitors that start from any string and provide a grammatical parse, splitting sandhi along the way. (Are there any that do all of this?)

Avinash, can you post some of the recent literature on sandhi that you read? Thanks for posting sandhikosh, we should migrate our test infrastructure to it as the first step.

kmadathil avatar Jan 12 '21 17:01 kmadathil

Don't the INRIA and UoH tools provide a parse along with the splits? A search also found this link, but I couldn't find an interface/source - https://www.appliedsyntax.com/sanskrit-parsing

Most of the recent literature has been focused on using Neural Network based approaches. Most of the papers on this list for example that use the sandhikosh as the benchmark are a good place to start. The latest paper I read was actually posted on the sanskrit-programmers list - https://www.linkedin.com/posts/prathosh-ap-phd-50ab9511_preprint-of-sandhi-paper-ugcPost-6716237804055199744-JBDu/

avinashvarna avatar Jan 14 '21 15:01 avinashvarna

It would be good to survey alternatives. UoH provides a parse, if you go by their publications. I haven't tried INRIA's parser. A good test set for parses (analogous to SandhiKosh) would be a way to compare ourselves and look for areas of improvement.

kmadathil avatar Jan 14 '21 19:01 kmadathil

From what I could read:

  1. UoH does not start from an un-split sentence; they assume a sandhi-split sentence. However, their graph-based approach is most akin to our vakya analyzer.
  2. SHR (Sanskrit Heritage Reader, Goyal and Huet 2016) splits sentences, but does not do sentence analysis.
  3. IIT-Kgp (Pawan Goyal & co.) do splitting and joint splitting/morphological analysis, but using NN methods. They have the best results.

If we can grab the DCS10K used by the IIT-Kgp papers, we can compare against them.
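Whichever benchmark we end up using, the comparison needs a small scoring harness. Here is a minimal sketch, assuming each test item is a pair of (predicted split, gold split), both given as lists of word strings. The function names `split_scores` and `evaluate` are hypothetical, and the metrics (exact match plus word-level precision/recall/F1) are a common choice rather than necessarily the ones the SandhiKosh or IIT-Kgp papers report.

```python
from collections import Counter


def split_scores(predicted, gold):
    """Word-level precision, recall and F1 for a single sandhi split.

    Words are compared as multisets, so word order is ignored here;
    exact (ordered) agreement is counted separately in evaluate().
    """
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def evaluate(pairs):
    """Macro-average the scores over (predicted, gold) pairs."""
    pairs = [(list(p), list(g)) for p, g in pairs]
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    n = len(pairs)
    scores = [split_scores(p, g) for p, g in pairs]
    return {
        "exact_match": sum(p == g for p, g in pairs) / n,
        "precision": sum(s[0] for s in scores) / n,
        "recall": sum(s[1] for s in scores) / n,
        "f1": sum(s[2] for s in scores) / n,
    }


# Toy illustration only (SLP1 strings); not real benchmark data.
toy_pairs = [
    (["asti", "uttarasyAm", "diSi"], ["asti", "uttarasyAm", "diSi"]),
    (["tattvam", "asi"], ["tat", "tvam", "asi"]),
]
report = evaluate(toy_pairs)
```

The predicted lists would come from our splitter's top-ranked output and the gold lists from the benchmark file; both loaders are left out because the file formats would need to be checked first.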

kmadathil avatar Feb 03 '21 00:02 kmadathil

I am not very familiar with the Sanskrit NLP conferences

There are only two around.

gasyoun avatar Mar 19 '21 19:03 gasyoun

I have updated the documentation with more information about our internals.

kmadathil avatar Mar 19 '21 23:03 kmadathil

a) Creating a good document explaining the software design

Installation instructions would be a good place to start.

gasyoun avatar Mar 31 '21 13:03 gasyoun

@gasyoun - Install is as simple as pip install sanskrit_parser, which is documented.

If your concern is about the MS VC++ dependency of gensim, please take it up with them. Your alternative is to run Linux (which I recommend), but you'll need to install build-essential there as well. This is a gensim requirement, not one of our code.

We will explore alternatives to gensim.

kmadathil avatar Mar 31 '21 16:03 kmadathil

Your alternative is to run Linux (which I recommend)

To split 4000 samasas? No, thanks. I will have to find another alternative.

gasyoun avatar Apr 01 '21 14:04 gasyoun

Your alternative is to run Linux (which I recommend)

To split 4000 samasas? No, thanks. I will have to find another alternative.

Since I like to keep track of such, it'd be useful if you notify us of any alternative you find (preferably accompanied by a brief review).

vvasuki avatar Apr 02 '21 02:04 vvasuki