sanskrit_parser
Bhagavad Gita 1.1 returns no parse results:
Minimal example:
from sanskrit_parser import Parser


def api_example(string, output_encoding):
    buf = []
    parser = Parser(output_encoding=output_encoding)
    buf.append('Splits:')
    for split in parser.split(string, limit=2):
        buf.append(f'Lexical Split: {split}')
        for i, parse in enumerate(split.parse(limit=2)):
            buf.append(f'Parse {i}')
            buf.append(f'{parse}')
    return '\n'.join(buf)


for phrase in [
    'Darmakzetre kurukzetre samavetA yuyutsavaH',
    'mAmakAH pARqavAScEva kimakurvata saMjaya',
]:
    resp = api_example(phrase, 'slp1')
    print(resp)
Each phrase has lexical splits with no parse information. Output is:
Splits:
Lexical Split: ['Darmakzetre', 'kurukzetre', 'samavetAH', 'yuyutsavaH']
Lexical Split: ['Darmakzetre', 'kurukzetre', 'samavetA', 'yuyutsavaH']
And:
Splits:
Lexical Split: ['mAmakAH', 'pARqavAH', 'cA', 'Eva', 'kim', 'akurvata', 'saYjaya']
Lexical Split: ['mAmakAH', 'pARqavAH', 'ca', 'eva', 'kim', 'akurvata', 'saYjaya']
I'm not sure what I'm doing wrong here -- would appreciate any help you can provide.
Also, I get around 1.8 seconds per verse:
import time

num_trials = 20
start = time.time()
for phrase in [
    'Darmakzetre kurukzetre samavetA yuyutsavaH',
    'mAmakAH pARqavAScEva kimakurvata saMjaya',
] * num_trials:
    resp = api_example(phrase, 'slp1')
end = time.time()
# Each trial covers both half-verses, so this is seconds per full verse.
print((end - start) / num_trials)
Is there anything we can do to improve performance here? Ideally I'd like around 100ms per verse.
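To see whether the 1.8 seconds is mostly one-time setup or a steady per-call cost, the first pass can be timed separately from later passes. This is only a sketch: `time_calls` and `stand_in` are hypothetical helpers, not part of sanskrit_parser, and the stand-in workload exists just to make the snippet runnable. With the library installed, `lambda p: api_example(p, 'slp1')` would be passed instead. Note that `api_example` builds a new `Parser` on every call; I have not checked whether that contributes to the cost.

```python
import time
from statistics import mean


def time_calls(fn, inputs, repeats=5):
    """Return (seconds for the first pass, mean seconds of later passes).

    One pass calls fn once on every item in inputs.
    """
    start = time.perf_counter()
    for item in inputs:
        fn(item)
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            fn(item)
        later.append(time.perf_counter() - start)
    return first, mean(later)


def stand_in(phrase):
    # Placeholder workload (assumption); swap in the real parser call here.
    return phrase.split()


phrases = [
    'Darmakzetre kurukzetre samavetA yuyutsavaH',
    'mAmakAH pARqavAScEva kimakurvata saMjaya',
]
first, later = time_calls(stand_in, phrases)
print(f'first pass: {first:.4f}s, later passes (mean): {later:.4f}s')
```

If the first pass is much slower than the later ones, the cost is probably in initialization rather than in the split/parse calls themselves.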
@akprasad Arun, thanks for reporting this. This is what's going on:
1. We need a verb of some sort to anchor the parse. However, since samavetAH is a kta form, it should qualify, so we should be able to generate a parse (though not the correct one without the rest of the sentence). I see that samaveta isn't tagged as a kta in the dictionary. I validated this by parsing 'Darmakzetre kurukzetre samAgatA yuyutsavaH'.
2. I will dig into why the second part isn't being parsed.
3. Ideally, we should be parsing the entire sentence at once, 'Darmakzetre kurukzetre samavetA yuyutsavaH mAmakAH pARqavAScEva kimakurvata saMjaya', to get the correct parse. This is being held up by whatever is holding up item 2.
I will post an update after digging further.
'Darmakzetre kurukzetre samavetAH yuyutsavaH kim akurvata saMjaya' now parses correctly, and has been added to the test suite.
Parsing it takes roughly 400ms.
Hi Arun @akprasad
Can we get on a conference call this weekend to discuss use modes? (It would be good if @avinashvarna can join too.) I can demo the UI to you so you can switch to that (or our command line) from the API.
The flow I have in mind (which is what our UI does) is a two-step process:
- Split sandhis, and let the user pick the right split from an ordered list.
- Parse the sentence using the chosen sandhi split.
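Roughly, in code, the flow looks like the sketch below. Everything here is a placeholder: `split_then_parse`, `fake_split`, `pick_first`, and `fake_parse` are hypothetical names and not our actual API or UI code. The stand-ins just return the splits from the report above so the snippet runs on its own.

```python
def split_then_parse(text, split_fn, choose_fn, parse_fn):
    """Two-step flow: get candidate splits, pick one, then parse only that one."""
    candidates = split_fn(text)      # step 1: ordered list of sandhi splits
    if not candidates:
        return None, []
    chosen = choose_fn(candidates)   # the user (or a heuristic) picks a split
    return chosen, parse_fn(chosen)  # step 2: parse the chosen split


# Stand-ins (assumptions) so the sketch is runnable without the library.
def fake_split(text):
    return [
        ['Darmakzetre', 'kurukzetre', 'samavetAH', 'yuyutsavaH'],
        ['Darmakzetre', 'kurukzetre', 'samavetA', 'yuyutsavaH'],
    ]


def pick_first(candidates):
    # A real UI would show the list and let the user choose.
    return candidates[0]


def fake_parse(split):
    return [f'<parse of {" ".join(split)}>']


chosen, parses = split_then_parse(
    'Darmakzetre kurukzetre samavetA yuyutsavaH',
    fake_split, pick_first, fake_parse,
)
print(chosen)
print(parses)
```

The point of the separation is that the parse step only runs on the one split the user selected, rather than on every candidate.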
I would like to understand your perspective on how you see yourself using this.
Sure, let's sync over email.