Finding the most likely split using scoring
Opening a new issue to discuss scoring approaches for finding the most likely split, such as those raised in https://github.com/kmadathil/sanskrit_parser/issues/84#issuecomment-393878940 and https://github.com/kmadathil/sanskrit_parser/issues/84#issuecomment-393893866
Regarding @drdhaval2785's comment - lexeme-frequency-based scoring is a traditional SMT-like approach. AFAIK, such purely statistical techniques have largely been superseded by neural network (NMT-based) approaches. The scoring branch has a first attempt at a word2vec-based scoring approach that uses sentencepiece for tokenization. This is probably not the state of the art, and it could benefit from more data and some positional information, but it seems to work reasonably well. It can be activated by passing the --score option to the lexical analyzer to score the splits and sort them. The score is a log-likelihood-type score, so it will be negative, but higher is better.
```
python -m sanskrit_parser.lexical_analyzer.sanskrit_lexical_analyzer --input-encoding SLP1 "vedAntaSAstraprakriyAm" --split --score
Splits:
([vedAnta, SAstra, prakriyAm], -42.655685)
([vedAnta, SAstra, prakriyA, Am], -53.865582)
([vedAntaSAstra, pra, kriyAm], -60.299915)
([ve, adAnta, SAstra, prakriyAm], -61.211525)
([vedAntaSAstra, prakriyAm], -61.816097)
([vedAntaSAstra, prakriyA, am], -68.705406)
([vedAntaSAstra, prakriyA, Am], -68.761444)
([vedAnta, SAH, tra, prakriyAm], -72.599167)
([vedAntaSAstra, prakri, yAm], -74.081451)
([vedAntaSAstra, pra, kriya, am], -78.149681)
Total time for graph generation + find paths 0:00:02.466483
```
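For concreteness, here is a minimal sketch of how a sentencepiece + word2vec scorer of this kind might work. This is not the actual code on the scoring branch; the model file names, the out-of-vocabulary penalty, and the similarity-to-log-likelihood mapping are all assumptions for illustration.

```python
# Minimal sketch (NOT the scoring branch's actual implementation) of scoring a
# candidate split with sentencepiece tokenization + word2vec similarity.
# Model file names are hypothetical placeholders.
import numpy as np
import sentencepiece as spm
from gensim.models import Word2Vec

sp = spm.SentencePieceProcessor(model_file="sanskrit_sp.model")  # hypothetical
w2v = Word2Vec.load("sanskrit_w2v.model")                        # hypothetical

def score_split(words):
    """Return a log-likelihood-style score: always negative, higher is better."""
    # Break each word of the candidate split into sentencepiece pieces.
    pieces = [p for w in words for p in sp.encode(w, out_type=str)]
    score = 0.0
    for prev, cur in zip(pieces, pieces[1:]):
        if prev in w2v.wv and cur in w2v.wv:
            # Map cosine similarity in [-1, 1] to a pseudo-probability in (0, 1].
            sim = w2v.wv.similarity(prev, cur)
            score += np.log(np.clip((sim + 1) / 2, 1e-9, 1.0))
        else:
            score += np.log(1e-9)  # heavy penalty for unseen pieces
    return score

# Candidate splits are then sorted by this score, descending:
# sorted(candidate_splits, key=score_split, reverse=True)
```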
It does get tripped up by the naraH vA nArI example that @codito gave, but it would be interesting to collect the overall lexical accuracy using the scripts in metrics. (I haven't been able to run it fully; it may need to be parallelized, similar to what @codito did with the word accuracy scripts.)
I noticed that the Lua version of OpenNMT has added a language model; it might be worthwhile to try that out as well.
Any other approaches people have in mind?
> ([vedAntaSAstra, prakriyAm], -61.816097)

Why is this ranked only fifth?
Probably because the long words "vedAntaSAstra" and "prakriyAm" did not occur (or co-occur) in the training data, whereas vedAnta, SAstra, and prakriyA (or similar words) occurred more frequently. Isn't the top split a reasonable one?
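For intuition, under a penalty scheme like the hypothetical one sketched above, a single piece unseen in training contributes log(1e-9) ≈ -20.7 on its own, which is roughly the entire gap between the top split (-42.66) and the unsplit-compound version (-61.82).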
An update: running the lexical scoring on splits using the test data from here shows that scoring and re-sorting the results does improve the accuracy. The results below are from the first ~1000 sentences in the test dataset. As mentioned in https://github.com/kmadathil/sanskrit_parser/issues/84#issuecomment-371734663, the scripts compute the BLEU and CHRF scores (higher is better for both metrics). In the table below, inria uses just the inria dataset for the lexical lookup, and combined uses both inria and sanskrit_data. The second and fourth rows are obtained by scoring the top 10 paths returned by the lexical analyzer and then sorting them by score (as shown in the example in the first post).
Name | BLEU | CHRF |
---|---|---|
inria | 59.38 | 0.89 |
inria top 10 scored and sorted | 74.88 | 0.93 |
combined | 43.99 | 0.90 |
combined top 10 scored and sorted | 67.66 | 0.95 |
The second and fourth rows have higher BLEU and CHRF scores, showing that scoring does help. The combined lookup appears to be better in terms of CHRF, while using just the inria database appears to get a better BLEU score. I am not sure whether either of these metrics is ideal for this task, but they were a good starting point. Subjectively, the combined lookup + sorting the top 10 appears to generate better splits.
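For reference, here is a minimal sketch of how BLEU and CHRF could be computed for this kind of evaluation, using NLTK. The actual scripts in metrics may use different implementations or settings; the helper and example sentences below are illustrative assumptions.

```python
# Sketch of a BLEU/CHRF evaluation over split outputs, using NLTK.
# The actual metrics scripts may differ in implementation and parameters.
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.chrf_score import corpus_chrf

def evaluate(references, hypotheses):
    """references/hypotheses: lists of split sentences as space-joined strings."""
    ref_tokens = [[r.split()] for r in references]  # corpus_bleu wants token lists
    hyp_tokens = [h.split() for h in hypotheses]
    bleu = corpus_bleu(ref_tokens, hyp_tokens) * 100  # scale to match the table
    chrf = corpus_chrf(references, hypotheses)        # already in [0, 1]
    return bleu, chrf

# Illustrative call with a single exactly-matching sentence:
refs = ["vedAnta SAstra prakriyA Am"]
hyps = ["vedAnta SAstra prakriyA Am"]
print(evaluate(refs, hyps))  # exact match -> (100.0, 1.0)
```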
Finally trying to get into this - your scores are for the first split, right? @avinashvarna
It's been a while, but using the --score option will cause all paths returned by findAllPaths to be scored and sorted. The BLEU and CHRF scores reported in https://github.com/kmadathil/sanskrit_parser/issues/93#issuecomment-396025499 are when only the top split (without/with scoring & sorting) is used as the final output. Did I understand your question correctly?
Thanks! The problem here is how we should compute the scores for lexical_analyzer. The top split isn't special - it's just one of the splits. Ordering is something we do. Let me ponder this for a while.
Correct. That is why it currently computes scores for all paths returned by findAllPaths, which are themselves sorted by length. So this is equivalent to finding the N shortest paths, scoring them using the lexical criterion, and then outputting the one with the highest score. This results in the BLEU and CHRF scores reported earlier.
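In pseudocode, the pipeline being described is roughly the following; find_all_paths and score_split are placeholders for the analyzer's path enumeration and the word2vec-based lexical scorer, not actual API names.

```python
# Rough sketch of the rerank pipeline described above. find_all_paths and
# score_split are hypothetical stand-ins for the analyzer's path enumeration
# and the lexical scorer, respectively.
def best_split(sentence, n=10):
    candidates = find_all_paths(sentence, max_paths=n)  # N shortest paths first
    reranked = sorted(candidates, key=score_split, reverse=True)
    return reranked[0]  # the split with the highest lexical score
```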
Ah, I think I was confusing myself by overloading "score" incorrectly.
Let me state my current understanding of the scoring branch:
- There's a lexical scorer, activated by --score, which is a word2vec/sentencepiece model trained on the data here. Right? If so, was it trained on the "train" or "test" sets, or both?
- The BLEU/CHRF scoring is independent, and operates on the output of either the lexical analyzer alone, or the lexical analyzer + the scoring in 1), right?
Related question: shouldn't sentencepiece/word2vec give us an alternate path to go directly to a split sentence, without passing through our code? @avinashvarna
Ah. Now I see the source of the confusion.
- Correct on the behavior of --score. The scorer was trained on the "train" data. It was then tested on the "test" data set to obtain the BLEU/CHRF scores, to figure out whether the "lexical scoring and sorting" helps improve the split accuracy.
- Correct. The BLEU/CHRF scores are computed using the first split output, to evaluate the overall approach of just the analyzer vs. "analyzer + scoring + sorting".
Sentencepiece is an unsupervised text tokenizer, so it learns some structure from the data, but not enough to take a sentence and split it, I feel. I think of it as a way to learn an efficient, fixed-size dictionary to represent a dataset. Word2vec then just maps these tokens to a vector space so that we can do the things we normally do with real-valued vectors, such as measure similarity/distance, etc. So the two of them alone cannot be used to go directly to a split, AFAIK.
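To make that division of labor concrete, here is a hedged sketch of training the two pieces. The file names, vocabulary size, and hyperparameters are illustrative assumptions, not what the scoring branch actually used.

```python
# Sketch: train a sentencepiece tokenizer, then a word2vec model over its
# pieces. corpus.txt (one SLP1 sentence per line) and all hyperparameters
# here are illustrative assumptions.
import sentencepiece as spm
from gensim.models import Word2Vec

# 1. Learn a fixed-size "dictionary" of pieces from raw text (unsupervised).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sanskrit_sp", vocab_size=8000
)
sp = spm.SentencePieceProcessor(model_file="sanskrit_sp.model")

# 2. Tokenize the corpus and embed the pieces in a vector space.
with open("corpus.txt") as f:
    tokenized = [sp.encode(line.strip(), out_type=str) for line in f]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)
w2v.save("sanskrit_w2v.model")

# Pieces can now be compared with ordinary vector operations, e.g.:
# w2v.wv.similarity(piece_a, piece_b)  # cosine similarity
```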
However, they can be used to then train a model to do the splitting without going through our code, which is exactly what this project does. As I mentioned in https://github.com/kmadathil/sanskrit_parser/issues/85#issuecomment-396025900, this approach already gives us better performance on the test data set using these metrics.
For easier comparison, here is the complete table:
Name | BLEU | CHRF |
---|---|---|
inria | 59.38 | 0.89 |
inria top 10 scored and sorted | 74.88 | 0.93 |
combined | 43.99 | 0.90 |
combined top 10 scored and sorted | 67.66 | 0.95 |
transformer model | 83.74 | 0.96 |
The last row is the best (with the usual caveats: on this limited test dataset, using these specific metrics, which may or may not be meaningful).
Is there an improvement after two years?