open-tamil
Corpus - given a corpus, generate uni- and bi-gram data
This data can be used in tasks such as word prediction and spelling correction.
The analysis task is captured here: take Wikipedia, Project Madurai, or some other (web-based) corpus and generate the data as text files.
This data will be in a format usable by Bayesian filters and n-gram predictors (a separate task).
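As a rough sketch of the kind of generation described above (pure Python, no external dependencies; the toy corpus string and the function name are illustrative assumptions, not part of open-tamil):

```python
from collections import Counter

def ngram_counts(text):
    """Count unigrams and bigrams from whitespace-tokenized text."""
    words = text.split()
    unigrams = Counter(words)
    # Pair each word with its successor to form bigrams.
    bigrams = Counter(zip(words, words[1:]))
    return unigrams, bigrams

# Toy corpus; a real run would stream a Wikipedia or Project Madurai dump
# and write the resulting counts out as text files.
uni, bi = ngram_counts("அவன் வந்தான் அவன் போனான்")
```

Each counter can then be serialized one entry per line ("word count" / "word1 word2 count"), which is the sort of plain-text format a Bayesian filter or predictor can load cheaply.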
What do you mean by corpus here? Just a collection of unique words from Wikipedia?
Is this kind of file enough: https://github.com/tshrinivasan/tamil-wikipedia-word-list/blob/master/tamil.sort.unique ?
@tshrinivasan As I understand it: that tamil.sort.unique file can serve as the "unigram data", which is useful for the "correction of spelling" part of what Annamalai mentioned. But "bigram data", "Bayesian filters", and "n-gram predictors" also need information about which word followed which, and for that we would have to go back and trawl through the Wikipedia dump again.
Regarding the issue itself: an interesting side effect of the வலி மிகுதல் aspect of Tamil grammar is that it could make predictions much better. Once I type something like "படித்துப்", the next word is almost certainly going to start with "ப", so words starting with other letters can be sharply down-weighted, making the prediction much more likely to be correct. The algorithm will automatically take care of this, though, so this is just an idle observation of how aspects of grammar, decided thousands of years ago, end up helping computationally in the modern age!
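To make the observation concrete, here is a deliberately oversimplified sketch of how a predictor could exploit that constraint. The `rerank` function and the single-rule check are hypothetical illustrations, not an actual வலி மிகுதல் implementation (the real rule set is far richer):

```python
def rerank(candidates, prev_word):
    """Reorder prediction candidates so that ones consistent with
    வலி மிகுதல் come first. Toy rule: if the previous word ends in
    'ப்', the next word is expected to start with 'ப'."""
    if prev_word.endswith("ப்"):
        # Stable sort: words starting with "ப" float to the front,
        # everything else keeps its relative order behind them.
        return sorted(candidates, key=lambda w: not w.startswith("ப"))
    return candidates

# After "படித்துப்", "படம்" outranks candidates with other initials.
ranked = rerank(["அவன்", "படம்", "வீடு"], "படித்துப்")
```

A real system would fold this into the probability model (scaling bigram scores) rather than hard-sorting, but the shape of the benefit is the same.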
Nice. Thanks for the explanation @digital-carver
Found that this code helps:
import nltk

words = nltk.word_tokenize(my_text)
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)
Can we use NLTK, or do we need to write the bigram extraction from scratch?
@tshrinivasan - thanks for researching this; I would like to gather data for known corpora like Project Madurai and Wikipedia and have this information available as part of the open-tamil module, so we can write code like this:
from tamil import langmodel
wikiModel_prob = langmodel.Wikipedia.get_probability("TAMIL WORD HERE", langmodel.BIGRAM)
i.e., the end user need not have much expertise with the corpus itself to use the corpus-derived information.
So I'm open to using NLTK to generate the data, but not to having NLTK as a dependency.
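A minimal sketch of what such a module might look like. Everything here is hypothetical, following the proposed API above: the class name, the tuple-based bigram lookup, and the toy counts are all assumptions, not existing open-tamil code; a shipped `Wikipedia` model would load counts precomputed from the dump:

```python
from collections import Counter

UNIGRAM, BIGRAM = 1, 2  # assumed order constants, per the proposal

class CorpusModel:
    """Probability lookups over precomputed n-gram counts."""

    def __init__(self, unigram_counts, bigram_counts):
        self.unigrams = Counter(unigram_counts)
        self.bigrams = Counter(bigram_counts)
        self.total = sum(self.unigrams.values())

    def get_probability(self, words, order=UNIGRAM):
        if order == UNIGRAM:
            return self.unigrams[words] / self.total
        # BIGRAM: conditional P(w2 | w1) estimated from raw counts.
        w1, w2 = words
        if self.unigrams[w1] == 0:
            return 0.0
        return self.bigrams[(w1, w2)] / self.unigrams[w1]

# Toy model standing in for the real corpus-derived data.
Wikipedia = CorpusModel({"அவன்": 2, "வந்தான்": 1},
                        {("அவன்", "வந்தான்"): 1})
```

Real usage would also want smoothing (unseen n-grams get probability zero here), but that can layer on top of the same count files.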
@digital-carver: thanks for the nice insights. Open-Tamil has a Trie data structure we can use to represent word n-gram or letter n-gram ordering sequences. This will be based on a corpus, and the data can be provided from open-tamil's tamil.langmodel
(a new module).
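For readers unfamiliar with the idea, a bare-bones trie of the kind mentioned might look like this. This is a generic sketch, not the actual open-tamil Trie API; for Tamil input the sequences should be Tamil letters (grapheme clusters), not raw Unicode code points:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.is_word = False

class Trie:
    """Stores letter sequences; prefix queries support prediction."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, letters):
        node = self.root
        for ch in letters:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def has_prefix(self, letters):
        node = self.root
        for ch in letters:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True
```

Annotating each node with a count turns the same structure into a letter n-gram model: walking a prefix gives the frequency distribution of possible continuations.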