arabicstemmer
Improving stemmer - milestone 2
- [ ] Clear prefixes first, clear suffixes second
- [ ] al, kal, fal, bal, bb should be marked first, and set is_noun
- [ ] aa, ww, ff should be marked first
- [ ] Choose greedily between noun suffixes and verb suffixes: طالبات
- [ ] الزمان
- [x] والشمس
- [x] لمعالجة
- [ ] أفنلزمكموها
- [ ] The prefix س attaches only to imperfect (present-tense) verbs
- [ ] Detect the است prefix, use it to decide whether the word is a noun or a verb, and increase the size condition by 3: نسنعين
- [ ] In suffixes, the sound masculine plural (جمع المذكر السالم) rarely has a stem shorter than 4 letters
- [ ] For imperfect verbs, suffix removal must leave a length of 4, because the imperfect has a one-letter prefix
- [x] study the case of والأمر
- [ ] make suffixes set/unset is_noun and is_verb
- [ ] don't stem a word if it contains a number (Arabic or English digits) or size = odd
- [ ] define regions before stemming starts; test everything, then perform stemming
- [ ] blacklist: ignore some predefined words, or is it not worth it?
- [ ] remove feminine marks and study feminine patterns
- [ ] remove broken plural infixes: أطفال، كواسر، نُمور
- [ ] consider vocalization when it exists:
  - [ ] tanween means a noun
  - [ ] detect and process vocalized texts
- [ ] Study patterns and guess the pattern before stemming
- [ ] Verb conjugation prefixes (a, t, y, n): if the word has a suffix, remove the prefix along with it
- [ ] Rename routines to more descriptive names
- [ ] study Alef-tanween
- [ ] study idgham
- [ ] Calculate the probability of a word being a noun or a verb
- [ ] Prefix confusion
- [ ] 2-letter words
- [ ] improve from ISRI ideas
- [ ] improve from Khoja ideas
- [ ] improve from Tashaphyne ideas
- [ ] optimize performance
- [ ] filter stop words
- [ ] البستان
- [ ] سأَلَهُم و سأُلْهِمُ
- [ ] مدرستي => مدرس، مدرسة or لعبتي => لعب، لعبة
- [ ] فناء gives ناء; also فنون gives نون
- [ ] فردوس gives ردوس
- [ ] handles الزمان as اِلْزمانْ and gives اَلْزً
- [ ] مَالُكَ conflicts with مالِك, the active participle of the verb ملك
- [ ] اللغة
- [x] treat feminine plural as noun
- [x] treat the feminine dual (مثنى المؤنث) as noun
- [ ] study the case of hamzated verbs (الفعل المهموز)
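
Several items above (prefixes before suffixes, noun-only prefixes setting is_noun, and a minimum stem size) can be illustrated with a small Python sketch. This is not the project's Snowball code: the function name `stem_sketch`, the constant `MIN_STEM`, and the mapping of al, kal, fal, bal to ال, كال, فال, بال are assumptions made for illustration only.

```python
# Hypothetical sketch: names, affix lists and the size threshold are
# illustrative assumptions, not taken from the actual stemmer.

# Noun-only prefixes, longest first so that "بال" is tried before "ال".
NOUN_PREFIXES = ("كال", "فال", "بال", "ال")
# Feminine plural ending, treated here as a noun marker.
NOUN_SUFFIXES = ("ات",)
# Assumed minimum number of letters that must remain after removing an affix.
MIN_STEM = 3


def stem_sketch(word):
    """Return (stem, is_noun): clear prefixes first, suffixes second."""
    is_noun = False

    # Step 1: prefixes. A noun-only prefix also marks the word as a noun.
    for prefix in NOUN_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            is_noun = True
            break

    # Step 2: suffixes, applied to what is left after prefix removal.
    for suffix in NOUN_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            is_noun = True
            break

    return word, is_noun
```

With these assumed lists, `stem_sketch("الزمان")` strips only the definite article, and a word such as فناء is left untouched because it starts with none of the listed prefixes, which avoids the فناء => ناء problem noted above for this particular rule set.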
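
The items about skipping words that contain numbers and about using vocalization (tanween means a noun) could be handled in a check that runs before stemming. The following Python sketch is one possible shape for it; `precheck` and its return values are hypothetical names, and the "size = odd" condition is left out because the note does not say how it should apply.

```python
# Hypothetical pre-stemming check; not part of the existing stemmer.

# Tanween marks: fathatan, dammatan, kasratan.
TANWEEN = {"\u064B", "\u064C", "\u064D"}
# Short vowels, tanween, shadda and sukun (U+064B to U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def precheck(word):
    """Return (should_stem, is_noun, bare_word) for a single token."""
    # str.isdigit() is true for Western digits and Arabic-Indic digits alike.
    should_stem = not any(ch.isdigit() for ch in word)
    # Tanween only appears on nouns, so it can set the noun flag early.
    is_noun = any(ch in TANWEEN for ch in word)
    # Remove vocalization so later steps work on bare letters.
    bare_word = "".join(ch for ch in word if ch not in DIACRITICS)
    return should_stem, is_noun, bare_word
```

The noun flag is computed before the diacritics are stripped, since the tanween information is lost once the word is reduced to bare letters.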
Improve snowball ArabicStemmer in nltk