multi_rake
Empty list returned when working with Devanagari script
Hi, I'm working with texts in Devanagari script (a popular script used in India, unlike the Latin script used by English and similar languages). When I try to generate keywords, an empty list is returned. The code is below.
from multi_rake import Rake

full_text = "शेवणें आनी शेतकार एक आसलेलो शेतकार तेणें बरें शेत रोयलेलें रोयल्यार कितें जालें थाम वाडलें आनी इल्लें इल्लें करून पोटराक येयलें आनी थोडे दीस वयतकच कुचकुचीत गोट्याचें कणस सुटलें आनी वाऱ्याचेर बरें धोलूंक लागलें शेतकाराक सामकी उमेद जाली आतां म्हण लागलो रोकडेंच आपूण शेत लुंवतलो आनी भात घरा व्ह"
rake = Rake(max_words_unknown_lang=1)
keywords = rake.apply(full_text)  # returns an empty list for this text
It's hard for me to fix it without at least basic knowledge of this script.
I can point you to the problem in the code, though.
There is a regexp, `\p{L}+`, that processes input text in order to count words properly. It keeps only letters: `hello, world!` is transformed into `hello world`.
When I pass `शेवणें आनी शेतकार`, it is transformed into `शे वणें आनी शे तका र`. It introduces additional spaces that break subsequent logic. To stay in line with the general logic, it should have stayed as `शेवणें आनी शेतकार`.
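A likely reason for the split, sketched below with only the standard library: Devanagari vowel signs and the anusvara are combining marks (Unicode category `M`), not letters (category `L`), so a letters-only pattern cannot match a whole word. The helper `letter_runs` is hypothetical and only emulates a letters-only match. It is not the library's code, and its output is not meant to reproduce the library's exact output.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    # True for any character whose Unicode general category starts with "L"
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    # Hypothetical helper: collect maximal runs of letter characters,
    # roughly what a letters-only pattern would match.
    return ["".join(group) for letter, group in groupby(text, key=is_letter) if letter]


# The first word of the sample text, written with escapes so the code points are explicit
word = "\u0936\u0947\u0935\u0923\u0947\u0902"  # शेवणें

for ch in word:
    # Consonants report "Lo"; the vowel sign and the anusvara report "Mn"
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))

print(letter_runs("hello, world!"))
print(letter_runs(word))  # the word comes back in pieces, without its marks
```

If this is indeed the cause, allowing combining marks alongside letters in the pattern might be an alternative to dropping the regexp altogether.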
Maybe we don't need the regexp for this script and could simply split sentences on whitespace? I have no idea whether this is the right thing to do.
Yes, splitting on whitespace is the way to go. It would be great if that were an option: clean out the punctuation, then split the sentence into words on whitespace.
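A minimal sketch of that suggestion, assuming "punctuation" means any character in a Unicode `P` category (which includes the Devanagari danda `।`). The function name `split_on_whitespace` is made up for illustration and is not part of multi_rake.

```python
import unicodedata


def split_on_whitespace(text):
    # Replace every punctuation character (Unicode category P*) with a space,
    # then split on runs of whitespace. Combining marks are left untouched,
    # so Devanagari words stay whole.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return cleaned.split()


print(split_on_whitespace("hello, world!"))
print(split_on_whitespace("शेवणें आनी शेतकार।"))
```

Replacing punctuation with a space rather than deleting it keeps words apart when they are separated only by a punctuation mark.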