multi_rake

Empty list returned when working with Devanagari script

Open jovidsilva opened this issue 3 years ago • 2 comments

Hi, I'm working with texts in Devanagari script (a popular script used in India, unlike the Latin script used by English and similar languages). When I try to generate keywords, it returns an empty list. The code is below.

full_text="शेवणें आनी शेतकार एक आसलेलो शेतकार तेणें बरें शेत रोयलेलें रोयल्यार कितें जालें थाम वाडलें आनी इल्लें इल्लें करून पोटराक येयलें आनी थोडे दीस वयतकच कुचकुचीत गोट्याचें कणस सुटलें आनी वाऱ्याचेर बरें धोलूंक लागलें शेतकाराक सामकी उमेद जाली आतां म्हण लागलो रोकडेंच आपूण शेत लुंवतलो आनी भात घरा व्ह"

from multi_rake import Rake

rake = Rake(max_words_unknown_lang=1)

keywords = rake.apply(full_text)

jovidsilva • Jun 14 '21 05:06

It's hard for me to fix this without at least a basic knowledge of the script, but I can point you to the problem in the code. There is a regexp, \p{L}+, that processes the input text in order to count words properly. It keeps only letters, so "hello, world!" is transformed into "hello world". When I pass शेवणें आनी शेतकार, it is transformed into शे वणें आनी शे तका र: the regexp introduces additional spaces that break the subsequent logic. To stay in line with the general logic, the text should have remained शेवणें आनी शेतकार. Maybe we don't need the regexp for this script and could simply split sentences on whitespace? I have no idea whether this is the right thing to do.
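A likely explanation for the extra spaces is that Devanagari vowel signs and the anusvara are classified by Unicode as combining marks (general category M), not letters (category L), so a letters-only pattern cuts every word at each mark. The sketch below illustrates this with the standard library only. It emulates a letters-only filter with unicodedata rather than calling multi_rake's actual regexp, and keep_runs is a hypothetical helper, not part of the library.

```python
import unicodedata


def keep_runs(text, allowed_prefixes):
    """Join maximal runs of characters whose Unicode general category
    starts with one of allowed_prefixes, separated by single spaces."""
    runs, current = [], []
    for ch in text:
        if unicodedata.category(ch).startswith(allowed_prefixes):
            current.append(ch)
        elif current:
            # A disallowed character ends the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return " ".join(runs)


sample = "शेवणें आनी शेतकार"

# Letters only (roughly what \p{L}+ keeps): words break at every vowel sign.
letters_only = keep_runs(sample, ("L",))

# Letters plus combining marks (roughly [\p{L}\p{M}]+): words stay intact.
letters_and_marks = keep_runs(sample, ("L", "M"))

print(len(sample.split()), "words in the input")
print(len(letters_only.split()), "fragments with letters only")
print(letters_and_marks == sample)
```

If this diagnosis holds, extending the pattern to accept marks as well as letters would be one alternative to plain whitespace splitting, but that is an assumption about a possible fix, not something confirmed in this thread.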

vgrabovets • Jun 25 '21 06:06

Yes, splitting on whitespace is the way to go. It would be great if that were an option: clean out the punctuation, then split the sentence into words on whitespace.
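The approach described here could look roughly like the following. This is a minimal sketch, not an existing multi_rake option: split_words is a hypothetical name, and it assumes that "punctuation" means any character in a Unicode P* category, which covers the Devanagari danda (।) as well as ASCII punctuation.

```python
import unicodedata


def split_words(text):
    """Replace punctuation with spaces, then split on whitespace.

    Combining marks are left untouched, so Devanagari words stay whole.
    """
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return cleaned.split()


print(split_words("hello, world!"))
print(split_words("शेवणें आनी शेतकार।"))
```

Replacing punctuation with a space, instead of deleting it, keeps words apart when the text has no space after a punctuation mark.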

jovidsilva • Jun 25 '21 07:06