Should not split hyphenated words

Open arjunmenon opened this issue 6 years ago • 0 comments

Hey, the analyser splits hyphenated words. When a hyphenated word is split, it loses its meaning in the sentence.

text = "India's largest lender State Bank of India SBI and India Mortgage Guarantee Corporation (IMGC) signed a pact to offer mortgage guarantee scheme for prospective non-salaried and self-employed home loan customers. The offering will help increase home loan eligibility up to 15% within the regulatory norms. The MoU between SBI and IMGC is a strategic initiative which will enable to improve housing loan volumes in the non-salaried segment."
irb(main):004:0> rake.analyse text, RakeText.FOX, true
23.83 - help increase home loan eligibility
16.83 - self-employed home loan customers
16.33 - improve housing loan volumes
16.00 - offer mortgage guarantee scheme
14.33 - india mortgage guarantee corporation
4.00 - strategic initiative
4.00 - regulatory norms
4.00 - largest lender
3.83 - india sbi
3.50 - -salaried segment # <==== here, the word was non-salaried. It removed "non"
2.33 - india
1.50 - sbi
1.50 - -salaried
1.00 - enable
1.00 - mou
1.00 - offering
1.00 - prospective
1.00 - pact
1.00 - signed
1.00 - imgc
1.00 - bank
0.00 - 15%

The affected word was non-salaried. The analyser removed "non" because it is in the stoplist. Ideally that should not happen when the stopword is one part of a hyphenated compound that forms a single word, as in the example above.
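To illustrate the behaviour I would expect, here is a minimal standalone sketch. It does not use or describe rake-text-ruby's internals; `hyphen_safe_stopword_pattern` and `candidate_phrases` are hypothetical helper names, and the stoplist is a tiny made-up one. The idea is to treat a stopword as a phrase delimiter only when it is not attached to a hyphen.

```ruby
# Hypothetical helper, not part of rake-text-ruby: builds a stopword pattern
# that ignores stopwords attached to a hyphen (e.g. the "non" in "non-salaried").
def hyphen_safe_stopword_pattern(stopwords)
  alternation = stopwords.map { |w| Regexp.escape(w) }.join("|")
  # (?<![\w\-]) : not preceded by a word character or a hyphen
  # (?![\w\-])  : not followed by a word character or a hyphen
  /(?<![\w\-])(?:#{alternation})(?![\w\-])/i
end

# Hypothetical helper: splits a sentence into candidate phrases at stopwords,
# keeping hyphenated compounds intact.
def candidate_phrases(sentence, stopwords)
  pattern = hyphen_safe_stopword_pattern(stopwords)
  sentence.downcase
          .gsub(pattern, "|")   # replace each standalone stopword with a delimiter
          .split("|")
          .map(&:strip)
          .reject(&:empty?)
end

stopwords = %w[non in the]      # tiny illustrative stoplist, not the FOX list
sentence  = "improve housing loan volumes in the non-salaried segment"
p candidate_phrases(sentence, stopwords)
# => ["improve housing loan volumes", "non-salaried segment"]
```

With a check like this, a standalone "non" would still act as a delimiter, but "non-salaried" would survive as a single word.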

arjunmenon, Mar 22 '18 15:03