PeARS-orchard Non latin keywords are not getting indexed while indexing pages

Open stultus opened this issue 6 years ago • 4 comments

I tested with a Malayalam website: https://smc.org.in
I'm getting proper results for English keywords, but there are no results for Malayalam keywords. Even exact words do not match (e.g. രചന).

Aug 12 '18 12:08 stultus

Is this perhaps because of the boilerplate removal? What happens if you do the following:

In app/indexer/htmlparser.py, extract_from_url function:

Comment out the boilerplate removal and use BeautifulSoup instead to get the text of the page:

    #body_str = remove_boilerplates(req)
    body_str = ' '.join(bs_obj.findAll(text=True))

Add Malayalam to the list of 'okay' languages:

    if detect(title + " " + body_str) not in ["en","ml"]:
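
To make the second change easier to extend to other languages, the check could be pulled into a small helper. This is only a sketch, not the actual PeARS code: `ALLOWED_LANGS` and `is_allowed_language` are hypothetical names, and the detector is passed in as a parameter so the snippet runs without extra dependencies (in the indexer it would presumably be langdetect's `detect`).

```python
# Hypothetical helper, not part of PeARS: wraps the language check from
# extract_from_url so the accepted languages live in one place.
ALLOWED_LANGS = {"en", "ml"}  # ISO 639-1 codes for English and Malayalam


def is_allowed_language(title, body_str, detect, allowed=ALLOWED_LANGS):
    """Return True if the language detected for the page is in `allowed`.

    `detect` is any callable that maps a string to a language code,
    for example langdetect's `detect`.
    """
    return detect(title + " " + body_str) in allowed
```

With this, the condition in `extract_from_url` would read `if not is_allowed_language(title, body_str, detect):`.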

Aug 12 '18 13:08 minimalparts

This works on an all-Malayalam web page, e.g. https://smc.org.in/fonts/. But on a page with mixed content (English and Malayalam), only the English content is indexed, e.g. https://smc.org.in
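
One way to narrow this down would be to check whether the Malayalam text is still present in the extracted `body_str` before it reaches language detection and indexing. The sketch below is a diagnostic only, under the assumption that a rough per-script character count is enough to tell; `script_of` and `script_counts` are hypothetical helpers, not PeARS functions, and they use only the standard library.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Return a rough script label for a character, or None.

    The label is the first word of the Unicode character name,
    e.g. "LATIN" or "MALAYALAM". Letters and combining marks
    (such as vowel signs) are labelled; everything else is skipped.
    """
    if not (ch.isalpha() or unicodedata.category(ch).startswith("M")):
        return None
    name = unicodedata.name(ch, "")
    return name.split()[0] if name else None


def script_counts(text):
    """Count labelled characters per script in `text`."""
    return Counter(s for s in map(script_of, text) if s is not None)


if __name__ == "__main__":
    # Mixed English/Malayalam sample, standing in for an extracted body_str.
    print(script_counts("fonts രചന"))
```

If the count for `MALAYALAM` is already zero for the mixed page, the text is being lost during extraction; if it is non-zero, the loss happens later, during language detection or indexing.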

Aug 12 '18 17:08 stultus

This is an old issue, but I think it is still relevant. @stultus, have you checked how the new version behaves wrt code-switching? My guess is: badly :(

Oct 23 '22 10:10 minimalparts

I will test and update this issue.

Oct 23 '22 10:10 stultus
