PeARS-orchard
Non-Latin keywords are not indexed when indexing pages
- I tested with a Malayalam website: https://smc.org.in
- I'm getting proper results for English keywords
- But there are no results for Malayalam keywords. Even exact words do not match (e.g. രചന)
Is this perhaps because of the boilerplate removal? What happens if you do the following:
In the extract_from_url function of app/indexer/htmlparser.py:
- Comment out the boilerplate removal and use BeautifulSoup instead to get the text of the page:
  #body_str = remove_boilerplates(req)
  body_str = ' '.join(bs_obj.findAll(text=True))
- Add Malayalam to the list of 'okay' languages:
  if detect(title + " " + body_str) not in ["en","ml"]:
This works on a page that is entirely in Malayalam (e.g. https://smc.org.in/fonts/). But on a page with mixed content (English and Malayalam), only the English content is indexed (e.g. https://smc.org.in).
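One possible reason for the mixed-page behaviour is that the check above assigns a single language label to the whole page, so the minority language may be lost. A minimal sketch of an alternative, assuming it would be acceptable to group tokens by script instead of detecting one language per page: the helper names token_script and split_by_script are hypothetical and not part of PeARS, and the sketch uses only the Unicode block for Malayalam (U+0D00 to U+0D7F) rather than a real language detector.

```python
import unicodedata

# Malayalam Unicode block: U+0D00 to U+0D7F.
MALAYALAM_START, MALAYALAM_END = 0x0D00, 0x0D7F


def token_script(token):
    """Hypothetical helper: label a token 'ml', 'latin', or 'other'."""
    counts = {"ml": 0, "latin": 0}
    for ch in token:
        if MALAYALAM_START <= ord(ch) <= MALAYALAM_END:
            counts["ml"] += 1
        elif ch.isalpha() and "LATIN" in unicodedata.name(ch, ""):
            counts["latin"] += 1
    if counts["ml"] == 0 and counts["latin"] == 0:
        # Digits, punctuation, or scripts this sketch does not handle.
        return "other"
    return "ml" if counts["ml"] >= counts["latin"] else "latin"


def split_by_script(text):
    """Hypothetical helper: group whitespace-separated tokens by script."""
    groups = {"ml": [], "latin": [], "other": []}
    for token in text.split():
        groups[token_script(token)].append(token)
    return groups


if __name__ == "__main__":
    # Mixed English and Malayalam text, as on the smc.org.in front page.
    print(split_by_script("Fonts രചന download 2024"))
```

With something like this, the Malayalam and Latin-script tokens of a mixed page could each be passed on to indexing separately instead of the page being treated as a single language. Whether that fits the indexer's design is an open question.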
This is an old issue, but I think it is still relevant. @stultus, have you checked how the new version behaves wrt code-switching? My guess is: badly :(
I will test and update this issue.