Non-Latin keywords are not indexed when indexing pages

Open • stultus opened this issue 6 years ago • 4 comments

  • I tested with a Malayalam website: https://smc.org.in
  • I get proper results for English keywords.
  • But there are no results for Malayalam keywords. Even exact words do not match (e.g. രചന).

stultus, Aug 12 '18 12:08

Is this perhaps because of the boilerplate removal? What happens if you do the following:

In app/indexer/htmlparser.py, extract_from_url function:

  • comment out the boilerplate removal and use BeautifulSoup instead to get the text of the page:

    #body_str = remove_boilerplates(req)
    body_str = ' '.join(bs_obj.findAll(text=True))

  • add Malayalam to the list of 'okay' languages:

    if detect(title + " " + body_str) not in ["en","ml"]:
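
The two changes above can be sketched in isolation roughly as follows. This is a stand-alone illustration, not the code in app/indexer/htmlparser.py: `extract_text` and `is_indexable` are hypothetical names, the standard-library `html.parser` stands in for BeautifulSoup so that the snippet runs without dependencies, and the language detector is passed in as a parameter (in PeARS it would presumably be the `detect` function used in the line above).

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect all text nodes of a page without any boilerplate removal.

    Unlike findAll(text=True), this stand-in skips <script> and <style>
    contents, which is an assumption made here to keep the output readable.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep every non-empty text node, whatever its script.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page text as one space-joined string (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


# Languages accepted for indexing; "ml" is the code for Malayalam.
ALLOWED_LANGUAGES = ("en", "ml")


def is_indexable(title, body_str, detect, allowed=ALLOWED_LANGUAGES):
    """Mirror the check above: accept the page if its detected language is allowed.

    `detect` is any callable mapping a string to a language code.
    """
    return detect(title + " " + body_str) in allowed
```

Passing the detector as an argument is only a convenience for trying the check with a stub; it does not reflect how PeARS wires it up.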

minimalparts, Aug 12 '18 13:08

This works on an all-Malayalam web page, e.g. https://smc.org.in/fonts/. But on a page with mixed content (English and Malayalam), only the English content is indexed, e.g. https://smc.org.in
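
One way to narrow down where the Malayalam text gets lost on a mixed page might be to count tokens by script at each stage (after text extraction, after tokenization, and so on). A minimal sketch, assuming the Malayalam Unicode block U+0D00 to U+0D7F and plain whitespace tokenization; `count_tokens_by_script` is a hypothetical helper, not part of PeARS.

```python
import re

# Assumption: a token counts as Malayalam if it contains any character
# from the Malayalam Unicode block (U+0D00 to U+0D7F).
MALAYALAM = re.compile(r"[\u0D00-\u0D7F]")
LATIN = re.compile(r"[A-Za-z]")


def count_tokens_by_script(text):
    """Count whitespace-separated tokens by the script they contain."""
    counts = {"malayalam": 0, "latin": 0, "other": 0}
    for token in text.split():
        if MALAYALAM.search(token):
            counts["malayalam"] += 1
        elif LATIN.search(token):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts
```

If the Malayalam count is non-zero on the extracted text but drops to zero further down the pipeline, the loss happens after extraction rather than in the HTML parsing step.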

stultus, Aug 12 '18 17:08

This is an old issue, but I think it is still relevant. @stultus, have you checked how the new version behaves wrt code-switching? My guess is: badly :(

minimalparts, Oct 23 '22 10:10

I will test and update this issue.

stultus, Oct 23 '22 10:10