
replace 're' module with 'regex' module

Open hroberts opened this issue 5 years ago • 2 comments

The Python default 're' module does not recognize words correctly in Hindi. We should just replace 're' with 'regex' everywhere for consistency so this doesn't bite us again.
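For illustration, here is a minimal sketch of the difference being described (the exact splits depend on the Python and Unicode versions in use):

```python
import re       # stdlib
import regex    # third-party 'regex' package

text = "हिन्दी"  # a Hindi word containing Devanagari combining marks

# The stdlib 're' module's \w covers letters, digits and the underscore
# but not combining marks, so the vowel signs break the word apart.
print(re.findall(r"\w+", text))     # expect fragments, e.g. ['ह', 'न', 'द']

# The 'regex' module uses a broader Unicode definition of a word
# character that includes combining marks, so the word stays whole.
print(regex.findall(r"\w+", text))  # expect ['हिन्दी']
```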

Note that we may still get hit by this where packages we import use the 're' module, but we'll just have to deal with those as we find them.

hroberts · Feb 07 '19, 16:02

Matching words with \w wouldn't work in all cases anyway, and will lead us to various nasty hacks, e.g. is_logogram. This has become even more important now that I occasionally hear tentative talk about adding Arabic support and improving Chinese.

Also, Python's tokenize library is designed to work with Python source code, not human language, so we're using the wrong tool for the job here.

We should instead be using our language modules' methods to do the natural language processing, possibly adding a few more of them for Solr query tokenization.

pypt · Feb 21 '19, 12:02

The updated code does not use the tokenize module at all. It just uses a regex that includes \w plus the punctuation relevant to Solr queries:

https://github.com/berkmancenter/mediacloud/blob/master/mediacloud/mediawords/solr/query.py#L696
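For reference, a minimal sketch of that kind of tokenizer (this is not the linked Media Cloud code; the token pattern and function name here are illustrative assumptions):

```python
import regex

# Hypothetical sketch, not mediawords.solr.query: split a Solr query into
# terms plus the punctuation that matters to Solr syntax, relying on the
# 'regex' module's Unicode-aware \w so non-Latin terms stay intact.
TOKEN_PATTERN = regex.compile(r'\w+|[():"*~^+-]')

def tokenize_solr_query(query: str) -> list:
    return TOKEN_PATTERN.findall(query)

# The Hindi and Arabic terms come back as single tokens.
print(tokenize_solr_query('(media_id:1 AND "हिन्दी समाचार") OR العربية'))
```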

The Solr queries are much closer to a programming language than to human language, and we don't know what (human) language any given query is in; we just have the query itself. I strongly prefer keeping the existing, simple code that is working well now. If it breaks when we try to implement Arabic support, we can address it then, but the regex module should be able to handle Arabic words fine.

The is_logogram hack is not directly related to the query tokenization. It is needed to decide whether to require a word boundary before each term in the Solr-query-derived regex that is used to determine the relevance of spidered topic stories.
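To make that concrete, a small illustration (made-up text and term, not the Media Cloud relevance code): in a script written without spaces, requiring \b before a term can prevent it from matching at all.

```python
import regex

# In logographic text there are no spaces, so the position before an
# embedded term sits between two word characters and \b never matches there.
story_text = "该媒体云项目发布了新报告"   # made-up Chinese sentence containing the term
term = "媒体云"

print(bool(regex.search(r"\b" + regex.escape(term), story_text)))  # False: a boundary is required but none exists
print(bool(regex.search(regex.escape(term), story_text)))          # True: matches once the boundary is dropped
```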

-hal


hroberts · Feb 21 '19, 14:02