django-watson

Getting extremely broad search results when searching on username field

Open ianfitzpatrick opened this issue 7 years ago • 5 comments

I am running into a weird issue: when searching on a username (like [email protected]), for certain users I get extremely broad results, including users that definitely do not have that phrase in their title, description, or content fields. In one case I get 7000+ results in my queryset, even though the email in question is definitely only associated with one entry in my index.

To make things more confusing, some searches return as expected. If I search for "[email protected]", for instance, I get exactly one result, as would be expected since username is a unique field.

Here is my app config:

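from django.apps import AppConfig
from watson import search as watson


class UsersAppConfig(AppConfig):
    """
    Automatically import standalone signals file once app is ready.

    This gets around a circular import error we would otherwise face.
    """

    name = 'users'

    def ready(self):
        # Imported here rather than at module level so the app registry is ready.
        import signals
        from django.contrib.auth.models import User
        # CaseInsensitiveSearchAdapter is the custom adapter shown further down.
        watson.register(
            User, CaseInsensitiveSearchAdapter, fields=(
                'first_name',
                'last_name',
                'username'
            )
        )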

And the custom adapter I created based on some code you posted:

from watson import search as watson


class CaseInsensitiveSearchAdapter(watson.SearchAdapter):
    """Lower-case everything that goes into the search index."""

    def get_title(self, obj):
        return super(
            CaseInsensitiveSearchAdapter, self
        ).get_title(obj).lower()

    def get_description(self, obj):
        return super(
            CaseInsensitiveSearchAdapter, self
        ).get_description(obj).lower()

    def get_content(self, obj):
        return super(
            CaseInsensitiveSearchAdapter, self
        ).get_content(obj).lower()

I am using MySQL as my database. When I manually inspect the data in the index, I don't see any duplication of data. And if I do a normal contains query for "[email protected]" I only get one result.

Sorry this is not the best issue as I don't know how to provide a reduced case here. Maybe there is a forehead thunker here that sticks out though?

Thanks so much for your work on this project, it's really awesome. I'm in the process of replacing haystack + solr with this, and if I can just get this weird case figured out it will greatly reduce the moving pieces in my system.

ianfitzpatrick, Apr 25 '18 06:04

One idea I had was, could this be some weird interaction between the @ symbol and the query used in the MySQL backend? Just a WAG, but thought I'd throw it out there.

ianfitzpatrick, Apr 25 '18 15:04

Okay I think I'm on the right track with my @ symbol theory. If I change this line in backends.py:

RE_MYSQL_ESCAPE_CHARS = re.compile(r'["()><~*+-]', re.UNICODE)

to (add an @):

RE_MYSQL_ESCAPE_CHARS = re.compile(r'["()><~*+-]@', re.UNICODE)

And then enclose my actual search query text in " " I get the result I am expecting, exactly one result for "[email protected]".

According to the MySQL docs this is an exact phrase match, I believe. Relevant SO answer: https://stackoverflow.com/questions/8961148/mysql-match-against-when-searching-e-mail-addresses

I'm in a situation where I want flexibility: users can search on name or email, so for an email I want an exact match, but I want broader results when searching on name.

I still don't get why just some particular usernames (emails) trigger these very broad search results, whereas others do not. But I can live with that if I can just work around the issue.

So I think I just need to do some pre-processing on my search text: if I detect something email-like in it, auto-enclose it in quotes (my users will not have the savvy to do this themselves).
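The pre-processing step described here could look roughly like the sketch below. The function name quote_emails and the EMAIL_LIKE pattern are hypothetical, and the pattern is deliberately loose (it is not a full email validator). Whether the added quotes actually reach MySQL depends on how the backend escapes the query, so treat this as a starting point rather than a finished fix.

```python
import re

# Loose "looks like an email" pattern: text, @, text, dot, text.
# Quotes are excluded so tokens that are already quoted are left alone.
EMAIL_LIKE = re.compile(r'[^\s"@]+@[^\s"@]+\.[^\s"@]+')


def quote_emails(search_text):
    """Wrap each email-like token in double quotes; leave other words alone."""
    def wrap(token):
        if EMAIL_LIKE.fullmatch(token):
            return '"{}"'.format(token)
        return token

    return " ".join(wrap(token) for token in search_text.split())


print(quote_emails('jane jane.doe@example.com'))
```

Name searches pass through unchanged, so they keep their broader matching, while anything email-like is sent as a quoted phrase.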

ianfitzpatrick, Apr 25 '18 20:04

Can I have a pull request to exclude that character? Sounds like a worthy bug fix.

etianen, May 17 '18 16:05

(Sorry I took so long to reply, I've been snowed under at work)

etianen, May 17 '18 16:05

Sure thing, I'll try and get something to you next week.

ianfitzpatrick, May 17 '18 17:05