sentry-python icon indicating copy to clipboard operation
sentry-python copied to clipboard

'User' object has no attribute 'is_verified'

Open omerfarukabaci opened this issue 1 year ago • 23 comments

Environment

SaaS (https://sentry.io/)

Steps to Reproduce

Versions:

  • Django: 4.2.15
  • uwsgi: 2.0.24
  • sentry-sdk: 2.13.0
  • django-otp: 1.5.1
  1. Setup django-otp: https://django-otp-official.readthedocs.io/en/stable/overview.html#installation

  2. Use django_otp.admin.OTPAdminSite as custom admin site.

    # urls.py
    from django_otp.admin import OTPAdminSite
    
    admin_site = OTPAdminSite()
    urlpatterns = [path("admin/, admin_site.urls),]
    
  3. Enable profiling by setting profiles_sample_rate in sentry_sdk.init.

  4. When the uwsgi server starts, the first request to the admin fails with the error:

    'User' object has no attribute 'is_verified'

  5. If you remove the profiles_sample_rate from sentry_sdk.init, the error goes away.

Expected Result

Profiling should not change the behaviour of middleware so I would expect django_otp.middleware.OTPMiddleware to set the is_verified attribute on the User object successfully.

Actual Result

django_otp.middleware.OTPMiddleware somehow cannot set User.is_verified for the first request in the admin. Consequent calls works fine, but only the initial request to admin fails.

Product Area

Profiling

Link

No response

DSN

No response

Version

No response

omerfarukabaci avatar Aug 26 '24 09:08 omerfarukabaci

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] avatar Aug 26 '24 09:08 getsantry[bot]

@omerfarukabaci, I am having trouble reproducing your issue from the information you have provided. Could you please provide your source code (if it is public), or create a minimal reproduction that you can share with me?

szokeasaurusrex avatar Aug 26 '24 13:08 szokeasaurusrex

Hey @szokeasaurusrex, thank you for your quick answer! Unfortunately our source code is not public. I have tried to create a minimal reproduction but couldn't manage to do that, since the issue is flaky and does not happen all the time. But I have some events on Sentry, would that work? In the traces of those events, it can be seen that OTPMiddleware has been called but is_verified attribute is not set.

omerfarukabaci avatar Aug 27 '24 08:08 omerfarukabaci

Hey @omerfarukabaci, any additional info would help, even just the full stack trace. Sentry events too -- feel free to put the links to them here or if you prefer, send them to me on Discord (sentrivana) or to my email (ivana.kellyer @ sentry.io). Thanks!

sentrivana avatar Aug 27 '24 10:08 sentrivana

Hey @sentrivana, I have sent Sentry event links to your email. There is one other issue that started appearing after enabling profiling, I am not 100% sure that it is directly related with profiling but seems like it, also added link to that to the email just in case.

If you need further assistance on my side please let me know, I can override the OTPAdminSite and retrieve extra information in case of this exception. I can also provide uwsgi configuration file if that helps.

Thanks a lot for your prompt responses! 🚀

omerfarukabaci avatar Aug 27 '24 10:08 omerfarukabaci

Thanks a lot for your prompt responses! 🚀

Sure thing, that's why we're here :) Thanks for the quick follow ups!

I have sent Sentry event links to your email.

So I've taken a look at the Sentry events and while the stacktraces provide some extra context, I'm still not sure what's going on, especially why profiling specifically would be causing this.

The facts we have so far:

  • django-otp patches the User object here, adding is_verified on it
  • Django tries to read is_verified here with _wrapped being a User instance and fails, indicating the middleware has not run or the user object was somehow overwritten after the fact
  • The error only happens on the first request
  • Turning off profiling gets rid of the issue
    • The profiler itself doesn't do any additional patching so it shouldn't be the case that it's overwriting the django-otp patch
    • The only thing I can think of why the profiler being on would matter is that it takes a snapshot of the current frame stack periodically -- maybe this leads to something being prematurely evaluated that shouldn't be? (This is kinda far fetched)
  • Setting the profiler aside, our Django integration does an extra call to User.is_authenticated here. I'm wondering if this has anything to do with the issue.

Some follow-ups:

  • Just to cover the bases, django-otp is properly in your INSTALLED_APPS and more importantly MIDDLEWARES, right?
  • Could you show me your sentry_sdk.init?

There is one other issue that started appearing after enabling profiling, I am not 100% sure that it is directly related with profiling but seems like it, also added link to that to the email just in case.

I had a quick look but don't have many insights -- if you keep seeing this and it indeed looks like a profiler/SDK issue, can you please open a new issue just with that? Makes it easier for us to keep track. :)

sentrivana avatar Aug 27 '24 13:08 sentrivana

@sentrivana Thanks for the detailed response!

Some comments about the facts:

The error only happens on the first request

Now I have realized that this is not the case, sorry for the initial misinformation. To test this I have opened up our QA and Sandbox admin pages and refreshed them tens of times, and eventually I could see the same error again, so it doesn't only happen on the first request but happens occassionally.

Turning off profiling gets rid of the issue

I have checked the deployments just before encountering this issue for the first time, and the issue is introduced after a deployment that includes:

  • Upgrading sentry-sdk from 1.40.6 to 2.7.1
  • Setting profiles_sample_rate=0.3

Then the issue disappeared after I have removed profiles_sample_rate along with some other changes:

  • Setting py-call-uwsgi-fork-hooks=true in uwsgi conf (enable-threads was already true).
  • Upgrading sentry-sdk from 2.7.1 to 2.13.0

Then one week after not seeing the issue, I have enabled profiling again, then immediately we have started seeing this issue again, which is why I strongly suspect that this is related with profiling.

The only thing I can think of why the profiler being on would matter is that it takes a snapshot of the current frame stack periodically -- maybe this leads to something being prematurely evaluated that shouldn't be? (This is kinda far fetched)

Is this sampling being executed in another thread? If you look at the traces of the issues hapenning in the same page with ids:

  • 9c4203a666d24a848f1980f236a48671
  • 4fd9a500f7a749109e9fbaea88e42232

You can see that the error is happening at different times. Also the span that raises the error seems like: event.django - sentry_sdk.integrations.django._got_request_exception, which is something I don't see in other issues, it is normally a span related with our code that raises the error. Can there be a race condition during the evaluation of the lazy request.user object between Sentry and django-otp? What I mean is that, could this line be somehow overriding the request.user that has been patched by django-otp, maybe after the evaluation of request.user required by profiling?

Follow-ups

Just to cover the bases, django-otp is properly in your INSTALLED_APPS and more importantly MIDDLEWARES, right?

Yes, copy & pasting from the settings source code:

INSTALLED_APPS = [
    ...,
    "django_otp",
    "django_otp.plugins.otp_static",
    "django_otp.plugins.otp_totp",
]

MIDDLEWARE = [
    m
    for m in [
        ...
        ("django_otp.middleware.OTPMiddleware" if ROLE == "admin" else None),
    ]
    if m
]

I thought that maybe we somehow cannot set or change the ROLE but you can see that OTPMiddleware is being called in the trace, so it shouldn't be the case.

Could you show me your sentry_sdk.init

Copy & pasting the whole sentry config from the settings source code:

import os
import re

SENTRY_DSN = os.environ.get("SENTRY_DSN")
SENTRY_ENV = os.environ.get("SENTRY_ENV")

if SENTRY_DSN:
    if not SENTRY_ENV:
        raise RuntimeError(
            "SENTRY_ENV env var needs to be defined (use role name, such as 'admin' or 'buyer_console')"
        )

    import sentry_sdk
    from sentry_sdk.integrations.django import DjangoIntegration

    def traces_sampler(sampling_context):
        path = sampling_context.get("wsgi_environ", {}).get("PATH_INFO")
        method = sampling_context.get("wsgi_environ", {}).get("REQUEST_METHOD")
        if path == "/favicon.ico":
            return 0
        if path.startswith("/static/"):
            return 0

        if path == "/health/":
            return 0.001

        if re.match(r"/v1\/api\/some-path\/\S+\/$", path) and method == "GET":
            return 0.001
        if path == "/v1/api/token/":
            return 0.001

        return 1

    sentry_sdk.init(
        dsn=SENTRY_DSN,
        integrations=[DjangoIntegration()],
        send_default_pii=True,
        traces_sampler=traces_sampler,
        profiles_sample_rate=1.0,
        environment=SENTRY_ENV,
    )

omerfarukabaci avatar Aug 27 '24 15:08 omerfarukabaci

thanks @omerfarukabaci, just to rule another possibility out first, you are setting the Authentication middleware like this too before the OTP one right? https://stackoverflow.com/a/77109153

# "settings.py"

MIDDLEWARE = [
    ...
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django_otp.middleware.OTPMiddleware' # Here
]

sl0thentr0py avatar Aug 28 '24 14:08 sl0thentr0py

Heyt @sl0thentr0py, you're welcome! Yes, it is set as following:

MIDDLEWARE = [
        ...
        "django.contrib.auth.middleware.AuthenticationMiddleware",
        ("django_otp.middleware.OTPMiddleware" if ROLE == "admin" else None),
]

omerfarukabaci avatar Aug 29 '24 08:08 omerfarukabaci

Thanks for helping us investigate, this is very much appreciated! ❤

Is this sampling being executed in another thread? If you look at the traces of the issues hapenning in the same page with ids:

9c4203a666d24a848f1980f236a48671
4fd9a500f7a749109e9fbaea88e42232

You can see that the error is happening at different times. Also the span that raises the error seems like: event.django - sentry_sdk.integrations.django._got_request_exception, which is something I don't see in other issues, it is normally a span related with our code that raises the error. Can there be a race condition during the evaluation of the lazy request.user object between Sentry and django-otp? What I mean is that, could this line be somehow overriding the request.user that has been patched by django-otp, maybe after the evaluation of request.user required by profiling?

This is sort of the direction I was thinking. One thing to test whether this has anything to do with the SDK accessing the user is to set send_default_pii=False for a bit and enabling the profiler. Setting send_default_pii to False should disable the code path that leads to us capturing request.user. Could you try that out and let us know if the error still happens?

sentrivana avatar Sep 02 '24 08:09 sentrivana

Thanks for helping us investigate, this is very much appreciated! ❤

Is this sampling being executed in another thread? If you look at the traces of the issues hapenning in the same page with ids:

9c4203a666d24a848f1980f236a48671
4fd9a500f7a749109e9fbaea88e42232

You can see that the error is happening at different times. Also the span that raises the error seems like: event.django - sentry_sdk.integrations.django._got_request_exception, which is something I don't see in other issues, it is normally a span related with our code that raises the error. Can there be a race condition during the evaluation of the lazy request.user object between Sentry and django-otp? What I mean is that, could this line be somehow overriding the request.user that has been patched by django-otp, maybe after the evaluation of request.user required by profiling?

This is sort of the direction I was thinking. One thing to test whether this has anything to do with the SDK accessing the user is to set send_default_pii=False for a bit and enabling the profiler. Setting send_default_pii to False should disable the code path that leads to us capturing request.user. Could you try that out and let us know if the error still happens?

You're welcome, it is my pleasure, and also thank you for your concern! ❤️ Unfortunately I am on vacation for 3 weeks, I will try that and let you know when I am back. 🚀

omerfarukabaci avatar Sep 02 '24 08:09 omerfarukabaci

Hey @sentrivana, @sl0thentr0py

I am finally back. 🙂 My colleagues changed send_default_pii to False, but unfortunately the issue still persists. Anything else I can do to investigate this further?

omerfarukabaci avatar Sep 23 '24 15:09 omerfarukabaci

Hey @omerfarukabaci, welcome back 👋🏻

So much for my hypothesis 😢 So it seems like it doesn't have anything to do with the SDK accessing the user data (at least not the _set_user_info part) since that code path shouldn't be triggered at all if PII sending is False.

Based on how inconsistently it happens it indeed looks like a possible race condition. I understand that it's super fickle to repro but I don't have any ideas how to debug this further without having an example app that runs into the issue.

sentrivana avatar Sep 24 '24 13:09 sentrivana

Hey @sentrivana, thank you! 👋🏼

Since this issue happens everyday on our QA environment, we are actually able to reproduce it. If you could direct me about what else you need to know I can update the code on the QA to gather more information. I can also fork SDK to manipulate its source code to learn more about the problem. Trying to reproduce it in a separate app is also possible, but it requires more work since we don't have much idea about the actual reason.

omerfarukabaci avatar Sep 26 '24 11:09 omerfarukabaci

So the problem as far as I can tell is either that request.user is modified/overwritten after the OTP middleware has patched it, or the middleware doesn't even run (not that likely).

What we could look at, during a request when the error happens:

  • Can we verify that the django-otp middleware has actually run and patched the user?
  • Can we pinpoint when the user object has changed, or at least narrow down points in time when it looks correct vs. when it's already wrong? (E.g. check what it looks like in the profiler before and after it's taken a sample since this is our most likely culprit at the moment.)

What would also be interesting would be comparing this to a working request.

sentrivana avatar Sep 27 '24 09:09 sentrivana

Hey @sentrivana,

What do you think about the following middleware that monkey-patches request object's getter and setter, and logs the traceback if user attribute has been changed or accessed:

import logging
import traceback

LOGGER = logging.getLogger(__name__)


class CustomRequestMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def log_traceback_for_user_action(self, action, attr):
        if attr == "user":
            LOGGER.info(f"[NDV-1622] request.user is being {action} by:")
            for frame in traceback.extract_stack():
                LOGGER.info(f"\t[NDV-1622] {frame.filename}:{frame.lineno} {frame.name}")

    def __call__(self, request):
        def custom_setattr(obj, attr, value):
            self.log_traceback_for_user_action("set", attr)
            return object.__setattr__(obj, attr, value)

        def custom_getattr(obj, attr):
            self.log_traceback_for_user_action("get", attr)
            return object.__getattribute__(obj, attr)

        request.__class__.__setattr__ = custom_setattr
        request.__class__.__getattribute__ = custom_getattr

        return self.get_response(request)

I am planning to add this as the first middleware, and deploy to QA and try to reproduce the issue. Then I will look for any sign of sentry-sdk. I know it is a bit tricky but I think that could work, also I am kinda out of options.

omerfarukabaci avatar Oct 01 '24 16:10 omerfarukabaci

@omerfarukabaci Nice! Yeah I think that could give us a lot of insight, let's try it.

sentrivana avatar Oct 02 '24 14:10 sentrivana

Hey @sentrivana,

Finally, I found some time to deploy the middleware above to QA. I have sent two log files for both a successful and a failed request to your email address ([email protected]). Unfortunately it still doesn't make any sense to me. 😕 I hope it somehow gives you some insight.

If you could think of anything else I can do, please let me know.

omerfarukabaci avatar Oct 07 '24 15:10 omerfarukabaci

Thanks @omerfarukabaci, I've had a cursory look. Nothing jumps out at me right off the bat, can't find any substantial difference between a successful request and a failed one, but will diff this more properly in the next couple of days and report back.

sentrivana avatar Oct 08 '24 13:10 sentrivana

Thank you @sentrivana, waiting for your response! 🚀

omerfarukabaci avatar Oct 08 '24 15:10 omerfarukabaci

I took a look at the outputs and here are some more observations. I'm still not sure it's a sentry problem but it could be that just pausing every 10 ms for collecting profiler samples could be making another race condition between django and django-otp trigger more often.

  • this getattr in _set_user_info is completely missing from both file outputs so I'm like 90% sure this is not about Sentry's django integration and that user fetching code.
  • we need to find out when this _verify_user does not get called since this is the lazy code that actually sets the is_verified partial.

also after reading this code, imo this partial is_verified is kinda bad design since it is a method that is set lazily so at least they should check for its presence when calling it or not set it lazily.

sl0thentr0py avatar Oct 09 '24 14:10 sl0thentr0py

Hey @sl0thentr0py, thank you for your insights! I totall agree with your points.

I was wondering that, to make sure it is (or it isn't) related with Sentry profiling, can I manually call the code that collects profiler samples? What I am planning to do is to call that function to cause the error happen all the time instead of random. I can also use a fork of the SDK to manipulate the source code, although I need a bit of help here. What do you think?

omerfarukabaci avatar Oct 09 '24 15:10 omerfarukabaci

Hey @omerfarukabaci

I just read the whole thread and I would like to help.

I was wondering that, to make sure it is (or it isn't) related with Sentry profiling, can I manually call the code that collects profiler samples? What I am planning to do is to call that function to cause the error happen all the time instead of random. I can also use a fork of the SDK to manipulate the source code, although I need a bit of help here. What do you think?

There is global var _scheduler that has an object of type ThreadScheduler (or GeventScheduler depending on your setup). You can call _scheduler.sampler() which should take a sample of the stack. But this function is called on a regular basis when the profiler is enabled, so I am not sure if this helps.

Other things that came to my mind:

If the problem is that the is_verified is not there you could check:

  • Is there any other middleware that messes with the user object?
  • Do you have a field is_verified in your User model that clashes with the is_verified property of django-otp? (maybe user is saved with is_verified=None and thus the is_verified partial is deleted?)

antonpirker avatar Oct 10 '24 07:10 antonpirker

Hello @antonpirker, sorry for my late response, things have been hectic for a while.

Do you have a field is_verified in your User model that clashes with the is_verified property of django-otp? (maybe user is saved with is_verified=None and thus the is_verified partial is deleted?)

No, neither a model field nor another attribute is_verified defined in the User model.

Is there any other middleware that messes with the user object?

Well we have some middleware that uses some global variables, and one gets request.user, but it is after the OTPMiddleware is executed. I have also tried to remove this middleware but the issue still persists.

I have also tried manually calling _scheduler.sampler() with a custom middleware before and after OTPMiddleware but no chance, couldn't reproduce the issue.


I might have something new. After investigating sentry_sdk source code for profiling I have realized the sampler (used within _scheduler.sampler()) is calling extract_stack. This function utilizes an LRUCache, and caches the frame object, which includes f_locals and f_globals dictionaries. Considering threads are using shared memory, while profiling can Sentry's thread somehow be overriding request.user of the application thread with the cached request.user which is not manipulated by OTPMiddleware yet?

I also wonder that if sentry-sdk uses threading by default, or is it only for profiling?

omerfarukabaci avatar Oct 24 '24 11:10 omerfarukabaci

cc @Zylphrex can the cached frame somehow cause state race conditions in weird scenarios?

sl0thentr0py avatar Oct 24 '24 12:10 sl0thentr0py

To the best of my knowledge, no. This is because we only cache the extracted attributes of a frame meaning the actual frame object itself is not stored between sampling.

One idea that might shed some light on this issue is to inject a middleware

  1. immediately after django_otp.middleware.OTPMiddleware
  2. at the end of the list of middleware

to check if is_verified exists on the user object

Perhaps something along the lines of

import logging


LOGGER = logging.getLogger(__name__)


class LogIsVerifiedMiddleware:
    def __init__(self, get_response=None):
        self.get_response = get_response

    def __call__(self, request):
        user = getattr(request, 'user', None)

        if user is None:
            LOGGER.info("No user found")
        else:
            has_is_verified = hasattr(user, "is_verified")
            if has_is_verified:
                LOGGER.info("User has is_verified")
            else:
                LOGGER.info("User does not have is_verified")

        return self.get_response(request)

This will help confirm at which point the method disappears or if it was ever there.

Zylphrex avatar Oct 24 '24 18:10 Zylphrex

Hey @Zylphrex, thank you for your answer! 🚀

The middleware you have provided will cause lazy request.user object to be evaluated, so I am not sure if it would work as expected or not. But I am kinda sure that it is there after all pre-response middleware hooks are called. Below is the call stack for the error:

django/utils/functional.py in inner at line 268

django_otp/admin.py in has_permission at line 69

django/contrib/admin/sites.py in each_context at line 336

django/contrib/admin/sites.py in index at line 555

django/contrib/admin/sites.py in inner at line 242

django/views/decorators/cache.py in _wrapped_view_func at line 62

django/utils/decorators.py in _wrapped_view at line 133

django/contrib/admin/sites.py in wrapper at line 261

django/core/handlers/base.py in _get_response at line 197

django/core/handlers/exception.py in inner at line 55

If you look at django/contrib/admin/sites.py:inner, you can see that before this line is executed, has_permission already called once without any errors. The error occurs when we call has_permission again in this line, which is really weird.

I will again disable profiling for a while to make sure this is really a Sentry issue.

omerfarukabaci avatar Oct 25 '24 09:10 omerfarukabaci

I will again disable profiling for a while to make sure this is really a Sentry issue.

Thanks @omerfarukabaci, let us know once you have more info

szokeasaurusrex avatar Oct 28 '24 09:10 szokeasaurusrex

Hey @szokeasaurusrex,

Sentry profiling has been disabled for the last 4 days, and the issue has gone away again. We have one other issue that appeared after enabling profiling, I didn't mention it previously because I wasn't sure that if it is related with Sentry or not, but now looking at the stack trace in detail, it is stemming from the Sentry source code for profiling, and looks like it has something to do with lazy User object:

Image

At the end this causes an error in MySQL:

Image

Looking at the above stack trace, I am pretty sure that Sentry profiling is evaluating the lazy User object at some point. The issue ID is 5967530499, if this is not enough just let me know, I can provide you the full link.

Sorry for not bringing up this issue previously, I didn't see (or forgot) that it is from a Sentry thread. 🙂 🚀

omerfarukabaci avatar Oct 29 '24 13:10 omerfarukabaci

@omerfarukabaci The link would be helpful :)

szokeasaurusrex avatar Oct 29 '24 14:10 szokeasaurusrex