readthedocs.org Template errors result in long development/production response times

trafficstars

When a template that was included via {% include %} raises an exception, the request response time explodes. I originally noticed this in local development, but I believe that in https://github.com/readthedocs/readthedocs.org/issues/11061 we might have confirmed that this is a bug in production as well (so far maybe only affecting the beta instance).

In local development, I noticed:

Postgres container eats up 100% CPU doing what looks like repeated selects on the project table
The browser takes 60s+ to respond, or times out
The Docker logs for web container even take about 30-60s to output the stack trace

There is nothing special about my set up though, and my database is 27M.

In debugging https://github.com/readthedocs/readthedocs.org/issues/11061 I noticed:

The database query takes 20 times longer to execute
The response time is over 10s, but is not all DB time
Profiling shows most of the time spent in logging and related modules
I was not able to confirm where code time was spent in production transactions

To reproduce:

Open a template that is included in another template via {% include %}
Throw a template exception from the included template
- Remove load i18n from the template but add/retain a translated string
- Access a missing attribute, ie project.foo.bar
Load the view that would trigger the template
Notice long response time and long query time in django debug toolbar

Dec 14 '23 03:12 agjohnson

@humitos Have you noticed this at all?

Dec 14 '23 03:12 agjohnson

Yes, I've experienced this as well. My guess is because how we are formatting the exceptions in the console using structlog. I've seen the console outputting a lot of text due that all the traceback is formatted with nice colors, all the context variables and more. I found myself this pretty useful, but sometimes pretty distracting and annoying to understand the error... There is a trade-off there.

Dec 14 '23 17:12 humitos

@agjohnson are you able to reproduce the issue after commenting this line? https://github.com/readthedocs/readthedocs.org/blob/main/readthedocs/settings/base.py#L866

Dec 14 '23 17:12 humitos

Okay, well at least I'm not the only one :sweat_smile:

I found myself this pretty useful, but sometimes pretty distracting and annoying to understand the error

I really like the formatting and especially the additional context/locals information, but the stack trace is way to long to be useful usually, at least when working with templates. It's hard to find something useful in the stack trace when the exception starts in the template rendering (which was also always sort of the case with Django though).

are you able to reproduce the issue after commenting this line?

I am, though without the structlog rendering of the stack trace, the default stack trace isn't as delayed. But it still takes a long time for a response and trace, and the database is still getting hit hard with queries.

Edit: actually, the structlog change does seem to help some too, I get 504 responses very frequently instead without that change. I think the culprit is probably the DB CPU ultimately though.

Dec 14 '23 18:12 agjohnson

The database seems to be hit a project query repeatedly:

If I fix the template error and reload, this query is executed, but only 2 times and response time is ~1ms

Dec 14 '23 18:12 agjohnson

Ah I learned a new thing about debug toolbar! You can inspect/see the code and stack around the query in the debug toolbar SQL pane. This is a clue. When working properly, the query shows the template code that calls the query. When there is an exception raised, this is the stack around the query being executed repeatedly:

I disabled cacheops and the same thing happens, so it's not that layer. So seems next candidate is Django debug output?

Edit: disabling DEBUG does not fix this either. Now I'm really confused.

Dec 14 '23 18:12 agjohnson

Yet another clue:

If I edit a top level template (projects/project_dashboard.html) and remove load i18n, the response is very quick
If I edit an include from this top level file (projects/partials/project_list.html) and remove load i18n, I notice the bug and the database is hit with looped queries.

Dec 14 '23 18:12 agjohnson

I found this discussion, and hoped it was related, but so far nothing there seems to help:

https://forum.djangoproject.com/t/changing-queryset-repr-to-avoid-unexpected-queries/21272/22?page=2

I've disabled debug toolbar, logging, and debug mode entirely. This must be something related to include or local vars there.

Dec 14 '23 22:12 agjohnson

I believe https://github.com/readthedocs/readthedocs.org/issues/11061 is replicating this behavior in production, but waiting to confirm. If confirmed, we should probably get a fix for this issue on the roadmap.

Jan 31 '24 04:01 agjohnson

Noted here:

https://github.com/readthedocs/readthedocs.org/issues/11061#issuecomment-1918352645

The PR fixing 11061 confirms (to me) that this bug is affecting production as well. At very least the beta instance is affected by the bug. I haven't tested the current dashboard in production yet.

I'll update the description here, but the bug is that a template that is evaluated via {% include %} and throws a template exception will cause an abnormal response time, and will time out in production.

I've escalated this to a bug on our roadmap. Priority is medium -- the effect is really bad but it's only triggered when templates trigger exceptions.

Feb 01 '24 01:02 agjohnson

I'm not sure if it's related or not, but I think it's worth to note that we are using cached.Loader in production but not on development: https://github.com/readthedocs/readthedocs.org/blob/505ee4c3030f00bac16068ddec6b817adc8e7a54/readthedocs/settings/base.py#L397-L411

Feb 01 '24 09:02 humitos

To reproduce: * Open a template that is included in another template via {% include %} * Throw a template exception from the included template * Remove load i18n from the template but add/retain a translated string * Access a missing attribute, ie project.foo.bar * Load the view that would trigger the template * Notice long response time and long query time in django debug toolbar

I tried this by removing the load i18n from projects/partials/header.html (ext-theme) and I got a ~2.5s response: Elapsed time 2684.479 msec. It raised TemplateSyntaxError: Invalid filter: 'language_name_local'. Note that I'm --webpack --ext-theme.

Without using --webpack --ext-theme and removing the load i18n from core/project_bar.html I got a ~2.5s response as well: Elapsed time 2364.131 msec. It raised TemplateSyntaxError: Invalid block tag on line 6: 'trans', expected 'elif', 'else' or 'endif'. Did you forget to register or load this tag?

It seems I'm not able to reproduce this issue. Maybe there is something else going on here. I'm using just origin/main without any modification.

Feb 26 '24 15:02 humitos

Your results sounds about right for reproducing locally. What does the page response time look like without errors?

In production, as the database is larger, response times from this are closer to 30-60s. Locally it's more like 2-4s, for me.

I noted a template to test here I doubt it will give different response times but probably worth testing the same thing.

Also, this is less just about the response time. Noted above, but the the underlying issue seems to be a loop in Django logic and in SQL queries. You can see this on DDT:

The SQL query seems to execute over and over, I see this locally and in production

Feb 26 '24 17:02 agjohnson

I performed the test you mentioned and I have Elapsed time 4808.825 msec and default 1477.74 ms (217 queries including 212 similar and 212 duplicates )

However, without a template raising an exception: Elapsed time 742.940 msec and default 86.98 ms (101 queries including 94 similar and 62 duplicates )

Feb 27 '24 12:02 humitos

I found the issue. It's what I guessed it could be in https://github.com/readthedocs/readthedocs.org/issues/10954#issuecomment-1856227684

Disabling the "nice logging output", I get a similar CPU times and only 12 SQL queries more when it returns 500 (I suppose these are performed to show the traceback in the Django 500 debug page).

I pushed this changes to https://github.com/readthedocs/readthedocs.org/commit/592c108af4cf7678421c00b02692081bc06a020c to show how to disable this nice logger.

Feb 27 '24 12:02 humitos

Nice! I had tried a similar fix that you suggested (here though the permalink looks off), but it didn't help much. Maybe your fix above affects things at a deeper level.

We can test this rather easily. On web-extra we can replicate #11061 or something similar quite easily. With your fix, I would at least expect the response time to be better.

Feb 27 '24 16:02 agjohnson

On web-extra we can replicate #11061 or something similar quite easily

Production uses a completely different set of loggers, tho. Mainly syslog, newrelic and sentry (see https://github.com/readthedocs/readthedocs-ops/blob/4872584858e229cdb33c57dd7381f9e1e1563746/salt/base/readthedocs/settings/base.py#L522-L632) --none of those are used on development.

However, they may be being affected by the same issue. Also note that this could be a structlog or django-structlog bug. We are not using the latest versions of them: https://github.com/readthedocs/readthedocs.org/blob/8b66ce9bb035c973c1e59d1c62bd91da826b6f71/requirements/pip.in#L164-L174

We may want to read their changelogs and give it another try to upgrade those packages before going deeper and deeper on debugging this. It may have fixed already.

Related https://github.com/readthedocs/readthedocs.org/issues/10933

Feb 28 '24 12:02 humitos

We may want to read their changelogs and give it another try to upgrade those packages before going deeper and deeper on debugging this. It may have fixed already.

I created an initial PR for this change at #11166 -- we can use it to do some local QA and see if it works better.

Feb 28 '24 13:02 humitos

Production uses a completely different set of loggers, tho. Mainly syslog, newrelic and sentry (see readthedocs/readthedocs-ops@4872584/salt/base/readthedocs/settings/base.py#L522-L632) --none of those are used on development.

Thinking a little more about this, I think all those SQL queries are executed by the sentry logger propbably, since it shows the exact same traceback with context variables (Stack trace: https://read-the-docs.sentry.io/issues/5019056329/?project=148442&query=is%3Aunresolved+%21message%3A%22%21message%3A%22%21message%3A%22SystemExit%22+%21message%3A%22frame-ancestors%22&referrer=issue-stream&statsPeriod=14d&stream_index=21#exception) that the Django 500 debug page. So, I don't think we will be able to remove those queries from here unless we remove that feature from Sentry logging.

Feb 28 '24 13:02 humitos

This can be tested further on web-extra with a template change and dropping the settings for sentry et al

Feb 28 '24 17:02 agjohnson

I tried removing sentry and forcing the load i18n issue on the template, but I had problems testing this because the ext-theme is installed using pip+git and it's not installing in an editable way. I attempt to uninstall it and re-installing it, but I broke something and the instance started using the old dashboard 🙃 .

I re-deployed this instance and gave it another try and I failed again. I may leave this for you @agjohnson in case you know how to test this in production.

I edited this file lib/python3.10/site-packages/readthedocsext/theme/templates/projects/partials/project_list.html on util01.org
I removed the sentry logging from readthedocs/settings/managed.py
supervisorctl restart all

I'm getting

Bad gateway Error code 502
Visit cloudflare.com for more information

Reverting the changes in the template, the instance works again.

Mar 04 '24 16:03 humitos

I attempt to uninstall it and re-installing it, but I broke something and the instance started using the old dashboard

Yeah, I've encountered this, the package is not installed as editable. All that should be required is:

% pip uninstall readthedocsext-theme
% pip install -e ~/checkouts/ext-theme/
% # edit templates
% supervisorctl restart all

I think that should be enough to unblock you, so I'll leave you with assignment. But let me know if I can point you in the right direction if you're stuck.

Mar 04 '24 18:03 agjohnson

Yeah, that's what I did and it didn't work.

Mar 04 '24 19:03 humitos

I just tested this and it worked. :shrug:

I was able to edit the base template and the changes showed in the beta instance.

Mar 04 '24 19:03 agjohnson

I may not be sure what I'm looking for here, then. With "it didn't work" I meant that I still got Bad gateway Error code 502, but maybe that's expected since there was an error in the application?

So, let's rewind a little bit here and assume that the steps from https://github.com/readthedocs/readthedocs.org/issues/10954#issuecomment-1976941744 are correct, settings and templates changes are picked correctly. Then, what I should do? or what I should be looking at?

Mar 04 '24 21:03 humitos

I may not be sure what I'm looking for here, then. With "it didn't work" I meant that I still got Bad gateway Error code 502

I think I'm just confused by your reply. I understood "it didn't work" to mean that the application was still not starting or the new dashboard was not used. Do you have a working instance with the new dashboard when you make the change I posted above?

As long as the 502 response is a template error in the application logs, then it sounds like you have replicated the issue, no? I would also expect a very long response time before eventual timeout and Cloudflare returning the 502.

Have you had look into the issue I noted above?

#11061

That issue has the same response you are describing. The underlying cause should be an excessive response time.

I edited this file lib/python3.10/site-packages/readthedocsext/theme/templates/projects/partials/project_list.html on util01.org

Why util01.org? I assume this is incorrect and you're testing this on ext or ext-theme, correct?

I removed the sentry logging from readthedocs/settings/managed.py

I'm not quite following your updates here, but it seems to me that you have replicated the issue I've described here and removing sentry just did not fix the issue -- the template exception is still resulting in a long response time and timeout at Cloudflare.

Does this not match what you are noticing?

Mar 04 '24 22:03 agjohnson

I understood "it didn't work" to mean that the application was still not starting or the new dashboard was not used.

Oh, I see. The instance was starting and using the new dashboard properly. It was just giving 502, tho.

Do you have a working instance with the new dashboard when you make the change I posted above?

No, I was doing these changes on the only instance we have in the web-ext-theme ASG.

Why util01.org? I assume this is incorrect and you're testing this on ext or ext-theme, correct?

Yes, this was a "brain typo" 😄 . I'm testing on ext-theme.

Mar 05 '24 11:03 humitos

Settings diff I'm using:

docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/readthedocs.org$ diff -u readthedocs/settings/managed.py.orig readthedocs/settings/managed.py
--- readthedocs/settings/managed.py.orig	2024-03-05 11:12:44.246182303 +0000
+++ readthedocs/settings/managed.py	2024-03-05 11:13:21.966164143 +0000
@@ -443,33 +443,6 @@
             'formatter': 'newrelic',
         }
 
-        # Sentry logging
-        logging['formatters']['sentry'] ={
-            "()": structlog.stdlib.ProcessorFormatter,
-            "processors": [
-                structlog.stdlib.ProcessorFormatter.remove_processors_meta,
-                SentryProcessor(
-                    event_level=ERROR,
-                    as_context=False,
-                    tag_keys="__all__",
-                    ignore_loggers=[
-                        'django_structlog.celery.receivers',
-                        'django_structlog.middlewares.request',
-                        'django.security.DisallowedHost',
-                        'celery.worker.strategy',
-                        'celery.worker.consumer.mingle',
-                        'celery.worker.consumer.connection',
-                    ],
-                ),
-            ],
-            "foreign_pre_chain": shared_processors,
-         }
-        logging['handlers']['sentry'] = {
-            'level': 'ERROR',
-            'class': 'logging.handlers.SysLogHandler',
-            'formatter': 'sentry',
-        }
-
         logging['handlers']['syslog'] = {
             'level': 'INFO',
             'class': 'logging.handlers.SysLogHandler',
@@ -481,7 +454,7 @@
         }
         logging['loggers'] = {
             '': {
-                'handlers': ['console', 'syslog', 'sentry', 'newrelic'],
+                'handlers': ['console', 'syslog', 'newrelic'],
                 'level': 'DEBUG',
             },
             'django_structlog.celery.receivers': {
docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/readthedocs.org$

Theme diff:

docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/ext-theme$ git diff
WARNING: terminal is not fully functional
Press RETURN to continue 
diff --git a/readthedocsext/theme/templates/projects/partials/project_list.html b/readthedocsext/theme/templates/projects/partials/project_list.html
index 86a22f3..0e3d716 100644
--- a/readthedocsext/theme/templates/projects/partials/project_list.html
+++ b/readthedocsext/theme/templates/projects/partials/project_list.html
@@ -1,6 +1,5 @@
 {% extends "includes/crud/table_list.html" %}
 
-{% load i18n %}
 {% load humanize %}
 
 {% load privacy_tags %}
docs@web-ext-theme-i-0ae3319726327bdbb(org):~/checkouts/ext-theme$

With this configuration and running supervisorctl restart all I'm getting a 502. However, I'm not seeing anything in the logs under /var/log/readthedocs/*. Note that I'm able to access https://beta.readthedocs.org/support/ for example, which is a page that doesn't depend on the project_list.html file --so, the instance is up and running 👍🏼

Mar 05 '24 11:03 humitos

I left only console logging handler in production and I'm still receiving the 502. So, this may not be related to logging as I originally suggested/thought 🤷🏼 . I disabled logging completely with https://docs.djangoproject.com/en/5.0/ref/settings/#std-setting-LOGGING_CONFIG and still getting 502.

I ran out of ideas here 😕 . Something else may be happening here.

Mar 05 '24 11:03 humitos

Just to clarify, when you say 502, are you noticing a long response time too?

I think I debugged the build list issue with pyinstrument in production, or if not, it might be worth a shot. It's easy to plug in to Django, though I forget the exact steps.

Mar 05 '24 21:03 agjohnson

readthedocs.org readthedocs.org copied to clipboard

Template errors result in long development/production response times

readthedocs.org
readthedocs.org copied to clipboard