Intermittent database errors in tests after updating to 4.2
In my main project I have two websocket consumers which performs a lot of database-related actions.
With channels==4.1.0 my unittests with django were proceeding smoothly without any problem (I have already removed any references to database_sync_to_async since this comment, when I switched to async ORM methods).
After I updated to channels==4.2.0 it started randomly (see later) throwing database errors after the async tests are executed:
django.db.utils.OperationalError: the connection is closed
I use a multiprocess testing and I get errors only in the process which is executing consumer-related tests only after those are exeuted, due to how tests are picked up by workers and how the interpreter/host handles threads this error may not occurr.
Hunting down into channels' history I found that in https://github.com/django/channels/pull/2101 it was introduced a work-around to prevent database_sync_to_async to close old connections by mocking close_old_connections calls in the communicator.
Shortly after, in https://github.com/django/channels/pull/2090 the consumers handle/disconnect got an explicit call to aclose_old_connections in their workflow.
My doubt is that there is an execution path in which the mock performed in the communicator is not effective against the db-closure in the consumer, but I need to get more evidences ti get what is going on and how to fix it (currently I had 1 day of errors on my development machine and now I no longer get this error locally but only on github.actions).
My doubt is that there is an execution path in which the mock performed in the communicator is not effective against the db-closure in the consumer...
Sounds likely from the description. Are you able to reduce it, even to an intermittent failure?
(pytest? Fixtures there have often shown these kind of errors, but I never dug in to see exactly why, since I don't use it myself.)
When I get more times to check on this I will try to introduce a bit of debug logs to check what is going on with connections and threads (currently I do not have enough tools).
I am using pytest with pytest-xdist/pytest-django to run tests on my project while I still use django-based test-cases because I prefer them to test django. pytest, in that project, is only used for its reporting capabilities which are better than django.
Apart from this I belive that the db-cleanup should have been implemented more like in django, using signals and not by putting a direct call to close_old_connections. Having two different way to handle this task in these projects seems odd, expecially because channels relies on django.
I understand that this would mean to drop support for django 4.2, because async signals were introduced in 5.0, maybe this could be done in a 5.0 release of channels?
I understand that this would mean to drop support for django 4.2, because async signals were introduced in 5.0, maybe this could be done in a 5.0 release of channels?
I think we could drop Django 4.2 from the release of 5.2, and maybe even earlier. //cc @bigfootjon
It would be good to pin down exactly what's happening here. close_old_connections shouldn't really be a problem... — the connection handler should give you a new one when you next need it... 🤔
Hi all,
Same issue here. I have a middleware for authentication and an async websocket consumer with database-related message handler to send data over the wire to a client.
However in the call to the database I get a database connection closed error.
I've also removed the database_sync_to_async decorator calls from my code to prevent closed connections but the issue is something other than this.
@francipvb make sure you read this section of the docs: https://channels.readthedocs.io/en/latest/topics/databases.html#database-connections
Particularly the part about calling close_old_connections periodically
Hello,
@francipvb make sure you read this section of the docs: https://channels.readthedocs.io/en/latest/topics/databases.html#database-connections
Particularly the part about calling
close_old_connectionsperiodically
I've read this documentation section and the consumers code directly without any success. The code just fails during tests without any good reason.
The database related code fails even inside a websocket middleware.
Do you have a sample project that reproduces the issue? Or sample code at least? It's hard to debug without the code and reproduction steps
Hello,
I will try to give you a sample project for this case.
Anyway I solved all connection related errors by writing a TransactionTestCase insthead of a TestCase.
Here is a sample project which I used for an other issue with channels and django TestCase.
I have setup a workflow which tests against channels 4.1.0 and 4.2.0, python 3.10-12 and using postgres or sqlite. Here are the results: https://github.com/sevdog/test-websocker-user-lock/actions/runs/14088653132
When I run against postgres I get the error django.db.utils.OperationalError: the connection is closed.
With the current setup I always get the error, while before I was getting it in a more randomic way (dunno it is due to pycopg or django version).
Digging deeper, also adding a bit of stupid print statements (because if I try to use a debugger it will just get errors due to timeout) I have found the following hint:
# class/method threading.get_ident()
ApplicationCommunicator.send_input 133367373362880
ApplicationCommunicator.receive_output 133367373362880
AsyncConsumer.dispatch 133367373362880
aclose_old_connections <function no_op at 0x77c1fb000860> 133367373362880
[print statements for the connection phase, 2 threads execute code]
ApplicationCommunicator.send_input 133367373362880
AsyncConsumer.dispatch 133367373362880
aclose_old_connections <function close_old_connections at 0x77c1fb01d260>> 133367373362880
Even if everithing seems to be kept in the same thread, the first time it is executed the mock is applied while it is not in the second.
Goind deeper I found that the no_op method was the one setup in the receive_output method, because WebsocetCommunicator.connect calls it very shortly after sending data.
I belive that this may be bound to this issue https://github.com/python/cpython/issues/114033 because mock.patch is a sync context manager and thus if used in an async context may not work as intended. An other reason to suspect this is the cause is my previous finding: the patch being applied to close_old_connection is that configured in the receive_output rather than that in send_input whereas the aclose_old_connections is called inside AsyncCommunicator.dispatch which is an input method.
IMO the current unittest.mock is not suitable for this task and the current implementation fails againsta a real database.
Digging deeper, also adding a bit of stupid
# class/method threading.get_ident() ApplicationCommunicator.send_input 133367373362880 ApplicationCommunicator.receive_output 133367373362880 AsyncConsumer.dispatch 133367373362880 aclose_old_connections <function no_op at 0x77c1fb000860> 133367373362880 [print statements for the connection phase, 2 threads execute code] ApplicationCommunicator.send_input 133367373362880 AsyncConsumer.dispatch 133367373362880 aclose_old_connections <function close_old_connections at 0x77c1fb01d260>> 133367373362880Even if everithing seems to be kept in the same thread, the first time it is executed the mock is applied while it is not in the second.
Goind deeper I found that the
no_opmethod was the one setup in thereceive_outputmethod, becauseWebsocetCommunicator.connectcalls it very shortly after sending data.I belive that this may be bound to this issue python/cpython#114033 because
mock.patchis a sync context manager and thus if used in an async context may not work as intended. An other reason to suspect this is the cause is my previous finding: the patch being applied toclose_old_connectionis that configured in thereceive_outputrather than that insend_inputwhereas theaclose_old_connectionsis called insideAsyncCommunicator.dispatchwhich is an input method.IMO the current
unittest.mockis not suitable for this task and the current implementation fails againsta a real database.
Hello,
Is there a PR to fix this already?
Thanks
https://justinmayer.com/posts/any-updates/
Here is just an other side of the issue which IMO highlights how the mock implementation is not the right way to handle this.
Mocking close_old_connections in either send_input/receive_output may fail even if the sync context manager works as expected in an async context. This because either methods just awaits on a queue (which acts as a dummy broker) and not on the methods of the consumer. These are two parallel flows and thus there is no guarantee of the execution order:
Communicator.send_input->Queue.putawait_many_dispatch->Queue.get->Consumer.dispatch
Since everything is handled in threads this may work or not according to how the interpreter and the OS handle sheduling. TL;DR: Consumer.dispatch may be called after Communicator.send_input has returned.
Also we need to remember that a consumer may send a message to an other channel where an other consumer may be waiting and during tests this may cause even more problems.
It would be good to have some kind of semaphore which tells to the channels.db package something like "we are running tests, let the connection status to be handled by test runner".