musicbrainz-docker
Redis connection errors
Hello! I have increased the number of workers for the web server and am running the services through docker-compose while sending a lot of API requests. After some time, I start getting errors like these in the server logs:
musicbrainz_1 | 2021-11-16T13:52:22.714329479Z ...propagated at /root/perl5/lib/perl5/Redis.pm line 613, <PKGFILE> line 1."
musicbrainz_1 | 2021-11-16T13:52:26.429439298Z [error] Caught exception in engine "Could not connect to Redis server at redis:6379: Cannot assign requested address at lib/MusicBrainz/Redis.pm line 24.
musicbrainz_1 | 2021-11-16T13:52:26.429469986Z ...propagated at /root/perl5/lib/perl5/Redis.pm line 613, <PKGFILE> line 1."
musicbrainz_1 | 2021-11-16T13:52:36.652664437Z [error] Caught exception in MusicBrainz::Server::Controller::WS::2::Work->load "Could not connect to Redis server at redis:6379: Cannot assign requested address at /root/perl5/lib/perl5/Redis.pm line 275.
musicbrainz_1 | 2021-11-16T13:47:21.205178216Z [error] Caught exception in MusicBrainz::Server::Controller::WS::2::Recording->load "Could not connect to Redis server at redis:6379: Cannot assign requested address at /root/perl5/lib/perl5/Redis.pm line 275.
musicbrainz_1 | 2021-11-16T13:47:21.205218285Z ...propagated at /root/perl5/lib/perl5/Redis.pm line 613, <PKGFILE> line 1."
musicbrainz_1 | 2021-11-16T13:47:22.142083445Z [error] Caught exception in MusicBrainz::Server::Controller::WS::2::Recording->load "Could not connect to Redis server at redis:6379: Cannot assign requested address at /root/perl5/lib/perl5/Redis.pm line 275.
musicbrainz_1 | 2021-11-16T13:47:22.142122078Z ...propagated at /root/perl5/lib/perl5/Redis.pm line 613, <PKGFILE> line 1."
Any idea/suggestion on how to handle this?
Hi!
It might be that the Redis instance is over-solicited. You may need to get your hands dirty tuning the configuration of your Redis instance; this can probably be done by passing options on the command line through a local Docker Compose override file.
Here are the options we pass to our Redis instance for cache at musicbrainz.org:
--maxmemory 1GB --maxmemory-policy allkeys-lru --save ""
To pass these options on the command line, please read the quick how-to I wrote about Docker Compose Overrides and adapt "Modify memory settings" to your specific needs; the relevant key is services > redis > command: redis --maxmemory… (see the sketch below).
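For illustration, a minimal sketch of such an override file (the file name is up to you, e.g. local/compose/memory-settings.yml):

```yaml
version: '3.1'

# Description: Customize Redis memory settings
services:
  redis:
    # Override the default command of the redis service to pass the cache-tuning options.
    command: redis-server --maxmemory 1GB --maxmemory-policy allkeys-lru --save ""
```

Docker Compose picks the file up once it is listed in the COMPOSE_FILE environment variable or passed with an additional -f flag.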
@mwiencek: Since you are more knowledgeable than me about Redis use in MusicBrainz, can you please double-check both the reported issue (for a potential bug to be fixed in musicbrainz-server) and my answer (for potential misconceptions)?
Thanks @yvanzo for your reply, I will try this.
It seems that the issue was not specific to Redis. Even after disabling Redis entirely, I kept getting a similar error for Postgres. The problem is that, because of the large number of requests I was sending, the OS of the MusicBrainz container could not create new sockets between itself and the other services.
I noticed with netstat that a huge number of connections were stuck in TIME_WAIT, preventing new connections from being created.
I resolved this by changing tcp_max_tw_buckets in the MusicBrainz Docker image, and now the services are able to run with approximately 100 web workers in parallel without "Could not connect" errors.
Ideally, this could be resolved at the application level, by reusing the connections it creates (e.g. using connection pooling for Postgres).
More information about the issue: https://www.percona.com/blog/2014/12/08/what-happens-when-your-application-cannot-open-yet-another-connection-to-mysql/
I could open a Pull Request with the change in docker-compose.yml if you think this may be useful in other cases.
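For reference, here is a sketch of what such a change might look like as a Compose override. The service name and value are only illustrative, and this approach assumes the kernel exposes net.ipv4.tcp_max_tw_buckets as a namespaced (per-container) sysctl:

```yaml
version: '3.1'

services:
  musicbrainz:
    sysctls:
      # Cap how many sockets the kernel keeps in TIME_WAIT; sockets above the limit
      # are destroyed, freeing ephemeral ports for new outgoing connections.
      net.ipv4.tcp_max_tw_buckets: 10000
```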
> Ideally, this could be resolved at the application level, by reusing the connections it creates (e.g. using connection pooling for Postgres).

In production, we use pgbouncer. Would it be worth including it in musicbrainz-docker too?
> I resolved this by changing tcp_max_tw_buckets in the MusicBrainz Docker image, and now the services are able to run with approximately 100 web workers in parallel without "Could not connect" errors. I could open a Pull Request with the change in docker-compose.yml if you think this may be useful in other cases.

Thanks, if that would be complementary to Postgres connection pooling, yes.
> > Ideally, this could be resolved at the application level, by reusing the connections it creates (e.g. using connection pooling for Postgres).
>
> In production, we use pgbouncer. Would it be worth including it in musicbrainz-docker too?

Maybe it would, even as an optional part.
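pgbouncer multiplexes many short-lived client connections over a small pool of persistent connections to Postgres. Purely as an illustration of what such an optional part could look like (the database name, host, and pool sizes below are assumptions, not the production settings), a minimal pgbouncer.ini might be:

```ini
; "db" is assumed to be the Postgres service name from docker-compose.yml
[databases]
musicbrainz_db = host=db port=5432 dbname=musicbrainz_db

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; session pooling is the safest mode for an application not written with pgbouncer in mind
pool_mode = session
max_client_conn = 500
default_pool_size = 20
```

The MusicBrainz web service would then point its database host/port at pgbouncer (port 6432 here) instead of connecting to Postgres directly.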
> > I resolved this by changing tcp_max_tw_buckets in the MusicBrainz Docker image, and now the services are able to run with approximately 100 web workers in parallel without "Could not connect" errors. I could open a Pull Request with the change in docker-compose.yml if you think this may be useful in other cases.
>
> Thanks, if that would be complementary to Postgres connection pooling, yes.

Yes, it can be complementary to the pooling, and it will also help avoid issues with Redis. I will open a PR shortly.
By the way @yvanzo, do you know if you use non-default values for --max-keepalive-reqs and --keepalive-timeout in Starlet? Setting them also helped me reduce the number of open sockets a bit.
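For example, these Starlet options can be passed on the plackup command line; the worker count and app path below are placeholders, not necessarily how musicbrainz-docker launches the server:

```sh
# Close keep-alive connections after a single request and drop idle ones quickly,
# so fewer sockets linger per worker.
plackup -s Starlet --max-workers 100 \
    --max-keepalive-reqs 1 --keepalive-timeout 2 \
    app.psgi
```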
@nikosmichas have you figured out how to modify the local/compose/memory-settings.yml file to accomplish this? I believe this is working for me:
```yaml
version: '3.1'

# Description: Customize memory settings
services:
  redis:
    command: redis-server --maxmemory 1GB --maxmemory-policy allkeys-lru --save ""
```
Using the values above caused my slave server to sometimes time out on the MQ queue or when checking the index count against the DB. I have removed these redis-server modifications and I have not had the issue anymore.
Correction: the error I am getting is below and started to occur after upgrading to Ubuntu 22.04.1 LTS (x64).
OCI runtime exec failed: exec failed: unable to start container process: open /dev/pts/0: operation not permitted: unknown
From searching on the web, it looks to be an SELinux issue.