librespot crashes when one of defined DNS servers is not working
Description
Spotify did not connect. In logs I noticed errors that watchdog detected Spotify crash and restarted it. Enabled logs, and noticed in librespot the attached log, when connecting to Spotify connect. It was strange because some of connections in log worked, and some did not. Then I checked resolv.conf and noticed that first DNS provided by DHCP was incorrect (192.168.10.10 - currently not working), second one was fine (192.168.10.1). I removed the first DNS, and Spotify connect worked.
Version
librespot 0.6.0 383a6f6 (Built on 2024-11-02, Build ID: E4uuQIK1, Profile: release) Moodeaudio 9.3.2
How to reproduce
Set incorrect primary and correct secondary DNS, like:
pi@moodeaudio:/var/log $ cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 192.168.10.10
nameserver 192.168.10.1
Log
[2025-04-20T09:49:15Z DEBUG librespot_core::http_client] Requesting https://clienttoken.spotify.com/v1/clienttoken
[2025-04-20T09:49:19Z DEBUG librespot_core::spclient] Received a granted token
[2025-04-20T09:49:19Z TRACE librespot_core::spclient] Got client token: GrantedTokenResponse { token: "AABNyMDFPBffQv5....u9hwDYdtV4DSu", expires_after_seconds: 1216800, refresh_after_seconds: 1209600, domains: [TokenDomain { domain: "spotify.com", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }, TokenDomain { domain: "spotify.net", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }], special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }
[2025-04-20T09:49:19Z DEBUG librespot::component] new ApResolver
[2025-04-20T09:49:19Z DEBUG librespot_core::http_client] Requesting https://apresolve.spotify.com/?type=accesspoint&type=dealer&type=spclient
[2025-04-20T09:49:22Z INFO librespot_core::session] Connecting to AP "ap-gew1.spotify.com:4070"
[2025-04-20T09:49:25Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:25Z DEBUG librespot_core::connection] Retry access point...
[2025-04-20T09:49:28Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:28Z WARN librespot_core::session] Try another access point...
[2025-04-20T09:49:28Z INFO librespot_core::session] Connecting to AP "ap-gew1.spotify.com:443"
[2025-04-20T09:49:31Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:31Z DEBUG librespot_core::connection] Retry access point...
[2025-04-20T09:49:34Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:34Z WARN librespot_core::session] Try another access point...
[2025-04-20T09:49:34Z INFO librespot_core::session] Connecting to AP "ap-gew1.spotify.com:80"
[2025-04-20T09:49:37Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:37Z DEBUG librespot_core::connection] Retry access point...
[2025-04-20T09:49:40Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:40Z WARN librespot_core::session] Try another access point...
[2025-04-20T09:49:40Z INFO librespot_core::session] Connecting to AP "ap-guc3.spotify.com:4070"
[2025-04-20T09:49:43Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:43Z DEBUG librespot_core::connection] Retry access point...
[2025-04-20T09:49:46Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:46Z WARN librespot_core::session] Try another access point...
[2025-04-20T09:49:46Z INFO librespot_core::session] Connecting to AP "ap-gew4.spotify.com:443"
[2025-04-20T09:49:49Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:49Z DEBUG librespot_core::connection] Retry access point...
[2025-04-20T09:49:52Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:52Z WARN librespot_core::session] Try another access point...
[2025-04-20T09:49:52Z INFO librespot_core::session] Connecting to AP "ap-gae2.spotify.com:80"
[2025-04-20T09:49:56Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:56Z DEBUG librespot_core::connection] Retry access point...
[2025-04-20T09:49:59Z DEBUG librespot_core::connection] Connection failed: timed out
[2025-04-20T09:49:59Z ERROR librespot_core::session] Tried too many access points
[2025-04-20T09:49:59Z ERROR librespot] could not initialize spirc: Deadline expired before operation could complete { timed out }
Host (what you are running librespot on):
Moodeaudio 9.3.2 RPI 3B+
Additional context
We'd need to see an actual crash in the log to take any action here.
pi@moodeaudio:/var/log $ moodeutl -l
...
20250420 113351 worker: --
20250420 113351 worker: -- Miscellaneous
20250420 113351 worker: --
20250420 113351 worker: Software update: Automatic check off
20250420 113351 worker: Auto-CoverView: off
20250420 113351 worker: CoverView timeout: Never
20250420 113351 worker: Auto-shuffle: off
20250420 113351 worker: Auto-play: off
20250420 113351 worker: Maintenance task: 360 mins
20250420 113351 DEBUG: sendFECmd(): File open failed, UI has never been opened in Browser
20250420 113351 worker: Current view: reset to Playback
20250420 113351 worker: Active flags: all false
20250420 113351 worker: Reset flags: skipped
20250420 113351 worker: Session check: ok
20250420 113351 worker: --
20250420 113351 worker: -- Startup complete
20250420 113351 worker: --
20250420 113351 worker: Mount monitor: started
20250420 113351 worker: Radio monitor: off
20250420 113353 worker: Watchdog monitor: started
20250420 113353 worker: Responsiveness: Default
20250420 113353 DEBUG: Sleep intervals: worker=3, waitworker=1, watchdog=6, mountmon=30, mpdmon=6, gpiobuttons=1
20250420 113353 worker: Ready
20250420 113527 watchdog: Started Spotify Connect after crash detected
20250420 113528 DEBUG: sendFECmd(): File open failed, UI has never been opened in Browser
20250420 113528 DEBUG: startSpotify(): (librespot --name "moode" --bitrate 320 --format S16 --mixer softvol --initial-volume 5 --volume-ctrl log --volume-range 60 --cache /var/local/www/spotify_cache --disable-audio-cache --backend alsa --device "_audioout" --onevent /var/local/www/commandw/spotevent.sh -v > /var/log/moode_librespot.log 2>&1 &)
20250420 113732 watchdog: Started Spotify Connect after crash detected
20250420 113735 DEBUG: sendFECmd(): File open failed, UI has never been opened in Browser
20250420 113735 DEBUG: startSpotify(): (librespot --name "moode" --bitrate 320 --format S16 --mixer softvol --initial-volume 5 --volume-ctrl log --volume-range 60 --cache /var/local/www/spotify_cache --disable-audio-cache --backend alsa --device "_audioout" --onevent /var/local/www/commandw/spotevent.sh -v > /var/log/moode_librespot.log 2>&1 &)
Right ok, so there's no crash shown in the first log because there's no actual "crash" occurring at all. This is just librespot exiting with error code 1 because the system is badly configured.
This seems reasonable behaviour to me. What would you prefer librespot to do in the case when the network is broken and it cannot reliably connect??
I assumed it was crash, because of the word 'crash' in this line:
20250420 113527 watchdog: Started Spotify Connect after crash detected
What would you prefer librespot to do in the case when the network is broken and it cannot reliably connect??
I don’t think the network is “broken.” There are almost always configured at least two DNS servers for redundancy. If the primary DNS server goes down, the system should simply fall back to the secondary one. I’ve had this mis‑configured setup running for several months on multiple PCs, Macs, and phones, and all networking has worked fine—otherwise I’d have noticed sooner.
So this may not be a librespot issue at all, but rather a problem in the underlying connectivity library or the operating system itself.
Perhaps it's related to the different way we connect to HTTP and APs (access point) servers. Your log shows it takes ~3 seconds for each of the first two HTTP requests to work, they use Hyper's HTTP client. There must be a 3 second timeout somewhere before it tries to use the secondary DNS. Our Spotify AP connections are different, we directly use Tokio sockets and enforce a specific 5 second timeout before giving up and trying the next AP. Before we implemented this timeout we would get stuck on broken servers, a 5 second timeout seemed reasonable (to me) but it was pretty arbitrary.
I agree with you, the AP connections should have used the secondary DNS and it appears like they didn't. I'd have assumed the DNS timeout and retry was done by the OS, and it looks to be 3 seconds, so our 5 seconds should have been enough. Maybe that's a bad assumption.
Perhaps it's related to the different way we connect to HTTP and APs (access point) servers. Your log shows it takes ~3 seconds for each of the first two HTTP requests to work, they use Hyper's HTTP client. There must be a 3 second timeout somewhere before it tries to use the secondary DNS. Our Spotify AP connections are different, we directly use Tokio sockets and enforce a specific 5 second timeout before giving up and trying the next AP. Before we implemented this timeout we would get stuck on broken servers, a 5 second timeout seemed reasonable (to me) but it was pretty arbitrary.
I agree with you, the AP connections should have used the secondary DNS and it appears like they didn't. I'd have assumed the DNS timeout and retry was done by the OS, and it looks to be 3 seconds, so our 5 seconds should have been enough. Maybe that's a bad assumption.
I’m not sure. What’s odd is that, for example, the pre‑configured radio stations streamed just fine—they’re saved by domain name, not by IP—while Spotify Connect didn’t work at all, as though it were using a different DNS resolver. For example, AFAIK Firefox, doesn’t use system level DNS resolver. Note that the incorrect DNS server was listed first in resolv.conf. As far as I know, Linux nameservers are used in the order they’re defined.
the pre‑configured radio stations streamed just fine
I'm not sure I understand, is this something outside of librespot? Some other moode feature?
librespot just uses the normal libc DNS stuff. The DNS servers will be used in the order they are specified. There's some options in resolv.conf to change the timeout and number of retries. I think the default timeout is 5 seconds, not 3, so that's a bit odd. Maybe my theory is wrong. I think we need to employ wireshark and trace exactly what's going on, I'll give it a go at the weekend. Maybe there's something we can improve here around the AP connections.
Just adding another data point of this issue. I originally thought it was this https://github.com/librespot-org/librespot/issues/1481#issuecomment-2781062014 but I never got it to reconnect.
Similar to povserok I had a primary DNS server in /etc/resolv.conf down and network functionality has been fine for everything except librespot. For what it's worth I'm using it through snapserver (snapcast).
I've also had this experience:
I’ve had this mis‑configured setup running for several months on multiple PCs, Macs, and phones, and all networking has worked fine—otherwise I’d have noticed sooner.
This was also the case for me:
Note that the incorrect DNS server was listed first in resolv.conf. As far as I know, Linux nameservers are used in the order they’re defined.
Removing the temporarily down DNS server from resolv.conf is an acceptable workaround for me, and I'll also investigate
There's some options in resolv.conf to change the timeout and number of retries.
to see if I can come up with a timeout value that works with the
a 5 second timeout seemed reasonable (to me) but it was pretty arbitrary.
otherwise maybe I'll investigate upping the timeout in librespot to see if it can just work™ by default on my system (debian 12, AMD64)
Ahh it seems debian defaults to 5 seconds for the timeout.
From the docs (emphasis mine):
timeout:n Sets the amount of time the resolver will wait for a response from a remote name server before retrying the query via a different name server. This may not be the total time taken by any resolver API call and there is no guarantee that a single resolver API call maps to a single timeout. Measured in seconds, the default is RES_TIMEOUT (currently 5, see <resolv.h>). The value for this option is silently capped to 30.
In my resolv.conf setting the following did indeed work (proving @kingosticks assumption that 5 seconds would be enough if the DNS timeout was 3s, and you only had the first DNS server not working, more DNS servers not working would take longer to loop through each server and timeout on them):
# /etc/resolv.conf
options timeout:3 # This was added
nameserver 192.168.1.53 # This is the non-existent DNS server
nameserver 192.168.1.50
I wonder if we could increase librespots arbitrary timeout to maybe 12s (allowing for 2 default timeouts, also pretty arbitrary). I have no idea what the other side of this problem is with having this value too large apart from this comment:
Before we implemented this timeout we would get stuck on broken servers
I'm happy to have a crack at contributing/submitting a PR if increasing the timeout is something the project thinks is worth doing?
Or perhaps it is actually a good idea to (from the original PR):
Would it make sense to make the timeout customizable?
Personally I think adding extra code/config to workaround broken systems isn't worthwhile but that's just my opinion.
I agree with @kingosticks. I've been down that road some time ago with pleezer and it just wasn't worth it. The moment you think you've worked around one DNS configuration quirk, you run into another.
TLDR:
- Code/config for
librespotto support broken systems is undesirable, however this problem can occur for many reasons, not just alibrespotservers config, but transient upstream DNS server issues. - On my system and other docs I've tried to look at the default DNS timeout is 5s.
- I'm interested which systems have a 3s DNS timeout, I've tried to have a little look but it would be quicker to know what system you're (@kingosticks) seeing 3s on?
- Does the algorithm want to be default DNS timeout plus 2s? (or perhaps 2x default DNS timeout, plus 2s)
- Does it make sense to make this value customisable by the end user if systems have different default DNS timeouts?
- What are the issues with having
librespot's AP arbitrary timeout too large? - If none of 4 or 5 above is wanted by the project then we should at least add something to the docs surrounding the AP connections vs DNS timeouts.
- I'm super thankful for this project, happy to jump on a call to discuss further. Also happy to put in a PR if the project wants?
I 100% agree with this statement:
I think adding extra code/config to workaround broken systems isn't worthwhile
But I'm thinking beyond broken systems, I'm thinking about generalising this issue, robustness for librespot, and to avoid the librespot project getting unneeded issues like this raised in the future. I think it was a mistake for me to quote the section that used the term mis-configured for the resolve.conf in my original message on this thread, as DNS timeouts can happen for reasons beyond the librespot server's DNS config and for other issues that need a retry.
DNS is a service that can have timeouts for a myriad reasons. In my case on the server running librespot there was nothing wrong with my reslov.conf config, there was an issue on the primary DNS server it was pointing too, but I happen to know that because I control that server. It is a feature to be able to provide multiple nameservers in DNS to be able to still have connectivity when a name server is down, among other reasons such as specifying the closest DNS server. For example on my gateway router I specify the ISP's DNS server as the primary because it's the closest/fastest response and then googles 8.8.8.8 as a secondary for the times when my ISP DNS server is having issues, this allows me to be more robust to DNS server issues.
The internet is a scary place and routers drop packets all the time, DNS timeouts can occur in transient ways for other reasons beyond a systems resolv.conf settings.
I think it would be reasonable to at least increase librespot's timeout to a value above the default DNS timeout on Linux systems. I've tried to look to see any system that has a default DNS timeout of 3 seconds and can't, I'm happy to be pointed to documentation though.
From this comment by @kingosticks it seems that retrying at least once and an algorithm of default DNS timeout plus 2 seconds is desired. In this case that would mean 7s. I'm also suggesting perhaps 2 timeouts plus 2 seconds maybe desirable so the 12s I was suggesting in an earlier message. Even better would be to let the system OS handle the timeout and use it's own defaults (if that is at even possible) so this would work for any system it was running on (but maybe that's a bigger change, so I'm not suggesting that as a solution for now).
If the default timeout is found to be different on other systems then that is a case for making librespot's timeout be configurable by the end user.
I'm also interested in what the impacts of having this arbitrary AP timeout too large are? Unfortunately I'd have to read a lot of code to understand that, so I'm relying on somebody to make me aware of what the problem of having librespots's AP timeout too large would be.
Failing any of the above there should at least be some tips around this in a FAQ or otherwise in the docs.
I'm happy to pull together a PR for increasing librespot's AP timeout, making it customisable by the end user and/or adding a FAQ/Troubleshooting entry to the docs.
Also happy to jump on a call to discuss any of this as sometime these convos can come across the wrong way via text. I'm really thankful for this project and all the work that goes into it and would like to help if I can.
In the original port above I was speculating a 3 second timeout based on the log we got. I said the 3 seconds in the log didn't match the documented Linux default of 5 seconds. We never worked out why there was a discrepancy. I am still curious why the earlier http requests seem to work differently, I didn't understand that and I didn't look further.
A long timeout on each AP connection attempt can make starting a Spotify session slow. With a long-lived session created at startup that's likely OK but librespot is a library and not all use-cases are like that. I primarily use librespot within gst-plugin-spotify where the architecture requires a session per song and slow session connection means playback starts slowly and it's noticeable. I also use it behind a corporate firewall where some ports are blocked, causing APs using those ports to get stuck during their connection. So I have been relying on the short timeout, but I appreciate this use-case is niche. Maybe a mechanism to better control/filter APs would be nicer but that's another issue.
In general, applications should be able to avoid thinking too much about this stuff and let lower levels handle this correctly and consistently. Our case here is a little bit different, we've been given a list of unreliable AP servers to try, they can even have perfect connectivity but still be unusable as an AP (returning a Spotify error code). Failing fast is the right thing to do in this context but you're right, we also need to go slow enough for IP connections to (appear to) be reliable. So I guess the balance here isn't quite right. However, just fiddling the numbers a bit to avoid this particular OS default just moves the problem.
So yeh, we could again revisit making the timeout an optional parameter, keeping a short default, and using a longer value in the librespot binary. Or the opposite, I guess (and make it a breaking change). Just keep in mind the algorithm we currently have wasn't focused at all on supporting user DNS problems, it was to find a working Spotify AP from the list assuming a working network. Most of the time the same thing will work for both but be careful on doing anything too specific for the secondary case.
I'd still prefer not to expose an extra config option to librespot binary users. Just having it configurable at the library level is far simpler than spending our cycles trying to come up with a single algorithm that works for all cases.
Ahh sorry I missed that about the log above showing the 3s timeout.
Okay that makes sense to me, so we have a situation where some library use cases need the timeout to be as quick as possible and others as robust/retryable as possible. And also where the failure modes of the AP aren't always obvious in a programmatic sense. I'll have a think a bit more about any other ideas but at this stage I'm thinking adding something to the docs to deal with DNS issues like mine might be the best scenario (or even this ticket if people search for it may be enough).
This is a good hint:
I am still curious why the earlier http requests seem to work differently, I didn't understand that and I didn't look further.
At a guess it's regular TCP connections versus websocket connections?
They should be regular http connections, not even websocket. But they do use a library and maybe that library has some different handling somewhere. I'm not sure it is that interesting unless you're really interested!