fgci-ansible

Improve DNS

Open jabl opened this issue 8 years ago • 4 comments

Occasionally our users are hitting slurm problems like

    sbatch: error: Unable to resolve "slurmctld-host.example.org": Unknown host
    sbatch: error: Unable to establish control machine address
    sbatch: error: Batch job submission failed: No error

We're not 100% sure why this happens. My best guess at the moment is that the DNS server for the cluster internal net (dnsmasq on the install host) doesn't answer fast enough, so the next entry in /etc/resolv.conf is tried. That entry is an external DNS server which doesn't know anything about the cluster internal net, and thus we get the failure.

One thing that might make us especially susceptible to this is that, since we use sssd, we have disabled nscd, which normally caches DNS lookups.
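
The suspected failure mode can be illustrated with a small simulation. This is only a sketch of the hypothesis above, not the actual glibc resolver logic: it assumes that a timeout makes the resolver move on to the next nameserver, while a definite "unknown host" answer is treated as final. The function names and the 10.1.1.2 address are made up for illustration.

```python
from typing import Callable, Optional, Sequence


class Timeout(Exception):
    """Raised by a simulated nameserver that does not answer in time."""


# A simulated nameserver returns an IP address, returns None for
# "unknown host", or raises Timeout.
Nameserver = Callable[[str], Optional[str]]


def resolve(name: str, nameservers: Sequence[Nameserver]) -> Optional[str]:
    """Try nameservers in /etc/resolv.conf order.

    Assumption: a timeout moves on to the next server, while any definite
    answer (including "unknown host") ends the lookup.
    """
    for query in nameservers:
        try:
            return query(name)
        except Timeout:
            continue
    return None


# Hypothetical internal record; the address is invented for this example.
INTERNAL_RECORDS = {"slurmctld-host.example.org": "10.1.1.2"}


def internal_dns(name: str) -> Optional[str]:
    """Internal server (e.g. dnsmasq on the install host) answering in time."""
    return INTERNAL_RECORDS.get(name)


def slow_internal_dns(name: str) -> Optional[str]:
    """The same internal server when it is too slow to answer."""
    raise Timeout()


def external_dns(name: str) -> Optional[str]:
    """External server: knows nothing about the cluster internal net."""
    return None


host = "slurmctld-host.example.org"
normal_case = resolve(host, [internal_dns, external_dns])
failure_case = resolve(host, [slow_internal_dns, external_dns])
```

Under these assumptions `normal_case` holds the internal address, while `failure_case` is `None`, which corresponds to the "Unknown host" error seen by sbatch.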

I think a better DNS setup would be something like

  • We should run a backup DNS server for the internal net, in case the install node is down or doesn't answer fast enough. I'd guess the admin node could be a good choice for this, except that at least on our system /etc/hosts on the admin-node also contains the IPMI addresses, so we'd need another file for dnsmasq to read the hosts from.
  • The DNS servers should be configured to recurse to the external DNS servers for any records they are not authoritative for.
  • All other nodes, which aren't DNS servers, should run a local caching DNS resolver. dnsmasq or unbound seem to be the best choices here; the consensus on the Internet (TM) seems to be that nscd DNS caching is crap and should not be used. So on these nodes /etc/resolv.conf should contain 127.0.0.1 as the only nameserver.
  • The local DNS caches should recurse to the authoritative DNS servers for the internal net, and never directly to the outside DNS servers.
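
As a rough way to verify the resolv.conf part of this proposal on a node, something like the following could be used. It is a minimal sketch with hypothetical helper names: it only looks at `nameserver` lines, skips comment lines starting with `#` or `;`, and says nothing about whether a caching resolver is actually listening on 127.0.0.1.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_only_local_cache(resolv_conf_text: str) -> bool:
    """True if 127.0.0.1 is the one and only configured nameserver."""
    return nameservers(resolv_conf_text) == ["127.0.0.1"]
```

On a real node the text would come from reading /etc/resolv.conf, for example `uses_only_local_cache(open("/etc/resolv.conf").read())`.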

jabl avatar Jun 10 '16 09:06 jabl

Yes, DNS has been a weak point since the beginning. I'll have to study the implications a bit more, but what you're suggesting makes sense. Let's see if we can get this properly done and tested before September.

A1ve5 avatar Jun 10 '16 12:06 A1ve5

I started an effort to set up a caching stub resolver on all the nodes at https://github.com/jabl/ansible-role-systemd-resolved (using systemd-resolved). Unfortunately it turns out that systemd-resolved v219 in EL 7.2 doesn't resolve short hostnames correctly. However, as far as I've been able to determine, newer versions should at least have some improvements here, and I guess EL 7.3 will rebase systemd to a newer version. So if nothing else, one could wait a few more months until EL 7.3 is out (a beta was recently released) and check again.

jabl avatar Aug 31 '16 12:08 jabl

Added for review - are these problems gone now?

martbhell avatar Sep 21 '16 06:09 martbhell

So far we haven't had any complaints, so I suppose the /etc/hosts thing fixed the cluster internal name resolution woes. That being said, to robustly resolve external names, something like the original suggestion above is probably still needed. Of course, that's not nearly as critical as, say, a job failing to look up the slurm controller.

jabl avatar Sep 21 '16 06:09 jabl