the tragedy of gethostbyname

A frequent complaint about Alpine, expressed on a certain website, concerns deficiencies in the musl DNS resolver when querying large zones. The usual response is that applications which expect reliable DNS lookups should use a dedicated DNS library for this task, not the getaddrinfo or gethostbyname APIs, but this is usually rebuffed with comments saying that these APIs are fine to use because they are allegedly reliable on GNU/Linux.

For a number of reasons, the assertion that DNS resolution via these APIs under glibc is more reliable is false, but to understand why, we must look at the history of why a libc is responsible for shipping these functions to begin with, and how these APIs evolved over the years. For instance, did you know that gethostbyname originally didn’t do DNS queries at all? And, the big question: why are these APIs blocking, when DNS is inherently an asynchronous protocol?

Before we get into this, it is important to restate that if you are an application developer, and your application depends on reliable DNS performance, you absolutely must use a dedicated DNS resolver library designed for this task. There are many libraries available that are good for this purpose, such as c-ares, GNU adns, s6-dns and OpenBSD’s libasr. As should hopefully become obvious by the end of this article, the DNS clients included with libc are designed to provide basic functionality only, and there is no guarantee of portable behavior across client implementations.

the introduction of gethostbyname

Where did gethostbyname come from, anyway? Most people believe this function came from BIND, the reference DNS implementation developed by the Berkeley CSRG. In reality, it was introduced to BSD in 1982, alongside the sethostent and gethostent APIs. I happen to have a copy of the 4.2BSD source code, so here is the implementation from 4.2BSD, which was released in early 1983:

struct hostent *
gethostbyname(name)
	register char *name;
{
	register struct hostent *p;
	register char **cp;

	sethostent(0);
	while (p = gethostent()) {
		if (strcmp(p->h_name, name) == 0)
			break;
		for (cp = p->h_aliases; *cp != 0; cp++)
			if (strcmp(*cp, name) == 0)
				goto found;
	}
found:
	endhostent();
	return (p);
}

As you can see, the 4.2BSD implementation only checks the /etc/hosts file and nothing else. This answers the question of why gethostbyname and its successor, getaddrinfo, do DNS queries in a blocking way: the API was designed around a synchronous scan of a local file, and when DNS support was added later, the existing interface was kept rather than replaced with an asynchronous one.

the introduction of DNS to gethostbyname

DNS resolution was first added to gethostbyname in 1984, when DNS support was introduced to BSD. This version, which is too long to include here, also translated dotted-quad IPv4 addresses into a struct hostent. In essence, the 4.3BSD implementation does the following:

  1. If the requested hostname begins with a digit, try to parse it as a dotted quad. If this fails, set h_errno to HOST_NOT_FOUND and bail. Yes, this means 4.3BSD would fail to resolve hostnames like 12-34-56-78.static.example.com.
  2. Attempt to do a DNS query using res_search. If the query was successful, return the first IP address found as the struct hostent.
  3. If the DNS query failed, fall back to the original /etc/hosts searching algorithm above, now called _gethtbyname and using strcasecmp instead of strcmp (for consistency with DNS).
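
The lookup order above can be sketched in C. This is a simplified illustration, not the 4.3BSD source: lookup_order, is_dotted_quad and the enum names are hypothetical, the DNS query and the hosts-file scan are reduced to boolean parameters, and the dotted-quad check is stricter than the historical address parser.

```c
#include <ctype.h>

/* Which stage decided the outcome (illustrative names, not from BSD). */
enum lookup_result {
	LOOKUP_LITERAL,    /* name parsed as a dotted quad */
	LOOKUP_NOT_FOUND,  /* began with a digit but was not a dotted quad */
	LOOKUP_DNS,        /* answered by the DNS query */
	LOOKUP_HOSTS_FILE, /* answered by the /etc/hosts fallback */
	LOOKUP_FAILED      /* neither DNS nor the hosts file matched */
};

/* Returns 1 if name is exactly four dot-separated decimal octets. */
static int is_dotted_quad(const char *name)
{
	int parts = 0;

	for (;;) {
		int digits = 0, value = 0;

		while (isdigit((unsigned char)*name)) {
			value = value * 10 + (*name++ - '0');
			if (++digits > 3 || value > 255)
				return 0;
		}
		if (digits == 0)
			return 0;
		parts++;
		if (*name == '\0')
			return parts == 4;
		if (*name++ != '.')
			return 0;
	}
}

/*
 * dns_answers and hosts_has_entry stand in for the results of the
 * real DNS query and hosts-file scan (assumptions for illustration).
 */
static enum lookup_result
lookup_order(const char *name, int dns_answers, int hosts_has_entry)
{
	/* Step 1: a leading digit commits to the literal-address path. */
	if (isdigit((unsigned char)name[0]))
		return is_dotted_quad(name) ? LOOKUP_LITERAL : LOOKUP_NOT_FOUND;

	/* Step 2: DNS is consulted first. */
	if (dns_answers)
		return LOOKUP_DNS;

	/* Step 3: the hosts file is only a fallback for DNS failure. */
	if (hosts_has_entry)
		return LOOKUP_HOSTS_FILE;

	return LOOKUP_FAILED;
}
```

In this sketch, lookup_order("12-34-56-78.static.example.com", 1, 1) yields LOOKUP_NOT_FOUND even though both DNS and the hosts file could have answered, which is the failure mode described in step 1.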

A fixed version of this algorithm was also included with BIND’s libresolv as res_gethostbyname, and the res_search and related functions were imported into BSD libc from BIND.

standardization of gethostbyname in POSIX

The gethostbyname and getaddrinfo APIs were first standardized in the X/Open Networking Services Issue 4 specification (commonly referred to as XNS4), which itself was part of the first version of the X/Open Single Unix Specification (commonly referred to as SUSv1), released in 1995. Of note, X/Open tried to deprecate gethostbyname in favor of getaddrinfo as part of the XNS5 specification, removing it entirely except for a mention in their specification for netdb.h.

Later, it returned as part of POSIX issue 6, released in 2004. That version says:

Note: In many cases it is implemented by the Domain Name System, as documented in RFC 1034, RFC 1035, and RFC 1886.

POSIX issue 6, IEEE 1003.1:2004.

Oh no, what is this about, and do application developers need to care about it? Very simply, it is about the Name Service Switch, frequently referred to as NSS, which allows the gethostbyname function to have hotpluggable implementations. The Name Service Switch was a feature introduced to Solaris, which was implemented to allow support for Sun’s NIS+ directory service.

As developers of other operating systems wanted to support software like Kerberos and LDAP, it was quickly reimplemented in other systems as well, such as GNU/Linux. These days, systems running systemd frequently use this feature in combination with a custom NSS module named nss-resolve to force use of systemd-resolved as the DNS resolver, which behaves differently from the original DNS client derived from BIND that ships in most libc implementations.

An administrator can disable support for DNS lookups entirely, simply by editing the /etc/nsswitch.conf file and removing the dns module from the hosts entry. Application developers depending on reliable DNS service need to care a lot about this: on systems with NSS, your application cannot depend on gethostbyname to actually support DNS at all.
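
To make this concrete, the sketch below checks whether a single line in nsswitch.conf syntax is a hosts entry that lists the dns source. hosts_line_uses_dns is a hypothetical helper, not a libc or NSS API, and it deliberately ignores action items such as [NOTFOUND=return], which can end a lookup before the dns source is ever reached.

```c
#include <string.h>

/*
 * Returns 1 if line is a "hosts:" entry that names the "dns" source,
 * 0 otherwise. Illustrative only: it inspects one line of text and
 * does not read /etc/nsswitch.conf itself.
 */
static int hosts_line_uses_dns(const char *line)
{
	/* Skip leading whitespace and require the "hosts:" database. */
	line += strspn(line, " \t");
	if (strncmp(line, "hosts:", 6) != 0)
		return 0;
	line += 6;

	for (;;) {
		size_t len;

		/* Move to the next token; stop at end of line or a comment. */
		line += strspn(line, " \t");
		if (*line == '\0' || *line == '#' || *line == '\n')
			return 0;

		len = strcspn(line, " \t#\n");
		if (len == 3 && strncmp(line, "dns", 3) == 0)
			return 1;
		line += len;
	}
}
```

A line such as "hosts: files dns" returns 1, while "hosts: files" returns 0. On a system configured like the latter, a gethostbyname call would not go through the dns module at all.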

musl and DNS

Given the background above, it should be obvious by now that musl’s DNS client was written under the assumption that applications that have specific requirements for DNS would be using a specialized library for this purpose, as gethostbyname and getaddrinfo are not really suitable APIs, since their behavior is entirely implementation-defined and largely focused around blocking queries to a directory service.

Because of this, the DNS client was written to behave as simply as possible, but the use of DNS for bulk data distribution, such as in DNSSEC, DKIM and other applications, has led to a desire to implement support for DNS over TCP as an extension to the musl DNS client.
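
As background on why TCP support matters: a DNS response that does not fit in a UDP datagram is sent with the truncation (TC) flag set in its header, as described in RFC 1035, and a client has to repeat the query over TCP to get the full answer. The sketch below shows only that check. dns_response_truncated and the two constants are hypothetical names, and none of this is taken from musl.

```c
#include <stddef.h>
#include <stdint.h>

#define DNS_HEADER_LEN 12   /* fixed size of the DNS message header */
#define DNS_FLAG_TC    0x02 /* TC bit within the third header byte */

/*
 * Returns 1 if msg is a DNS message whose header has the TC flag set,
 * meaning the answer was cut short and should be retried over TCP.
 * Returns 0 for complete messages and for buffers too short to hold
 * a header.
 */
static int dns_response_truncated(const uint8_t *msg, size_t len)
{
	if (len < DNS_HEADER_LEN)
		return 0;
	return (msg[2] & DNS_FLAG_TC) != 0;
}
```

A resolver that only speaks UDP has no way to act on this flag, so large record sets come back incomplete or not at all.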

In practice, this will fix the remaining complaints about the musl DNS client once it lands in a musl release, but application authors depending on reliable DNS performance should really use a dedicated DNS client library for that purpose: using APIs that were designed to simply parse /etc/hosts and had DNS support shoehorned into them will always deliver unreliable results.