After installing a few Debian VMs inside our Windows environment, I noticed very strange performance problems resolving local domain names on local DNS servers this week. Simple queries which should have taken milliseconds would sometimes be very slow. And these slow queries would constantly take 50 deciseconds to resolve - never 49 or less. It looked like a timeout, but logs had no such mentions, and it was hard to tell when the timeouts would occur, except that they would occur more on a first test after I stopped testing for a few minutes. For example, a trivial connection to a local MySQL server could take just above 50 ds to establish:
This was far from MySQL-specific. dig(1) would suffer from the same delays:
; <<>> DiG 9.9.5-9+deb8u3-Debian <<>> @phobos.lan.gvq titan.lan.gvq
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15593
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1280
;; QUESTION SECTION:
;titan.lan.gvq. IN A
;; ANSWER SECTION:
titan.lan.gvq. 3600 IN A 10.10.1.29
;; Query time: 0 msec
;; SERVER: 10.10.1.23#53(10.10.1.23)
;; WHEN: Fri Nov 27 12:14:42 EST 2015
;; MSG SIZE rcvd: 58
...where phobos.lan.gvq is a local DNS server, and titan is just a local hostname which is supposed to resolve very quickly. Attentive readers will notice that Query time indicates 0 ms. This is because the DNS query proper does take 0 ms. The delay comes from the resolution of the name server itself, which I specified by name. This cannot be reproduced with dig if the name server is specified by IP.
This turned out to be an IPv6-related glibc issue. The first big advance came from a Stack Exchange thread, which allowed me to confirm that the delay was due to a timeout in glibc's getaddrinfo(3). This can be achieved with high certitude by changing that delay using the resolv.conf timeout option. glibc's default timeout is 5 seconds. For example, if you notice that the delay decreases to 3 s after setting "options timeout:3", then you are clearly experiencing timeouts. If not, sorry, this post will not help you.
The next step was to determine whether that timeout was IPv6-related. This can be achieved by disabling IPv6 on the GNU clients, but it may be simpler to just set options single-request and single-request-reopen. If none of these helped, you know your problem is caused by timeouts, but the cause is different than ours, and the rest of this post will not help.
If disabling IPv6 helped but single-request and single-request-reopen do not, sorry, I do not know more about your issue. But if single-request or single-request-reopen helped, your problem must be similar to ours. Due to a glibc 2.9 change (see section "DNS NSS improvement"), getaddrinfo() often causes a communication issue between itself and the DNS server when querying either IPv4 or IPv6 addresses due to what Ulrich Drepper describes as server-side breakage. Since at least glibc 2.10, if glibc detects that glitch may have happened, it workarounds by re-sending the queries serially rather than in parallel, so the problem "merely" causes a timeout. If there is a firewall between your DNS server and you, see the Stack Exchange thread above. If a firewall issue is excluded and your DNS server is running Windows Server, you are probably experiencing the same incompatibility as ours.
I first thought our Windows Server 2008 [R1] servers were causing this because of an old bug, but according to a 2014 blog post, this still happens with Windows Server 2012 R2. Although the tcpdump shown on the Stack Exchange thread describes pretty well what is going on, I had to perform my own captures to understand why the timeout would only happen sometimes, and succeeded quickly enough. When the problem does not happen, getaddrinfo() queries both A and AAAA (IPv6) records in parallel in packets 7 and 8 and receives both replies in packets 9 and 10:
Packets 11 and 12 show the DNS query proper, since this capture shows the full activity for the dig command explained above.
When the problem happens, what was packet 9 in capture 1 is gone. Which is why getaddrinfo() retries 5 seconds later (after the gap between packet 26 and 30), in packets 30 and 32, but now sequentially:
Why does the problem happen in capture 2? Surely because of that extra color... the beige ARP packets at 24 and 25. In other words, in the first call, the DNS client's IP address is in the DNS server's ARP cache, so the server does not need to resolve the client IP address. In the second case, the DNS clients's ARP cache in the DNS server has expired, so the server needs to perform an ARP query before being able to send what would be packets 9 and 10 in the first case (I would have thought the server could figure out the ARP address from packets 22 and 23, but apparently that is not how that Windows works).
As explained in Microsoft's ARP caching behavior documentation, in recent Windows versions, an ARP cache record is [usually] maintained for a random time between 30 and 90 seconds after the last time it was used. This must be why that bug was pretty hard to track. Therefore, if the server and the client communicate at least each 30 seconds, this timeout should only be experienced once. This means that in the case of Windows Server DNS servers, the behavior would be the same if glibc didn't fallback to serial queries after the timeout.
Causes and solutions
I have not found a server-side workaround (besides, I guess, disabling IPv6). Unfortunately, I believe this needs to be worked around on every GNU client.
It is more interesting to try determining the root cause of this issue and definitive solutions. glibc developers consider it a Windows bug. But would Microsoft leave a bug which must be triggered millions of times per day unfixed for years?
The captures clearly show that glibc starts with the IPv4 query. Which means the Windows server can only send the AAAA reply after it can send the A reply. In general, that must mean it replies to both. But when the server has to wait for an ARP reply before sending its DNS reply, it may have received the AAAA request before it is able to send the A reply. I would need to perform a server-side capture to confirm that, but it could be that Windows detects that situation and decides to send a single reply to save bandwidth and/or favor IPv6 usage. If the goal was simply to favor IPv6, it would probably be better to just send the AAAA reply before the A reply.
Windows may be doing a heuristic optimization by guessing that the client just needs one address, which would certainly be wrong sometimes. This could be considered a bug in so far as failure to reply constitutes a bug.
DNS clients and the protocol
But there is certainly a client-side issue as well at least in this case. The client requests both an IPv4 address and an IPv6 address while it only needs one. Unless this is a strategy to minimize further queries, this is inefficient.
According to this Stack Overflow thread, it is not clear that requesting both A and AAAA records in a single DNS query is possible. And even that would not be the most optimal solution — that is, requesting whatever single IP address should be used.
From getaddrinfo()'s perspective, it cannot be optimized, since the caller has requested any address to be returned. So the problem is really in dig and other DNS clients calling getaddrinfo() just to resolve a hostname. These clients are all suboptimal. gethostbyname() is optimal, but obsolete since it is not compatible with IPv6. There should be a resolving function which either returns the first IP address obtained, or returns both without blocking while waiting for the second. Clearly, each program cannot implement such a function itself. I do not know glibc, but a C library's API should allow such a resolution. If it doesn't, glibc has an issue too.