Recently, I was working on a system and noticed some odd behavior – running a simple db2pd command was taking what felt like a “long” time. I used the time utility to validate my impression:

db2inst1@db2server $ time db2pd -
 
Database Member 0 -- Active -- Up 0 days 21:53:44 -- Date 2021-07-22-11.29.15.283543
 

real    0m5.560s
user    0m0.132s
sys     0m0.081s

A very simple db2pd command should not take 5.6 seconds to complete.

Identifying the problem

I considered using the Db2 trace facility to help identify the issue, but on a busy system this could produce a huge amount of output. Instead, since db2pd is a standalone utility, I chose to use the linux strace tool, which displays every system call that an application makes. There are a number of options for displaying the time that each call takes, but I just used the -t option which displays the current time when each call occurs.

db2inst1@db2server $ strace -t db2pd -
10:54:59 execve("/home/db2inst1/sqllib/adm/db2pd", ["db2pd", "-"], 0x7ffe8779fe00 /* 24 vars */) = 0
10:54:59 access("/etc/suid-debug", F_OK) = -1 ENOENT (No such file or directory)
10:54:59 brk(NULL)                      = 0x339d000
10:54:59 fcntl(0, F_GETFD)              = 0
10:54:59 fcntl(1, F_GETFD)              = 0
10:54:59 fcntl(2, F_GETFD)              = 0
10:54:59 access("/etc/suid-debug", F_OK) = -1 ENOENT (No such file or directory)
10:54:59 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
10:54:59 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6284375000
10:54:59 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
10:54:59 openat(AT_FDCWD, "/DoNotCreateThisPath_marker1.*chglibpath/tls/haswell/x86_64/libdb2osse.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
10:54:59 stat("/DoNotCreateThisPath_marker1.*chglibpath/tls/haswell/x86_64", 0x7ffcce353200) = -1 ENOENT (No such file or directory)
10:54:59 openat(AT_FDCWD, "/DoNotCreateThisPath_marker1.*chglibpath/tls/haswell/libdb2osse.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

[...]

The strace command generates a lot of output when running this simple db2pd command – over 1750 lines – so it’s easiest to redirect the output to a file. It writes the system call information to STDERR, so I redirected STDERR to a file:

db2inst1@db2server $ strace -t db2pd - 2>strace.out

Database Member 0 -- Active -- Up 2 days 23:04:59 -- Date 2021-07-22-10.58.29.257683

To find the source of slowness, I looked through the output file, looking for any significant differences between subsequent timestamps.

db2inst1@db2server $ awk '{print $1}' strace.out | uniq
12:27:58
12:27:59
12:28:00
12:28:01
12:28:02
12:28:03
12:28:08
12:28:09

In the output above, there was a 5 second period between 12:28:03 and 12:28:08 where no output appeared. Here is an excerpt from the strace output from either side of that time gap:

12:28:03 recvfrom(9, "2_\201\202\0\1\0\0\0\0\0\0\rdb2server\0\0\1\0\1", 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.0.27")}, [28->16]) = 31
12:28:03 poll([{fd=9, events=POLLIN}], 1, 5882) = 0 (Timeout)
12:28:08 close(7)                       = 0
12:28:08 close(8)                       = 0
12:28:08 close(9)                       = 0

The last call before the time call was a call to poll(), which is waiting for input from file descriptor 9 (fd=9). The (Timeout) at the end of the line indicates that the call to poll() eventually timed out (after 5 seconds) – this timeout is the source of the delay.

To determine what happened, we need to understand what file descriptor 9 is. Searching backwards through the strace output reveals:

12:28:02 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 9
12:28:02 setsockopt(9, SOL_IP, IP_RECVERR, [1], 4) = 0
12:28:02 connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.0.27")}, 16) = 0

The socket() call returns 9 (the file descriptor for the socket), and the connect() call reveals that the process is opening a connection to the IP address 10.0.0.27, port 53.

Port 53 is used for DNS lookups (as can be seen in /etc/services) – so we know that a DNS query was timing out. The recvfrom() call at 12:28:03 (just prior to the poll() call) indicates that the DNS search was for “db2server” – which is the hostname of the Db2 server!

The nslookup command can be used to see what DNS knows:

db2inst1@db2server $ nslookup db2server
;; Got SERVFAIL reply from 10.1.0.55, trying next server
;; Got SERVFAIL reply from 10.0.0.203, trying next server
Server:         10.0.0.27
Address:        10.0.0.27#53

** server can't find db2server: SERVFAIL

This output shows that nslookup tried all 3 DNS servers, and each failed – there was no entry in DNS.

However: normally, resolving the local server’s host name is handled by /etc/hosts and not by DNS; so I looked at /etc/nsswitch.conf to make sure that the order is configured properly:

db2inst1@db2server $ grep 'hosts:' /etc/nsswitch.conf 
hosts:      files dns myhostname

This looks correct: hostname lookups will first use local files (/etc/hosts), and then resort to DNS.

Knowing this, it’s clear that there is no entry in /etc/hosts for the local hostname, and that is the reason why db2pd is querying DNS, the source of the problem.

After adding an entry for the local server to /etc/hosts, db2pd was running with a normal response time:

db2inst1@db2server $ time db2pd -

Database Member 0 -- Active -- Up 0 days 21:53:44 -- Date 2021-07-22-11.29.15.283543


real    0m0.184s
user    0m0.119s
sys     0m0.061s

Even on systems that do not use multiple servers (i.e., DPF or pureScale), it’s still critical to have proper entries in the /etc/hosts file.