Monitoring for process completion in 2021

A historical defect in the ifupdown suite has been the lack of proper supervision of the processes it runs in order to bring interfaces up and down. Specifically, it is possible in historical ifupdown for a process to hang forever, at which point the system will fail to finish configuring interfaces. As interface configuration is part of the boot process, this means that the boot process can potentially hang forever and fail to complete. Accordingly, we have introduced correct supervision of processes run by ifupdown-ng in the upcoming version 0.12, with a 5-minute timeout.

Because ifupdown-ng is intended to be portable, we had to implement two versions of the process completion monitoring routine. The portable version is a busy loop which sleeps for 50 milliseconds between iterations, and the non-portable version uses Linux process descriptors, a feature introduced in Linux 5.3. On earlier kernels, ifupdown-ng will downgrade to the portable implementation. There are also a couple of other ways to monitor for process completion using notifications, but they were not appropriate for the ifupdown-ng design.
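
One way such a downgrade decision could be made at run time is to probe for the system call once and fall back when it is unavailable. The sketch below is only an assumption about how this might look: pidfd_supported is a hypothetical helper name, not something taken from ifupdown-ng, and it treats any failure (ENOSYS on older kernels, or a sandbox refusing the call) as "not supported".

```c
#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical probe: true if the running kernel accepts pidfd_open(2). */
bool
pidfd_supported(void)
{
#if defined(__linux__) && defined(__NR_pidfd_open)
    /* Opening a pidfd for our own process has no side effects. */
    int fd = (int) syscall(__NR_pidfd_open, getpid(), 0);

    /* ENOSYS on kernels older than 5.3; other errors are treated the same. */
    if (fd < 0)
        return false;

    close(fd);
    return true;
#else
    /* Built against headers that do not know the syscall number. */
    return false;
#endif
}
```

A caller could check this once at startup and pick the busy-waiting routine whenever it returns false.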

Busy-waiting with waitpid(2)

The portable version, as previously noted, uses a busy loop which sleeps for short durations of time. A naive version of a routine which does this would look something like:

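#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* return true if process exited successfully, false in any other case */
bool
monitor_with_timeout(pid_t child_pid, int timeout_sec)
{
    int status;
    int ticks = 0;

    /* 10 ticks of 100ms each per second of timeout */
    while (ticks < timeout_sec * 10)
    {
        /* waitpid returns the child PID once the child has exited */
        if (waitpid(child_pid, &status, WNOHANG) == child_pid)
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;

        /* sleep 100ms */
        usleep(100000);
        ticks++;
    }

    /* timeout exceeded, kill the child process and error */
    kill(child_pid, SIGKILL);

    /* block until the killed child is reaped, so no zombie is left behind */
    waitpid(child_pid, &status, 0);
    return false;
}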

This approach, however, has some performance drawbacks. If the process has not already completed by the time monitoring begins, you will be delayed by at least 100ms. In the case of ifupdown-ng, almost all processes are very short-lived, so this is not a major issue. However, we can do better by tightening the event loop to 50ms per iteration. Another optimization is to split the sleep into two steps, with a very short sleep before the call to waitpid, giving the first call a better chance of reaping the completed process:

/* return true if process exited successfully, false in any other case */
bool
monitor_with_timeout(pid_t child_pid, int timeout_sec)
{
    int status;
    int ticks = 0;

    /* 20 ticks of 50ms each per second of timeout */
    while (ticks < timeout_sec * 20)
    {
        /* sleep 50usec to allow the child process to complete */
        usleep(50);

        /* waitpid returns the child PID once the child has exited */
        if (waitpid(child_pid, &status, WNOHANG) == child_pid)
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;

        /* sleep 49.95ms */
        usleep(49950);
        ticks++;
    }

    /* timeout exceeded, kill the child process and error */
    kill(child_pid, SIGKILL);

    /* block until the killed child is reaped, so no zombie is left behind */
    waitpid(child_pid, &status, 0);
    return false;
}

This works fairly well in practice: there is no performance regression on the ifupdown-ng test suite with this implementation.
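
As a rough illustration of how a caller might drive such a routine, the sketch below forks a child, executes a command, and hands the PID to the same 50ms polling loop. The run_with_timeout helper is a hypothetical name used only for this example, and the exit status 127 for a failed exec is merely the usual shell convention.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Polling loop as described above: true only if the child exited with 0. */
static bool
monitor_with_timeout(pid_t child_pid, int timeout_sec)
{
    int status;
    int ticks = 0;

    while (ticks < timeout_sec * 20)
    {
        usleep(50);

        if (waitpid(child_pid, &status, WNOHANG) == child_pid)
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;

        usleep(49950);
        ticks++;
    }

    /* timeout exceeded: kill the child and reap it */
    kill(child_pid, SIGKILL);
    waitpid(child_pid, &status, 0);
    return false;
}

/* Hypothetical helper: run argv[0] with its arguments, bounded by timeout_sec. */
bool
run_with_timeout(char *const argv[], int timeout_sec)
{
    pid_t child_pid = fork();

    if (child_pid < 0)
        return false;

    if (child_pid == 0)
    {
        /* in the child: replace the process image with the command */
        execvp(argv[0], argv);

        /* only reached if exec failed */
        _exit(127);
    }

    return monitor_with_timeout(child_pid, timeout_sec);
}
```

A command that exits with status 0 in time yields true. A non-zero exit, a failed exec, or a timeout all yield false.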

The self-pipe trick

Daniel J. Bernstein described the self-pipe trick in the early 90s. It allows process completion notifications to be delivered via a pollable file descriptor. It is portable to any POSIX-compliant system, and can be used with poll(2) or whatever event mechanism you wish to use. It works by installing a signal handler for SIGCHLD that writes to a descriptor obtained with pipe(2). The downside of this approach is that you have to write quite a bit of code, and you have to track which pipe FD is associated with which PID. It also wastes a file descriptor per process, since you hold a file descriptor for each side of the pipe.

Linux’s signalfd

What if we could turn delivery of signals into a pollable file descriptor? This is precisely what Linux’s signalfd does. The basic idea is to open a signalfd, associate SIGCHLD with it, and then call waitpid(2) when SIGCHLD is read from the signalfd. The downside of this approach is similar to that of the self-pipe trick: you have to keep global state in order to accomplish it, as there can only be a single SIGCHLD handler.

Process descriptors

FreeBSD introduced support for process descriptors in 2010 as part of the Capsicum framework. A process descriptor is an opaque handle to a specific process in the kernel. This is helpful as it avoids race conditions involving the recycling of PIDs. And since they are kernel handles, they can be waited on with kqueue like other kernel objects, by using EVFILT_PROCDESC.

There have been a few attempts to introduce process descriptors to Linux over the years. The attempt which finally succeeded was Christian Brauner’s pidfd API, completely landing in Linux 5.4, although parts of it were functional in prior releases. Like FreeBSD’s process descriptors, a pidfd is an opaque reference to a specific struct task_struct in the kernel, and is also pollable, making it quite suitable for notification monitoring.

A problem with using the pidfd API, however, is that it is not presently implemented in either glibc or musl, which means that applications will need to provide stub implementations of the API themselves for now. This issue with having to write our own stub aside, the solution is quite elegant:

#include <poll.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__linux__) && defined(__NR_pidfd_open)

/* stub for pidfd_open(2), which libc does not wrap yet */
static inline int
local_pidfd_open(pid_t pid, unsigned int flags)
{
    return syscall(__NR_pidfd_open, pid, flags);
}

/* return true if process exited successfully, false in any other case */
bool
monitor_with_timeout(pid_t child_pid, int timeout_sec)
{
    int status;
    int pidfd = local_pidfd_open(child_pid, 0);
    if (pidfd < 0)
        return false;

    /* a pidfd becomes readable once the process has exited */
    struct pollfd pfd = {
        .fd = pidfd,
        .events = POLLIN,
    };

    /* poll(2) returns the number of ready FDs, if it is less than
     * one, it means our process has timed out (or poll itself failed).
     */
    if (poll(&pfd, 1, timeout_sec * 1000) < 1)
    {
        close(pidfd);
        kill(child_pid, SIGKILL);

        /* block until the killed child is reaped */
        waitpid(child_pid, &status, 0);
        return false;
    }

    /* if poll did return a ready FD, process completed. */
    close(pidfd);
    if (waitpid(child_pid, &status, 0) != child_pid)
        return false;

    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

#endif

It will be interesting to see process supervisors (and other programs which perform short-lived supervision) adopt these new APIs. As for me, I will probably prepare patches to include pidfd_open and the other syscalls in musl as soon as possible.