actually, BSD kqueue is a mountain of technical debt

A side effect of the whole freenode kerfluffle is that I’ve been looking at IRCD again.  IRC, is of course a very weird and interesting place, and the smaller community of people who run IRCDs are largely weirder and even more interesting.

However, in that community of IRCD administrators there happens to be a few incorrect systems programming opinions that have been cargo culted around for years.  This particular blog is about one of these bikesheds, namely the kqueue vs epoll debate.

You’ve probably heard it before.  It goes something like this, “BSD is better for networking, because it has kqueue.  Linux has nothing like kqueue, epoll doesn’t come close.”  While I agree that epoll doesn’t come close, I think that’s actually a feature that has lead to a much more flexible and composable design.

In the beginning…

Originally, IRCD like most daemons used select for polling sockets for readiness, as this was the first polling API available on systems with BSD sockets.  The select syscall works by taking a set of three bitmaps, with each bit describing a file descriptor number: bit 1 refers to file descriptor 1 and so on.  The bitmaps are the read_set, write_set and err_set, which map to sockets that can be read, written to or have errors accordingly.  Due to design defects with the select syscalls, it can only support up to FD_SETSIZE file descriptors on most systems.  This can be mitigated by making fd_set an arbitrarily large bitmap and depending on fdmax to be the upper bound, which is what WinSock has traditionally done on Windows.

The select syscall clearly had some design deficits that negatively affected scalability, so AT&T introduced the poll syscall in System V UNIX.  The poll syscall takes an array of struct pollfd of user-specified length, and updates a bitmap of flags in each struct pollfd entry with the current status of each socket.  Then you iterate over the struct pollfd list.  This is naturally a lot more efficient than select, where you have to iterate over all file descriptors up to fdmax and test for membership in each of the three bitmaps to ascertain each socket’s status.

It can be argued that select was bounded by FD_SETSIZE (which is usually 1024 sockets), while poll begins to have serious scalability issues at around 10240 sockets.  These arbitrary benchmarks have been referred to as the C1K and C10K problems accordingly.  Dan Kegel has a very lengthy post on his website about his experiences mitigating the C10K problem in the context of running an FTP site.

Then there was kqueue…

In July 2000, Jonathan Lemon introduced kqueue into FreeBSD, which quickly propagated into the other BSD forks as well.  kqueue is a kernel-assisted event notification system using two syscalls: kqueue and kevent.  The kqueue syscall creates a handle in the kernel represented as a file descriptor, which a developer uses with kevent to add and remove event filters.  Event filters can match against file descriptors, processes, filesystem paths, timers, and so on.

This design allows for a single-threaded server to process hundreds of thousands of connections at once, because it can register all of the sockets it wishes to monitor with the kernel and then lazily iterate over the sockets as they have events.

Most IRCDs have supported kqueue for the past 15 to 20 years.

And then epoll…

In October 2002, Davide Libenzi got his epoll patch merged into Linux 2.5.44.  Like with kqueue, you use the epoll_create syscall to create a kernel handle which represents the set of descriptors to monitor.  You use the epoll_ctl syscall to add or remove descriptors from that set.  And finally, you use epoll_wait to wait for kernel events.

In general, the scalability aspects are the same to the application programmer: you have your sockets, you use epoll_ctl to add them to the kernel’s epoll handle, and then you wait for events, just like you would with kevent.

Like kqueue, most IRCDs have supported epoll for the past 15 years.

What is a file descriptor, anyway?

To understand the argument I am about to make, we need to talk about file descriptors.  UNIX uses the term file descriptor a lot, even when referring to things which are clearly not files, like network sockets.  Outside the UNIX world, a file descriptor is usually referred to as a kernel handle.  Indeed, in Windows, kernel-managed resources are given the HANDLE type, which makes this relationship more clear.  Essentially, a kernel handle is basically an opaque reference to an object in kernel space, and the astute reader may notice some similarities to the object-capability model as a result.

Now that we understand that file descriptors are actually just kernel handles, we can now talk about kqueue and epoll, and why epoll is actually the correct design.

The problem with event filters

The key difference between epoll and kqueue is that kqueue operates on the notion of event filters instead of kernel handles.  This means that any time you want kqueue to do something new, you have to add a new type of event filter.

FreeBSD presently has 10 different event filter types: EVFILT_READ, EVFILT_WRITE, EVFILT_EMPTY, EVFILT_AIO, EVFILT_VNODE, EVFILT_PROC, EVFILT_PROCDESC, EVFILT_SIGNAL, EVFILT_TIMER and EVFILT_USER.  Darwin has additional event filters concerning monitoring Mach ports.

Other than EVFILT_READ, EVFILT_WRITE and EVFILT_EMPTY, all of these different event filter types are related to entirely different concerns in the kernel: they don’t monitor kernel handles, but instead other specific subsystems than sockets.

This makes for a powerful API, but one which lacks composability.

epoll is better because it is composable

It is possible to do almost everything that kqueue can do on FreeBSD in Linux, but instead of having a single monolithic syscall to handle everything, Linux takes the approach of providing syscalls which allow almost anything to be represented as a kernel handle.

Since epoll strictly monitors kernel handles, you can register any kernel handle you have with it and get events back when its state changes.  As a comparison to Windows, this basically means that epoll is a kernel-accelerated form of WaitForMultipleObjects in the Win32 API.

You are probably wondering how this works, so here’s a table of commonly used kqueue event filters and the Linux syscall used to get a kernel handle for use with epoll.

BSD event filter Linux equivalent
EVFILT_READ, EVFILT_WRITE, EVFILT_EMPTY Pass the socket with EPOLLIN etc.
EVFILT_VNODE inotify
EVFILT_SIGNAL signalfd
EVFILT_TIMER timerfd
EVFILT_USER eventfd
EVFILT_PROC, EVFILT_PROCDESC pidfd, alternatively bind processes to a cgroup and monitor cgroup.events
EVFILT_AIO aiocb.aio_fildes (treat as socket)

Hopefully, as you can see, epoll can automatically monitor any kind of kernel resource without having to be modified, due to its composable design, which makes it superior to kqueue from the perspective of having less technical debt.

Interestingly, FreeBSD has added support for Linux’s eventfd recently, so it appears that they may take kqueue in this direction as well.  Between that and FreeBSD’s process descriptors, it seems likely.

3 thoughts on “actually, BSD kqueue is a mountain of technical debt”

  1. Spot on (though I may be biased, see disclaimer at bottom).

    Delving deeper into this, every file type in Linux uses the same f_op->poll
    callback for select(2), poll(2), epoll_wait(2), etc. So the implementor of a
    new “struct file” type only needs to supply the f_op->poll callback once.
    (Sorry, I’m not up-to-date with io_uring, yet)

    Epoll is pretty much self-contained within fs/eventpoll.c and it’s easy-reading.
    Linux VFS is highly orthogonal in this regard.

    More along those lines:

    kqueue(2) has a unique “close-on-fork” behavior that can’t be disabled.
    It relies on close-on-fork because it’s internally based on file descriptors
    (not file descriptions). AFAIK, no other filetype anywhere in *nix has
    close-on-fork behavior.

    n.b: don’t confuse “close-on-fork” with “close-on-exec” aka FD_CLOEXEC,
    Every file descriptor can have FD_CLOEXEC.

    epoll users need to close-on-fork manually since the Linux kernel doesn’t
    support close-on-fork (last I checked). I suspect this is a major reason why
    programmers used to kqueue get confused by epoll: A frequent complaint is
    dup/dup2 behavior with epoll, and I suspect some confusion is related to the
    lack of close-on-fork behavior on the epoll side.

    epoll’s dup/dup2 behavior is intentional to support different pointers/events
    for the same file description:

    https://lore.kernel.org/lkml/Pine.LNX.4.55.0307071802360.3531@bigblue.dev.mcafeelabs.com/

    Less-experienced programmers who don’t know the difference between file
    descriptors and file descriptions may be tripped up by this.

    epoll’s other weakness is the number of syscalls required for epoll_ctl(2) ops.
    Syscalls were cheap when epoll was designed (not anymore with CPU vulnerability
    mitigations). Again, the ops are low-level and composable:

    EPOLL_CTL_MOD is a significantly cheaper operation than EPOLL_CTL_ADD or
    EPOLL_CTL_DEL since MOD doesn’t lock the rbtree for writes. Having an API
    which makes MOD==ADD (like kevent does) would pessimize locking, especially
    when it involves changing multiple FDs with some needing ADD while others
    can be MOD.

    kqueue has close-on-fork, though, so that means it doesn’t need a tree/hashtable
    and can use a simple array to map FDs.

    io_uring ought to be able to mitigate the syscall cost nowadays, but I’ve
    yet to find time to dig into it.

    In conclusion:

    kqueue is unconventional internally, but it’s close-on-fork behavior seems to
    favor common/conventional cases at the expense of more clever things and
    orthogonality.

    Epoll’s design itself is more conventional; so perhaps a bit trickier to use.
    However, it supports less-conventional things just fine (e.g. epoll can
    distinguish POLLERR just like poll(2) and select(2) do. Last I checked
    kevent EVFILT_READ doesn’t know POLLERR from POLLIN).

    This composability and orthogonality in implementation is more true to
    the “worse-is-better” philosophy.

    IMHO, the worst part of epoll is supporting nested epoll properly
    (circular references, wakeup cascades, etc.)

    Disclaimer: I’ve contributed to epoll some, but I’ve also contributed to
    libkqueue (userspace kqueue emulation) a bit, too. I’ve used both directly, and
    often in twisted ways with multiple threads operating on the same ep/kq file
    description. That said, I had absolutely nothing to do with the initial
    implementation of either; I merely abuse the dark corners of them 🙂

Comments are closed.