actually, BSD kqueue is a mountain of technical debt

A side effect of the whole freenode kerfluffle is that I’ve been looking at IRCD again.  IRC, is of course a very weird and interesting place, and the smaller community of people who run IRCDs are largely weirder and even more interesting.

However, in that community of IRCD administrators there happens to be a few incorrect systems programming opinions that have been cargo culted around for years.  This particular blog is about one of these bikesheds, namely the kqueue vs epoll debate.

You’ve probably heard it before.  It goes something like this, “BSD is better for networking, because it has kqueue.  Linux has nothing like kqueue, epoll doesn’t come close."  While I agree that epoll doesn’t come close, I think that’s actually a feature that has lead to a much more flexible and composable design.

In the beginning…

Originally, IRCD like most daemons used select for polling sockets for readiness, as this was the first polling API available on systems with BSD sockets.  The select syscall works by taking a set of three bitmaps, with each bit describing a file descriptor number: bit 1 refers to file descriptor 1 and so on.  The bitmaps are the read_set, write_set and err_set, which map to sockets that can be read, written to or have errors accordingly.  Due to design defects with the select syscalls, it can only support up to FD_SETSIZE file descriptors on most systems.  This can be mitigated by making fd_set an arbitrarily large bitmap and depending on fdmax to be the upper bound, which is what WinSock has traditionally done on Windows.

The select syscall clearly had some design deficits that negatively affected scalability, so AT&T introduced the poll syscall in System V UNIX.  The poll syscall takes an array of struct pollfd of user-specified length, and updates a bitmap of flags in each struct pollfd entry with the current status of each socket.  Then you iterate over the struct pollfd list.  This is naturally a lot more efficient than select, where you have to iterate over all file descriptors up to fdmax and test for membership in each of the three bitmaps to ascertain each socket’s status.

It can be argued that select was bounded by FD_SETSIZE (which is usually 1024 sockets), while poll begins to have serious scalability issues at around 10240 sockets.  These arbitrary benchmarks have been referred to as the C1K and C10K problems accordingly.  Dan Kegel has a very lengthy post on his website about his experiences mitigating the C10K problem in the context of running an FTP site.

Then there was kqueue…

In July 2000, Jonathan Lemon introduced kqueue into FreeBSD, which quickly propagated into the other BSD forks as well.  kqueue is a kernel-assisted event notification system using two syscalls: kqueue and kevent.  The kqueue syscall creates a handle in the kernel represented as a file descriptor, which a developer uses with kevent to add and remove event filters.  Event filters can match against file descriptors, processes, filesystem paths, timers, and so on.

This design allows for a single-threaded server to process hundreds of thousands of connections at once, because it can register all of the sockets it wishes to monitor with the kernel and then lazily iterate over the sockets as they have events.

Most IRCDs have supported kqueue for the past 15 to 20 years.

And then epoll…

In October 2002, Davide Libenzi got his epoll patch merged into Linux 2.5.44.  Like with kqueue, you use the epoll_create syscall to create a kernel handle which represents the set of descriptors to monitor.  You use the epoll_ctl syscall to add or remove descriptors from that set.  And finally, you use epoll_wait to wait for kernel events.

In general, the scalability aspects are the same to the application programmer: you have your sockets, you use epoll_ctl to add them to the kernel’s epoll handle, and then you wait for events, just like you would with kevent.

Like kqueue, most IRCDs have supported epoll for the past 15 years.

What is a file descriptor, anyway?

To understand the argument I am about to make, we need to talk about file descriptors.  UNIX uses the term file descriptor a lot, even when referring to things which are clearly not files, like network sockets.  Outside the UNIX world, a file descriptor is usually referred to as a kernel handle.  Indeed, in Windows, kernel-managed resources are given the HANDLE type, which makes this relationship more clear.  Essentially, a kernel handle is basically an opaque reference to an object in kernel space, and the astute reader may notice some similarities to the object-capability model as a result.

Now that we understand that file descriptors are actually just kernel handles, we can now talk about kqueue and epoll, and why epoll is actually the correct design.

The problem with event filters

The key difference between epoll and kqueue is that kqueue operates on the notion of event filters instead of kernel handles.  This means that any time you want kqueue to do something new, you have to add a new type of event filter.

FreeBSD presently has 10 different event filter types: EVFILT_READ, EVFILT_WRITE, EVFILT_EMPTY, EVFILT_AIO, EVFILT_VNODE, EVFILT_PROC, EVFILT_PROCDESC, EVFILT_SIGNAL, EVFILT_TIMER and EVFILT_USER.  Darwin has additional event filters concerning monitoring Mach ports.

Other than EVFILT_READ, EVFILT_WRITE and EVFILT_EMPTY, all of these different event filter types are related to entirely different concerns in the kernel: they don’t monitor kernel handles, but instead other specific subsystems than sockets.

This makes for a powerful API, but one which lacks composability.

epoll is better because it is composable

It is possible to do almost everything that kqueue can do on FreeBSD in Linux, but instead of having a single monolithic syscall to handle everything, Linux takes the approach of providing syscalls which allow almost anything to be represented as a kernel handle.

Since epoll strictly monitors kernel handles, you can register any kernel handle you have with it and get events back when its state changes.  As a comparison to Windows, this basically means that epoll is a kernel-accelerated form of WaitForMultipleObjects in the Win32 API.

You are probably wondering how this works, so here’s a table of commonly used kqueue event filters and the Linux syscall used to get a kernel handle for use with epoll.

BSD event filter Linux equivalent
EVFILT_READ, EVFILT_WRITE, EVFILT_EMPTY Pass the socket with EPOLLIN etc.
EVFILT_VNODE inotify
EVFILT_SIGNAL signalfd
EVFILT_TIMER timerfd
EVFILT_USER eventfd
EVFILT_PROC, EVFILT_PROCDESC pidfd, alternatively bind processes to a cgroup and monitor cgroup.events
EVFILT_AIO aiocb.aio_fildes (treat as socket)

Hopefully, as you can see, epoll can automatically monitor any kind of kernel resource without having to be modified, due to its composable design, which makes it superior to kqueue from the perspective of having less technical debt.

Interestingly, FreeBSD has added support for Linux’s eventfd recently, so it appears that they may take kqueue in this direction as well.  Between that and FreeBSD’s process descriptors, it seems likely.