actually, BSD kqueue is a mountain of technical debt
A side effect of the whole freenode kerfluffle is that I’ve been looking at IRCD again. IRC, is of course a very weird and interesting place, and the smaller community of people who run IRCDs are largely weirder and even more interesting.
However, in that community of IRCD administrators there happens to be a few incorrect systems programming opinions that have been cargo culted around for years. This particular blog is about one of these bikesheds, namely the kqueue vs epoll debate.
You’ve probably heard it before. It goes something like this, “BSD is better for networking, because it has kqueue. Linux has nothing like kqueue, epoll doesn’t come close." While I agree that epoll doesn’t come close, I think that’s actually a feature that has lead to a much more flexible and composable design.
In the beginning…
Originally, IRCD like most daemons used select
for polling sockets for readiness, as this was the first polling API available on systems with BSD sockets. The select
syscall works by taking a set of three bitmaps, with each bit describing a file descriptor number: bit 1 refers to file descriptor 1 and so on. The bitmaps are the read_set
, write_set
and err_set
, which map to sockets that can be read, written to or have errors accordingly. Due to design defects with the select
syscalls, it can only support up to FD_SETSIZE
file descriptors on most systems. This can be mitigated by making fd_set
an arbitrarily large bitmap and depending on fdmax
to be the upper bound, which is what WinSock has traditionally done on Windows.
The select
syscall clearly had some design deficits that negatively affected scalability, so AT&T introduced the poll
syscall in System V UNIX. The poll
syscall takes an array of struct pollfd
of user-specified length, and updates a bitmap of flags in each struct pollfd
entry with the current status of each socket. Then you iterate over the struct pollfd
list. This is naturally a lot more efficient than select
, where you have to iterate over all file descriptors up to fdmax
and test for membership in each of the three bitmaps to ascertain each socket’s status.
It can be argued that select
was bounded by FD_SETSIZE
(which is usually 1024 sockets), while poll
begins to have serious scalability issues at around 10240
sockets. These arbitrary benchmarks have been referred to as the C1K and C10K problems accordingly. Dan Kegel has a very lengthy post on his website about his experiences mitigating the C10K problem in the context of running an FTP site.
Then there was kqueue…
In July 2000, Jonathan Lemon introduced kqueue into FreeBSD, which quickly propagated into the other BSD forks as well. kqueue is a kernel-assisted event notification system using two syscalls: kqueue
and kevent
. The kqueue
syscall creates a handle in the kernel represented as a file descriptor, which a developer uses with kevent
to add and remove event filters. Event filters can match against file descriptors, processes, filesystem paths, timers, and so on.
This design allows for a single-threaded server to process hundreds of thousands of connections at once, because it can register all of the sockets it wishes to monitor with the kernel and then lazily iterate over the sockets as they have events.
Most IRCDs have supported kqueue
for the past 15 to 20 years.
And then epoll…
In October 2002, Davide Libenzi got his epoll
patch merged into Linux 2.5.44. Like with kqueue, you use the epoll_create
syscall to create a kernel handle which represents the set of descriptors to monitor. You use the epoll_ctl
syscall to add or remove descriptors from that set. And finally, you use epoll_wait
to wait for kernel events.
In general, the scalability aspects are the same to the application programmer: you have your sockets, you use epoll_ctl
to add them to the kernel’s epoll
handle, and then you wait for events, just like you would with kevent
.
Like kqueue
, most IRCDs have supported epoll
for the past 15 years.
What is a file descriptor, anyway?
To understand the argument I am about to make, we need to talk about file descriptors. UNIX uses the term file descriptor a lot, even when referring to things which are clearly not files, like network sockets. Outside the UNIX world, a file descriptor is usually referred to as a kernel handle. Indeed, in Windows, kernel-managed resources are given the HANDLE
type, which makes this relationship more clear. Essentially, a kernel handle is basically an opaque reference to an object in kernel space, and the astute reader may notice some similarities to the object-capability model as a result.
Now that we understand that file descriptors are actually just kernel handles, we can now talk about kqueue
and epoll
, and why epoll
is actually the correct design.
The problem with event filters
The key difference between epoll
and kqueue
is that kqueue
operates on the notion of event filters instead of kernel handles. This means that any time you want kqueue
to do something new, you have to add a new type of event filter.
FreeBSD presently has 10 different event filter types: EVFILT_READ
, EVFILT_WRITE
, EVFILT_EMPTY
, EVFILT_AIO
, EVFILT_VNODE
, EVFILT_PROC
, EVFILT_PROCDESC
, EVFILT_SIGNAL
, EVFILT_TIMER
and EVFILT_USER
. Darwin has additional event filters concerning monitoring Mach ports.
Other than EVFILT_READ
, EVFILT_WRITE
and EVFILT_EMPTY
, all of these different event filter types are related to entirely different concerns in the kernel: they don’t monitor kernel handles, but instead other specific subsystems than sockets.
This makes for a powerful API, but one which lacks composability.
epoll
is better because it is composable
It is possible to do almost everything that kqueue
can do on FreeBSD in Linux, but instead of having a single monolithic syscall to handle everything, Linux takes the approach of providing syscalls which allow almost anything to be represented as a kernel handle.
Since epoll
strictly monitors kernel handles, you can register any kernel handle you have with it and get events back when its state changes. As a comparison to Windows, this basically means that epoll
is a kernel-accelerated form of WaitForMultipleObjects
in the Win32 API.
You are probably wondering how this works, so here’s a table of commonly used kqueue
event filters and the Linux syscall used to get a kernel handle for use with epoll
.
BSD event filter | Linux equivalent |
---|---|
EVFILT_READ , EVFILT_WRITE , EVFILT_EMPTY |
Pass the socket with EPOLLIN etc. |
EVFILT_VNODE |
inotify |
EVFILT_SIGNAL |
signalfd |
EVFILT_TIMER |
timerfd |
EVFILT_USER |
eventfd |
EVFILT_PROC , EVFILT_PROCDESC |
pidfd , alternatively bind processes to a cgroup and monitor cgroup.events |
EVFILT_AIO |
aiocb.aio_fildes (treat as socket) |
Hopefully, as you can see, epoll
can automatically monitor any kind of kernel resource without having to be modified, due to its composable design, which makes it superior to kqueue
from the perspective of having less technical debt.
Interestingly, FreeBSD has added support for Linux’s eventfd
recently, so it appears that they may take kqueue
in this direction as well. Between that and FreeBSD’s process descriptors, it seems likely.