A side effect of the whole freenode kerfluffle is that I’ve been looking at IRCD again. IRC, is of course a very weird and interesting place, and the smaller community of people who run IRCDs are largely weirder and even more interesting.
However, in that community of IRCD administrators there happens to be a few incorrect systems programming opinions that have been cargo culted around for years. This particular blog is about one of these bikesheds, namely the kqueue vs epoll debate.
You’ve probably heard it before. It goes something like this, “BSD is better for networking, because it has kqueue. Linux has nothing like kqueue, epoll doesn’t come close.” While I agree that epoll doesn’t come close, I think that’s actually a feature that has lead to a much more flexible and composable design.
In the beginning…
Originally, IRCD like most daemons used
select for polling sockets for readiness, as this was the first polling API available on systems with BSD sockets. The
select syscall works by taking a set of three bitmaps, with each bit describing a file descriptor number: bit 1 refers to file descriptor 1 and so on. The bitmaps are the
err_set, which map to sockets that can be read, written to or have errors accordingly. Due to design defects with the
select syscalls, it can only support up to
FD_SETSIZE file descriptors on most systems. This can be mitigated by making
fd_set an arbitrarily large bitmap and depending on
fdmax to be the upper bound, which is what WinSock has traditionally done on Windows.
select syscall clearly had some design deficits that negatively affected scalability, so AT&T introduced the
poll syscall in System V UNIX. The
poll syscall takes an array of
struct pollfd of user-specified length, and updates a bitmap of flags in each
struct pollfd entry with the current status of each socket. Then you iterate over the
struct pollfd list. This is naturally a lot more efficient than
select, where you have to iterate over all file descriptors up to
fdmax and test for membership in each of the three bitmaps to ascertain each socket’s status.
It can be argued that
select was bounded by
FD_SETSIZE (which is usually 1024 sockets), while
poll begins to have serious scalability issues at around
10240 sockets. These arbitrary benchmarks have been referred to as the C1K and C10K problems accordingly. Dan Kegel has a very lengthy post on his website about his experiences mitigating the C10K problem in the context of running an FTP site.
Then there was kqueue…
In July 2000, Jonathan Lemon introduced kqueue into FreeBSD, which quickly propagated into the other BSD forks as well. kqueue is a kernel-assisted event notification system using two syscalls:
kqueue syscall creates a handle in the kernel represented as a file descriptor, which a developer uses with
kevent to add and remove event filters. Event filters can match against file descriptors, processes, filesystem paths, timers, and so on.
This design allows for a single-threaded server to process hundreds of thousands of connections at once, because it can register all of the sockets it wishes to monitor with the kernel and then lazily iterate over the sockets as they have events.
Most IRCDs have supported
kqueue for the past 15 to 20 years.
And then epoll…
In October 2002, Davide Libenzi got his
epoll patch merged into Linux 2.5.44. Like with kqueue, you use the
epoll_create syscall to create a kernel handle which represents the set of descriptors to monitor. You use the
epoll_ctl syscall to add or remove descriptors from that set. And finally, you use
epoll_wait to wait for kernel events.
In general, the scalability aspects are the same to the application programmer: you have your sockets, you use
epoll_ctl to add them to the kernel’s
epoll handle, and then you wait for events, just like you would with
kqueue, most IRCDs have supported
epoll for the past 15 years.
What is a file descriptor, anyway?
To understand the argument I am about to make, we need to talk about file descriptors. UNIX uses the term file descriptor a lot, even when referring to things which are clearly not files, like network sockets. Outside the UNIX world, a file descriptor is usually referred to as a kernel handle. Indeed, in Windows, kernel-managed resources are given the
HANDLE type, which makes this relationship more clear. Essentially, a kernel handle is basically an opaque reference to an object in kernel space, and the astute reader may notice some similarities to the object-capability model as a result.
Now that we understand that file descriptors are actually just kernel handles, we can now talk about
epoll, and why
epoll is actually the correct design.
The problem with event filters
The key difference between
kqueue is that
kqueue operates on the notion of event filters instead of kernel handles. This means that any time you want
kqueue to do something new, you have to add a new type of event filter.
FreeBSD presently has 10 different event filter types:
EVFILT_USER. Darwin has additional event filters concerning monitoring Mach ports.
EVFILT_EMPTY, all of these different event filter types are related to entirely different concerns in the kernel: they don’t monitor kernel handles, but instead other specific subsystems than sockets.
This makes for a powerful API, but one which lacks composability.
epoll is better because it is composable
It is possible to do almost everything that
kqueue can do on FreeBSD in Linux, but instead of having a single monolithic syscall to handle everything, Linux takes the approach of providing syscalls which allow almost anything to be represented as a kernel handle.
epoll strictly monitors kernel handles, you can register any kernel handle you have with it and get events back when its state changes. As a comparison to Windows, this basically means that
epoll is a kernel-accelerated form of
WaitForMultipleObjects in the Win32 API.
You are probably wondering how this works, so here’s a table of commonly used
kqueue event filters and the Linux syscall used to get a kernel handle for use with
|BSD event filter||Linux equivalent|
||Pass the socket with
Hopefully, as you can see,
epoll can automatically monitor any kind of kernel resource without having to be modified, due to its composable design, which makes it superior to
kqueue from the perspective of having less technical debt.
Interestingly, FreeBSD has added support for Linux’s
eventfd recently, so it appears that they may take
kqueue in this direction as well. Between that and FreeBSD’s process descriptors, it seems likely.