it is correct to refer to GNU/Linux as GNU/Linux

You’ve probably seen the “I’d like to interject for a moment” quotation that is frequently attributed to Richard Stallman about how Linux should be referred to as GNU/Linux. While I disagree with that particular assertion, I do believe it is important to refer to GNU/Linux distributions as such, because GNU/Linux is a distinct operating system in the family of operating systems which use the Linux kernel, and it is technically correct to recognize this, especially since different Linux-based operating systems behave differently and have different advantages and disadvantages.

For example, besides GNU/Linux, there are the Alpine and OpenWrt ecosystems, and last but not least, Android. All of these operating systems exist outside the GNU/Linux space and differ significantly, both from GNU/Linux and from each other.

what is GNU/Linux?

I believe part of the problem which leads people to be confused about the alternative Linux ecosystems is the lack of a cogent GNU/Linux definition, in part because many GNU/Linux distributions try to downplay that they are, in fact, GNU/Linux distributions. This may be for commercial or marketing reasons, or it may be because they do not wish to be seen as associated with the FSF. Because of this, others, who are fans of the work of the FSF, tend to overreach and claim other Linux ecosystems as being part of the GNU/Linux ecosystem, which is equally harmful.

It is therefore important to provide a technically accurate definition of GNU/Linux that provides actual useful meaning to consumers, so that they can understand the differences between GNU/Linux-based operating systems and other Linux-based operating systems. To that end, I believe a reasonable definition of the GNU/Linux ecosystem to be distributions which:

  • use the GNU C Library (frequently referred to as glibc)
  • use the GNU coreutils package for their base UNIX commands (such as /bin/cat and so on).

From a technical perspective, an easy way to check if you are on a GNU/Linux system would be to attempt to run the /lib/libc.so.6 command. If you are running on a GNU/Linux system, this will print the glibc version that is installed. This technical definition of GNU/Linux also provides value, because some drivers and proprietary applications, such as the nVidia proprietary graphics driver, only support GNU/Linux systems.
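
If you prefer to make the same check from C, the sketch below (my own, purely illustrative) reports the glibc version via the gnu_get_libc_version() extension when built against glibc, and otherwise notes that some other libc is in use. Note that this detects the libc the binary was built against rather than probing the running system, and that the exact path of libc.so.6 varies between distributions.

/* glibc-check.c: a minimal sketch, not a definitive detection method.
   Build: cc -o glibc-check glibc-check.c */
#include <stdio.h>

#ifdef __GLIBC__
#include <gnu/libc-version.h>	/* gnu_get_libc_version() is a glibc extension */
#endif

int main(void)
{
#ifdef __GLIBC__
	printf("GNU/Linux detected: glibc %s\n", gnu_get_libc_version());
#else
	printf("no glibc at build time; this is probably not a GNU/Linux system\n");
#endif
	return 0;
}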

Given this rubric, we can easily test a few popular distributions and make some conclusions about their capabilities:

  • Debian-based Linux distributions, including Debian itself, and also Ubuntu and elementary, meet the above preconditions and are therefore GNU/Linux distributions.
  • Fedora and the other distributions published by Red Hat also meet the same criteria and are therefore GNU/Linux distributions.
  • Arch Linux also meets the above criteria, and therefore is also a GNU/Linux distribution. Indeed, the preferred distribution of the FSF, Parabola, describes itself as GNU/Linux and is derived from Arch.
  • Alpine does not use the GNU C library, and therefore is not a GNU/Linux distribution. Compatibility with GNU/Linux programs should not be assumed. More on that in a moment.
  • Similarly, OpenWrt is not a GNU/Linux distribution.
  • Android is also not a GNU/Linux distribution, nor is Replicant, despite the latter being sponsored by the FSF.

on compatibility between distros

Even between GNU/Linux distributions, compatibility is difficult. Different GNU/Linux distributions upgrade their components at different times, and due to dynamic linking, a program built against a specific set of components with a specific build configuration may or may not run successfully on another GNU/Linux system. Some amount of binary compatibility between GNU/Linux systems is possible, but only if you take care to account for these differences.

On top of this, there is no binary compatibility between Linux ecosystems at large. GNU/Linux binaries require the gcompat compatibility framework to run on Alpine, and it generally is not possible to run OpenWrt binaries on Alpine or vice versa. The situation is the same with Android: without a compatibility tool (such as Termux), it is not possible to run binaries from other ecosystems there.

Exacerbating the problem, developers also target specific APIs only available in their respective ecosystems:

  • systemd makes use of glibc-specific APIs, which are not part of POSIX (one such interface is shown in the sketch after this list)
  • Android makes use of bionic-specific APIs, which are not part of POSIX
  • Alpine and OpenWrt both make use of internal frameworks, and these differ between the two ecosystems (although there are active efforts to converge both ecosystems).
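
As a small illustration of that first point, here is a minimal sketch (my own, purely illustrative) using program_invocation_short_name, a glibc extension declared in errno.h that is not part of POSIX and is not provided by musl. I am not claiming this particular global is one that systemd uses, only that it is representative of the class: code relying on it builds against glibc and nothing else.

/* progname.c: relies on a glibc-only global.
   Build: cc -D_GNU_SOURCE -o progname progname.c
   (note: the feature-test macro goes on the command line, not in the source) */
#include <errno.h>	/* declares program_invocation_short_name on glibc */
#include <stdio.h>

int main(void)
{
	/* On musl this global does not exist, so the program fails to build. */
	printf("running as: %s\n", program_invocation_short_name);
	return 0;
}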

As a result, as a developer, it is important to note which ecosystems you are targeting, and it is important to refer to individual ecosystems, rather than saying “my program supports Linux.” There are dozens of ecosystems which make use of the Linux kernel, and it is unlikely that a program supports all of them, or that the author is even aware of them.

To conclude, it is both correct and important to refer to GNU/Linux distributions as GNU/Linux distributions. Likewise, it is important to realize that non-GNU/Linux distributions exist, and that they are not necessarily compatible with the GNU/Linux ecosystem for your application. Each ecosystem is distinct, with its own strengths and weaknesses.

the tragedy of gethostbyname

A frequent complaint expressed on a certain website about Alpine is related to deficiencies in the musl DNS resolver when querying large zones. In response, it is usually mentioned that applications which expect reliable DNS lookups should be using a dedicated DNS library for this task, not the getaddrinfo or gethostbyname APIs, but this is usually rebuffed by comments saying that these APIs are fine to use because they are allegedly reliable on GNU/Linux.

For a number of reasons, the assertion that DNS resolution via these APIs under glibc is more reliable is false, but to understand why, we must look at the history of why a libc is responsible for shipping these functions to begin with, and how these APIs evolved over the years. For instance, did you know that gethostbyname originally didn’t do DNS queries at all? And, the big question: why are these APIs blocking, when DNS is inherently an asynchronous protocol?

Before we get into this, it is important to again restate that if you are an application developer, and your application depends on reliable DNS performance, you must absolutely use a dedicated DNS resolver library designed for this task. There are many libraries available that are good for this purpose, such as c-ares, GNU adns, s6-dns and OpenBSD’s libasr. As should hopefully become obvious at the end of this article, the DNS clients included with libc are designed to provide basic functionality only, and there is no guarantee of portable behavior across client implementations.
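
To make the distinction concrete, here is a rough sketch of what an asynchronous lookup looks like with c-ares, one of the libraries mentioned above. It is illustrative only: error handling is minimal, the event loop is the simplest possible select() loop, and the c-ares documentation should be consulted for the currently recommended API.

/* async-lookup.c: a minimal c-ares sketch, not production code.
   Build (with c-ares installed): cc -o async-lookup async-lookup.c -lcares */
#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <ares.h>

/* Called by c-ares when the lookup finishes, from inside ares_process(). */
static void lookup_done(void *arg, int status, int timeouts, struct hostent *host)
{
	(void)arg; (void)timeouts;
	if (status != ARES_SUCCESS) {
		fprintf(stderr, "lookup failed: %s\n", ares_strerror(status));
		return;
	}
	char addr[INET6_ADDRSTRLEN];
	for (char **p = host->h_addr_list; *p != NULL; p++) {
		inet_ntop(host->h_addrtype, *p, addr, sizeof addr);
		printf("%s has address %s\n", host->h_name, addr);
	}
}

int main(void)
{
	ares_channel channel;

	ares_library_init(ARES_LIB_INIT_ALL);
	if (ares_init(&channel) != ARES_SUCCESS)
		return 1;

	/* The query is submitted here; the callback fires later. */
	ares_gethostbyname(channel, "example.com", AF_INET, lookup_done, NULL);

	/* Minimal event loop: wait on the resolver's sockets until it goes idle.
	   A real program would fold this into its existing event loop. */
	for (;;) {
		fd_set readers, writers;
		FD_ZERO(&readers);
		FD_ZERO(&writers);
		int nfds = ares_fds(channel, &readers, &writers);
		if (nfds == 0)
			break;
		struct timeval tv;
		struct timeval *tvp = ares_timeout(channel, NULL, &tv);
		select(nfds, &readers, &writers, NULL, tvp);
		ares_process(channel, &readers, &writers);
	}

	ares_destroy(channel);
	ares_library_cleanup();
	return 0;
}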

the introduction of gethostbyname

Where did gethostbyname come from, anyway? Most people believe this function came from BIND, the reference DNS implementation developed by the Berkeley CSRG. In reality, it was introduced to BSD in 1982, alongside the sethostent and gethostent APIs. I happen to have a copy of the 4.2BSD source code, so here is the implementation from 4.2BSD, which was released in 1983:

struct hostent *
gethostbyname(name)
	register char *name;
{
	register struct hostent *p;
	register char **cp;

	sethostent(0);
	while (p = gethostent()) {
		if (strcmp(p->h_name, name) == 0)
			break;
		for (cp = p->h_aliases; *cp != 0; cp++)
			if (strcmp(*cp, name) == 0)
				goto found;
	}
found:
	endhostent();
	return (p);
}

As you can see, the 4.2BSD implementation only checks the /etc/hosts file and nothing else. This also answers the question of why gethostbyname and its successor, getaddrinfo, do DNS queries in a blocking way: the original function was a simple blocking lookup in a local file, and nobody wanted to introduce an asynchronous replacement API for it.

the introduction of DNS to gethostbyname

DNS resolution was first added to gethostbyname in 1984. This version, which is too long to include here, also translated dotted-quad IPv4 addresses into a struct hostent. In essence, the 4.3BSD implementation does the following:

  1. If the requested hostname begins with a number, try to parse it as a dotted quad. If this fails, set h_errno to HOST_NOT_FOUND and bail. Yes, this means 4.3BSD would fail to resolve hostnames like 12-34-56-78.static.example.com.
  2. Attempt to do a DNS query using res_search. If the query was successful, return the first IP address found as the struct hostent.
  3. If the DNS query failed, fall back to the original /etc/hosts searching algorithm above, now called _gethtbyname and using strcasecmp instead of strcmp (for consistency with DNS).

A fixed version of this algorithm was also included with BIND’s libresolv as res_gethostbyname, and the res_search and related functions were imported into BSD libc from BIND.

standardization of gethostbyname in POSIX

The gethostbyname and getaddrinfo APIs were first standardized in the X/Open Networking Services Issue 4 specification (commonly referred to as XNS4), which itself was part of the X/Open Single UNIX Specification version 1 (commonly referred to as SUSv1), released in 1995. Of note, X/Open tried to deprecate gethostbyname in favor of getaddrinfo as part of the XNS5 specification, removing it entirely except for a mention in their specification for netdb.h.

Later, it returned as part of POSIX issue 6, released in 2004. That version says:

Note: In many cases it is implemented by the Domain Name System, as documented in RFC 1034, RFC 1035, and RFC 1886.

POSIX issue 6, IEEE 1003.1:2004.

Note the phrase “in many cases”: the standard does not actually require these functions to speak DNS at all. So what else could be answering these lookups, and do application developers need to care about it? Very simply, this is about the Name Service Switch, frequently referred to as NSS, which allows the gethostbyname function to have hot-pluggable implementations. The Name Service Switch was a feature introduced in Solaris, implemented to allow support for Sun’s NIS+ directory service.

As developers of other operating systems wanted to support software like Kerberos and LDAP, it was quickly reimplemented in other systems as well, such as GNU/Linux. These days, systems running systemd frequently use this feature in combination with a custom NSS module named nss-systemd to force the use of systemd-resolved as the DNS resolver, which behaves differently from the original BIND-derived DNS client that ships in most libc implementations.

An administrator can disable support for DNS lookups entirely, simply by editing the /etc/nsswitch.conf file and removing the dns module from the hosts database. Application developers who depend on reliable DNS service therefore need to care a lot about this: on systems with NSS, your application cannot depend on gethostbyname actually supporting DNS at all.

musl and DNS

Given the background above, it should be obvious by now that musl’s DNS client was written under the assumption that applications that have specific requirements for DNS would be using a specialized library for this purpose, as gethostbyname and getaddrinfo are not really suitable APIs, since their behavior is entirely implementation-defined and largely focused around blocking queries to a directory service.

Because of this, the DNS client was written to behave as simply as possible, but the use of DNS for bulk data distribution, as in DNSSEC, DKIM and other applications, has led to a desire to implement support for DNS over TCP as an extension to the musl DNS client.

In practice, this will fix the remaining complaints about the musl DNS client once it lands in a musl release, but application authors depending on reliable DNS performance should really use a dedicated DNS client library for that purpose: using APIs that were designed to simply parse /etc/hosts and had DNS support shoehorned into them will always deliver unreliable results.

how to refresh older stuffed animals

As many of my readers are likely aware, I have a large collection of stuffed animals, but my favorite one is the first generation Jellycat Bashful Bunny that I have had for the past 10 years or so. Recently I noticed that my bunny was starting to turn purple, likely from the purple stain that is applied to my hair, which bleeds onto anything when given the opportunity to do so. As Jellycat no longer makes the first generation bashfuls (they have been replaced with a second generation that uses a different fabric), I decided that my bunny needed to be refreshed, and as there is not really any good documentation on how to clean a high-end stuffed animal, I figured I would write a blog on it.

understanding what you’re dealing with

It is important to know what the stuffed animal is made of before coming up with a strategy to refresh it. If the stuffed animal has plastic pellets to help it sit right (which the Jellycat Bashfuls do), then you need to use lower temperatures to ensure the pellets don’t melt. If there are glued-on components (as is frequently the case with lower-end stuffed animals), forget about trying this and just buy a new one.

If the stuffed animal has vibrant colors, you should probably avoid using detergent, or, at the very least, use less detergent than you normally would. These vibrant colors are created by staining white fabric rather than dyeing it; in other words, the pigment sits on the surface of the fabric rather than being part of the fabric itself. As with plastic components, you should use lower temperatures too, as the pigment used in these stains tends to wash away if the water is warm enough (around 40 degrees celsius or so).

the washing process

Ultimately I decided to play it safe and wash my stuffed bunny with cold water, some fabric softener and a Tide pod. However, the spin cycle was quite concerning to me, as it spins quite fast and with a lot of force. To ensure that the bunny was not harmed by the spin cycle, I put him in a pillowcase and tied the end of it. Put the washing machine on the delicate program to ensure it spends as little time in the spin cycle as possible. Also, I would not recommend washing a stuffed animal with other laundry.

Come back after the program completes, roughly 30 minutes later, and put the stuffed animal in the dryer. You should remove the stuffed animal from the pillowcase at this time and dry both the animal and the pillowcase separately. Put the dryer on the delicate program as well, and be prepared to run it through multiple cycles. In the case of my bunny, it took a total of two 45-minute cycles to completely dry.

Once done, your stuffed animal should be back to its usual self, and with the tumble drying, it will likely be a little bit fuzzier than it was before, kind of like it came from the factory.

Bonus content: 1 minute of a tumbling bunny.

JSON-LD is ideal for Cloud Native technologies

Frequently I have been told by developers that it is impossible to have extensible JSON documents underpinning their projects, because there may be collisions later. For those of us who are unaware of more capable graph serializations such as JSON-LD and Turtle, this seems like a reasonable position. Accordingly, I would like to introduce you all to JSON-LD, using a practical real-world deployment as an example, as well as how one might use JSON-LD to extend something like OCI container manifests.

You might feel compelled to look up JSON-LD on Google before reading further. My suggestion is not to do that, because the JSON-LD website is really aimed at web developers, and this explanation will hopefully show how a systems engineer can make use of JSON-LD graphs in practical terms. And, if it doesn’t, feel free to DM me on Twitter or something.

what JSON-LD can do for you

Have you ever wanted any of the following in the scenarios where you use JSON:

  • Conflict-free extensibility
  • Strong typing
  • Compatibility with the RDF ecosystem (e.g. XQuery, SPARQL, etc)
  • Self-describing schemas
  • Transparent document inclusion

If you answered yes to any of these, then JSON-LD is for you. Some of these capabilities are also provided by the IETF’s JSON Schema project, but it has a much higher learning curve than JSON-LD.

This post will be primarily focused on how namespaces and aliases can be used to provide extensibility while also providing backwards compatibility for clients that are not JSON-LD aware. In general, I believe strongly that any open standard built on JSON should actually be built on JSON-LD, and hopefully my examples will demonstrate why I believe this.

ActivityPub: a real-world case study

ActivityPub is a protocol that is used on the federated social web (thankfully entirely unrelated to Web3), that is built on the ActivityStreams 2.0 specification. Both ActivityPub and ActivityStreams are RDF vocabularies that are represented as JSON-LD documents, but you don’t really need to know or care about this part.

This is a very simplified representation of an ActivityPub actor object:

{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    {
      "alsoKnownAs": {
        "@id": "as:alsoKnownAs",
        "@type": "@id"
      },
      "sec": "https://w3id.org/security#",
      "owner": {
        "@id": "sec:owner",
        "@type": "@id"
      },
      "publicKey": {
        "@id": "sec:publicKey",
        "@type": "@id"
      },
      "publicKeyPem": "sec:publicKeyPem",
    }
  ],
  "alsoKnownAs": "https://corp.example.org/~alice",
  "id": "https://www.example.com/~alice",
  "inbox": "https://www.example.com/~alice/inbox",
  "name": "Alice",
  "type": "Person",
  "publicKey": {
    "id": "https://www.example.com/~alice#key",
    "owner": "https://www.example.com/~alice",
    "publicKeyPem": "..."
  }
}

Pay attention to the @context variable here; it is doing a few things:

  1. It pulls in the entire ActivityStreams and ActivityPub vocabularies by reference. These can be downloaded on the fly or bundled with the application using context preloading.
  2. It then defines a few terms outside of those vocabularies: alsoKnownAs, sec, owner, publicKey and publicKeyPem.

When an application that is JSON-LD aware parses this document, it will receive a document that looks like this:

{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    {
      "alsoKnownAs": {
        "@id": "as:alsoKnownAs",
        "@type": "@id"
      },
      "sec": "https://w3id.org/security#",
      "owner": {
        "@id": "sec:owner",
        "@type": "@id"
      },
      "publicKey": {
        "@id": "sec:publicKey",
        "@type": "@id"
      },
      "publicKeyPem": "sec:publicKeyPem",
    }
  ],
  "@id": "https://www.example.com/~alice",
  "@type": "Person",
  "as:alsoKnownAs": "https://corp.example.org/~alice",
  "as:inbox": "https://www.example.com/~alice/inbox",
  "as:name": "Alice",
  "sec:publicKey": {
    "@id": "https://www.example.com/~alice#key",
    "sec:owner": "https://www.example.com/~alice",
    "sec:publicKeyPem": "..."
  }
}

This allows extensions to interoperate with minimal conflicts, as the application operates on a normalized version of the document in which as many things as possible are namespaced, without the user having to worry about it. It also allows a parser to easily ignore things it does not know about: anything not defined in the context (which does not actually have to be embedded in the document, since a root context can be preloaded) is not placed in a namespace.

In other words, that @context variable can be built into the application, or stored in an S3 bucket somewhere, or whatever you want to do. If you are planning to have an interoperable protocol, however, providing a useful @context is crucial.

How OCI image manifests could benefit from JSON-LD

There was a discussion on Twitter this evening about how extending the OCI image spec with signature references has taken a year. If OCI used JSON-LD (ironically, its JSON vocabulary is already similar to several pre-existing JSON-LD ones), then implementations could just store the pre-existing metadata, mapped to a namespace. In the case of an OCI image, this might look something like:

{
  "@context": [
    "https://opencontainers.org/ns",
    {
      "sigstore": "https://sigstore.dev/ns",
      "reference": {
        "@type": "@id",
        "@id": "sigstore:reference"
      }
    }
  ],
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:d539cd357acb4a6df2a4ef99db5fe70714458349232dad0ec73e1ed65f6a0e13",
    "size": 585
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:59bf1c3509f33515622619af21ed55bbe26d24913cedbca106468a5fb37a50c3",
      "size": 2818413
    },
    {
      "mediaType": "application/vnd.example.signature+json",
      "size": 3514,
      "digest": "sha256:19387f68117dbe07daeef0d99e018f7bbf7a660158d24949ea47bc12a3e4ba17",
      "reference": {
        "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
        "digest": "sha256:59bf1c3509f33515622619af21ed55bbe26d24913cedbca106468a5fb37a50c3",
        "size": 2818413
      }
    }
  ]
}

The differences are minimal from a current OCI image manifest. Namely, schemaVersion has been deleted, because JSON-LD handles this detail automatically, and the signature reference extension has been added as the sigstore:reference property. Hopefully you can imagine how the rest of the document looks namespace wise.

One last thing about this example. You might notice that I am using URIs when I define namespaces in the @context. This is a great feature of the RDF ecosystem: you can put up a webpage at those URIs defining how to make use of the terms defined in the namespace, meaning that JSON-LD tooling can have rich documentation built in.

Also, since I am well aware that basically all of these OCI tools are written in Go, it should be noted that Go has an excellent implementation of JSON-LD, and for those concerned that W3C proposals are sometimes not in touch with reality, the creator of JSON-LD has some interesting words about that. Now, please, use JSON-LD and stop worrying about extensibility in open technology; this problem is totally solved.

how I wound up causing a major outage of my services and destroying my home directory by accident

As a result of my FOSS maintenance and activism work, I have a significant IT footprint, to support the services and development environments needed to facilitate everything I do. Unfortunately, I am also my own system administrator, and I am quite terrible at this. This is a story about how I wound up knocking most of my services offline and wiping out my home directory, because of a combination of Linux mdraid bugs and a faulty SSD. Hopefully this will be helpful to somebody in the future, but if not, you can at least share in some catharsis.

A brief overview of the setup

As noted, I have a cluster of multiple servers, ranging from AMD EPYC machines to ARM machines to a significant interest in a System z mainframe which I talked about at AlpineConf last year. These are used to host various services in virtual machine and container form, with the majority of the containers being managed by kubernetes in the current iteration of my setup. Most of these workloads are backed by an Isilon NAS, but some workloads run on local storage instead, typically for performance reasons.

Using kubernetes seemed like a no-brainer at the time, because it would allow me to have a unified control plane for all of my workloads, regardless of where (and on what architecture) they would be running. Since then, I’ve realized that the complexity of managing my services with kubernetes was not justified by the benefits it provided, and so I started migrating back to a traditional way of managing systems and containers; many services, however, are still managed as kubernetes containers.

A Samsung SSD failure on the primary development server

My primary development server is named treefort. It is an x86 box with AMD EPYC processors and 256 GB of RAM. It had a 3-way RAID-1 setup using Linux mdraid on 4TB Samsung 860 EVO SSDs. I use KVM with libvirt to manage various VMs on this server, but most of the server’s resources are dedicated to the treefort environment. This environment also acts as a kubernetes worker, and is also the kubernetes controller for the entire cluster.

Recently I had a stick of RAM fail on treefort. I ordered a replacement stick and had a friend replace it. All seemed well, but then I decided to improve my monitoring so that I could be alerted to any future hardware failures, as having random things crash on the machine due to uncorrected ECC errors is not fun. In the process of implementing this monitoring, I learned that one of the SSDs had fallen out of the RAID.

I thought it was a little weird that one drive out of the three had failed, so I assumed it was just fallout from the maintenance; perhaps the drive had been reseated when the RAM stick was replaced. As the price of a replacement 4TB Samsung SSD is presently around $700 retail, I thought I would re-add the drive to the array, assuming it would fail out of the array again during the rebuild if it had actually died.

# mdadm --manage /dev/md2 --add /dev/sdb3
mdadm: added /dev/sdb3

I then checked /proc/mdstat and it reported the array as healthy. I thought nothing of it, though in retrospect I should have found this suspicious: there was no indication that the array was in a recovery state; instead it was reported as healthy, with three drives present. Unfortunately, I figured “ok, I guess it’s fine” and left it at that.

Silent data corruption

Meanwhile, the filesystem in the treefort environment, which is backed by local SSD storage for speed reasons, began to silently corrupt itself. Because most of my services, such as my mail server, DNS and network monitoring, run on other hosts, there wasn’t really any indicator that anything was wrong. Things seemed to be basically working fine: I had been compiling kernels all week long as I tested various mitigations for the execve(2) issue. What I didn’t know at the time was that each kernel compile was corrupting the disk a little more.

I did not become aware of the data corruption until today, when I logged into the treefort environment and fired up nano to finish some work that needed to be resolved this week. That led to a rude surprise:

treefort:~$ nano
Segmentation fault

This worried me: why would nano crash when it was working yesterday and nothing had changed? So, I used apk fix to reinstall nano, which made it work again. At this point I was quite suspicious that something was up with the server, so I immediately killed all the guests running on it and focused on the bare-metal host environment (what we would call the dom0 if we were still using Xen).

I ran e2fsck -f on the treefort volumes and hoped for the best. Instead of a clean bill of health, I got lots of filesystem errors. But this still didn’t make any sense to me: I checked the array again, and it was still showing as fully healthy. Accordingly, I decided to run e2fsck -fy on the volumes and let it repair whatever it could. This took out the majority of the volume storing my home directory.

The loss of the kubernetes controller

Kubernetes is a fickle beast: it assumes you have set everything up with redundancy, including, of course, redundant controllers. I found this out the hard way when I took treefort offline and the worker nodes got confused and took the services they were running offline as well, presumably because they were unable to talk to the controller.

Eventually, with some help from friends, I was able to recover enough of the volume to allow the system to boot enough to get the controller back up and running enough to restore the services on the workers that were not treefort, but much like the data in my home directory, the services that were running on treefort are likely permanently lost.

Some thoughts

First of all, it is obvious I need to improve my backup strategy from something other than “I’ll figure it out later”. I plan on packaging Rich Felker’s bakelite tool to do just that.

The other big elephant in the room, of course, is “why weren’t you using ZFS in the first place”. While it is true that Alpine has supported ZFS for years, I’ve been hesitant to use it due to the CDDL licensing. In other words, I chose the mantra about GPL compatibility that was instilled in me back when I was using GNU/Linux over pragmatism, and my prize for that decision was this mess. While I think Oracle and the Illumos and OpenZFS contributors should come together to relicense the ZFS codebase under MPLv2 to solve the GPL compatibility problem, I am starting to think that I should care more about having a storage technology I can actually trust.

I’m also quite certain that the issue I hit is a bug in mdraid, but perhaps I am wrong. I am told that there is a dirty bitmap system and perhaps if all bitmaps are marked clean on both the good pair of drives and the bad drive, it can cause this kind of split-brain issue, but I feel like there should be timestamping on those bitmaps to prevent something like this. It’s better to have an unnecessary rebuild because of clock skew than to go split brain and have 33% of all reads causing silent data corruption due to being out of sync with the other disks.

Nonetheless, my plans are to rebuild treefort with ZFS and SSDs from another vendor. Whatever happened with the Samsung SSDs has made me anxious enough that I don’t want to trust them for continued production use.

CVE-2021-4034

A few days ago, Qualys dropped CVE-2021-4034, which they have called “Pwnkit”. While Alpine itself was not directly vulnerable to this issue due to different engineering decisions made in the way musl and glibc handle SUID binaries, this is intended to be a deeper look into what went wrong to enable successful exploitation on GNU/Linux systems.

a note on blaming systemd

Before we get into this, I have seen a lot of people on Twitter blaming systemd for this vulnerability. It should be clarified that systemd has basically nothing to do with polkit, and has nothing at all to do with this vulnerability; systemd and polkit are separate projects largely maintained by different people.

We should try to be empathetic toward software maintainers, including those from systemd and polkit, so writing inflammatory posts blaming systemd or its maintainers for polkit does not really help to fix the problems that made this a useful security vulnerability.

the theory behind exploiting CVE-2021-4034

For an idea of how one might exploit CVE-2021-4034, let’s look at blasty’s “blasty vs pkexec” exploit. Take a look at the code for a few minutes, and come back here. There are multiple components to this exploit that all have to come together to make it work. A friend on IRC described it as a “rube goldberg machine” when I outlined it to him.

The first component of the exploit is the creation of a GNU iconv plugin: this is used to convert data from one character set to another. The plugin itself is the final step in the pipeline, and is used to gain the root shell.

The second component of the exploit is using execve(2) to arrange for pkexec to be run in a scenario where argc < 1. Although some POSIX rules lawyers will argue that this is a valid execution state, because the POSIX specification only says that argv[0] should be the name of the program being run, I argue that it is really a nonsensical execution state under UNIX, and that defensive programming against this scenario is ridiculous, which is why I sent a patch to the Linux kernel to remove the ability to do this.
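
To make that second component concrete, here is a minimal sketch of how a program can be started with an empty argument vector on kernels that still allow it. The ./victim path is a placeholder for whatever target you want to observe, not part of the actual exploit.

/* run-empty-argv.c: a sketch of starting a program with argc == 0. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char *argv[] = { NULL };		/* no argv[0] at all */
	char *envp[] = { "lol", NULL };		/* an environment entry with no '=' */

	/* Newer kernels no longer pass an empty argv through unchanged. */
	execve("./victim", argv, envp);
	perror("execve");
	return 1;
}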

The third component of the exploit is the use of GLib by pkexec. GLib is a commonly used C development framework, and it contains a lot of helpful infrastructure for developers, but that framework comes at the cost of a large attack surface, which is undesirable for an SUID binary.

The final component of the exploit is the design decision of the GLIBC authors to attempt to sanitize the environment of SUID programs, rather than simply ignoring known-harmful environment variables when running as an SUID program. In essence, Qualys figured out a way to bypass the sanitization entirely. When these things combine, we are able to use pkexec to pop a root shell, as I will demonstrate.

how things went wrong

Now that we have an understanding of what components are involved in the exploit, we can take a look at what happens from beginning to end. We have our helper plugin, which launches the root shell, and we have an understanding of the underlying configuration and its design flaws. How does all of this come together?

The exploit itself does not happen in blasty-vs-pkexec.c; that file just sets up the necessary preconditions for everything else to fall into place, and then runs pkexec. But it runs pkexec in a way that basically results in an execution state that could be described as a weird machine: it uses execve(2) to launch it in an execution state where there are no arguments provided, not even an argv[0].

Because pkexec is running in this weird state that it was never designed to run in, it executes as normal, except that we wind up in a situation where argv[1] is actually the beginning of the program’s environment. The first value in the environment is lol, which is a valid argument, but not a valid environment variable, since it is missing a value. If we run pkexec lol in a terminal, we get:

[kaniini@localhost ~]$ pkexec lol
Cannot run program lol: No such file or directory

The reason why this is interesting is that this message is actually generated by g_log(), and that’s where the fun begins. While initializing the GLib logging subsystem, there is a code path where g_utf8_validate() gets called on argv[0]. When running as a weird machine, this validation fails, because argv[0] is NULL. This results in GLib trying to convert argv[0] to UTF-8, which uses iconv, a libc function.

On GLIBC, the iconv function is implemented by the gconv framework, which supports loading plugins to add additional character sets from a directory specified by the GCONV_PATH environment variable. Normally, GCONV_PATH is removed from an SUID program’s environment because GLIBC sanitizes the environment of SUID programs, but Qualys figured out a way to glitch the sanitization, and so GCONV_PATH remains in the environment. As a result, we get a root shell as soon as it tries to convert argv[0] to UTF-8.

where do we go from here?

On Alpine and other musl-based systems, the iconv implementation built into musl does not load plugins at all, so we are not vulnerable to blasty’s PoC, and musl also makes a more robust decision: instead of trying to sanitize the environment of SUID programs, it simply ignores variables which would lead to loading additional code, such as LD_PRELOAD, entirely when running in SUID mode.

This means that ultimately three things need to be fixed: pkexec itself should be fixed (which has already been done) to close the vulnerability on older kernels; the kernel itself should be fixed to disallow this weird execution state (which my patch does); and GLIBC should be fixed to ignore dangerous environment variables instead of trying to sanitize them.

the FSF’s relationship with firmware is harmful to free software users

The FSF has an unfortunate relationship with firmware, resulting in policies that made sense in the late 1980s, but actively harm users today, through recommending obsolescent equipment, requiring increased complexity in RYF-certified hardware designs and discouraging both good security practices and the creation of free replacement firmware. As a result of these policies, deficient hardware often winds up in the hands of those who need software freedom the most, in the name of RYF-certification.

the FSF and microcode

The normal Linux kernel is not recommended by the FSF, because it allows for the use of proprietary firmware with devices. Instead, they recommend Linux-libre, which disables support for proprietary firmware by ripping out code which allows for the firmware to be loaded on to devices. Libreboot, being FSF-recommended, also has this policy of disallowing firmware blobs in the source tree, despite it being a source of nothing but problems.

The end result is that users who deploy the FSF-recommended firmware and kernel wind up with varying degrees of broken configurations. Worse yet, the Linux-libre project removes warning messages which suggest a user may want to update their processor microcode to avoid Meltdown and Spectre security vulnerabilities.

While it is true that processor microcode is a proprietary blob, from a security and reliability point of view, there are two types of CPU: you can have a broken CPU, or a less broken CPU, and microcode updates are intended to give you a less broken CPU. This is particularly important because microcode updates fix real problems in the CPU, and Libreboot has patches which hack around problems caused by deficient microcode burned into the CPU at manufacturing time, since it’s not allowed to update the microcode at early boot time.

There is also a common misconception about the capabilities of processor microcode. Over the years, I have talked with numerous software freedom advocates about the microcode issue, and many of them believe that microcode is capable of reprogramming the processor as if it were an FPGA or something. In reality, the microcode is a series of hot patches to the instruction decode logic, which is largely part of a fixed function execution pipeline. In other words, you can’t microcode update a CPU to add or substantially change capabilities.

By discouraging (or, in the case of Linux-libre, outright preventing) end users from exercising their freedom (a key tenet of software freedom being that the user has agency to do whatever she wants with her computer) to update their processor microcode, the FSF pursues a policy which leaves users at risk for vulnerabilities such as Meltdown and Spectre, which were partially mitigated through a microcode update.

Purism’s Librem 5: a case study

The FSF “Respects Your Freedom” certification has a loophole so large you could drive a truck through it, called the “secondary processor exception”. This exists because the FSF knows that, generally speaking, entirely libre devices with the capabilities people want do not presently exist. Purism used this loophole to sell a phone that had proprietary software blobs while passing it off as entirely free. The relevant text of the exception that allowed them to do this was:

However, there is one exception for secondary embedded processors. The exception applies to software delivered inside auxiliary and low-level processors and FPGAs, within which software installation is not intended after the user obtains the product. This can include, for instance, microcode inside a processor, firmware built into an I/O device, or the gate pattern of an FPGA. The software in such secondary processors does not count as product software.

Purism was able to accomplish this by making the Librem 5 have not one, but two processors: when the phone first boots, it uses a secondary CPU as a service processor, which loads all of the relevant blobs (such as those required to initialize the DDR4 memory) before starting the main CPU and shutting itself off. In this way, they could have all the blobs they needed to use, without having to worry about them being user visible from PureOS. Under the policy, that left them free and clear for certification.

The problem of course is that by hiding these blobs in the service processor, users are largely unaware of their existence, and are unable to leverage their freedom to study, reverse engineer and replace these blobs with libre firmware, a remedy that would typically be made available to them as part of the four freedoms.

This means that users of the Librem 5 phone are objectively harmed in three ways: first, they are unaware of the existence of the blobs to begin with; second, they do not have the ability to study the blobs; and third, they do not have the ability to replace the blobs. By pursuing RYF certification, Purism released a device that is objectively worse for the practical freedom of their customers.

The irony, of course, is that Purism failed to gain certification at the end of this effort, creating a device that harmed consumer freedoms, with increased complexity, just to attempt to satisfy the requirements of a certification program they ultimately failed to gain certification from.

The Novena laptop: a second case study

In 2012, Andrew “bunnie” Huang began a project to create a laptop with the most free components he could find, called the Novena open laptop. It was based on the Freescale (now NXP) i.MX 6 CPU, which has an integrated Vivante GPU and WiFi radio. Every single component in the design had data sheets freely available, and the schematic itself was published under a free license.

But because the SoC used required blobs to boot the GPU and WiFi functionality, the FSF required that these components be mechanically disabled in the product in order to receive certification, despite an ongoing effort to write replacement firmware for both components. This replacement firmware was eventually released, and people are using these chips with that free firmware today.

Had bunnie chosen to comply with the RYF certification requirements, customers which purchased the Novena laptop would have been unable to use the integrated GPU and WiFi functionality, as it was physically disabled on the board, despite the availability of free replacement firmware for those components. Thankfully, bunnie chose not to move forward on RYF certification, and thus the Novena laptop can be used with GPU acceleration and WiFi.

the hardware which remains

In practice, it is difficult to get anything much more freedom-respecting than the Novena laptop. From a right-to-repair perspective, the Framework laptop is very good, but it still uses proprietary firmware. It is, however, built on a modern x86 CPU, and could be a reasonable target for corebooting, especially now that the embedded controller firmware’s source code has been released under a free license.

However, because of the Intel ME, the Framework laptop will rightly never be RYF-certified. Instead, the FSF promotes buying old thinkpads from 2009 with Libreboot pre-installed. This is a total disservice to users, as a computer from 2009 is totally obsolete now, and as discussed above, Intel CPUs tend to be rather broken without their microcode updates.

My advice is to ignore the RYF certification program, as it is actively harmful to the practical adoption of free software, and just buy whatever you can afford that will run a free OS well. At this point, total blob-free computing is a fool’s errand, so there are a lot of AMD Ryzen-based machines that will give you decent performance and GPU acceleration without the need for proprietary drivers. Vendors which use coreboot for their systems and open the source code for their embedded controllers should be at the front of the line. But the FSF will never suggest this as an option, because they have chosen unattainable ideological purity over the pragmatism of recommending what the market can actually provide.

delegation of authority from the systems programming perspective

As I have been griping on Twitter lately about how I dislike the design of modern UNIX operating systems, an interesting conversation about object capabilities came up with the author of musl-libc. This conversation caused me to realize that systems programmers don’t really have an understanding of object capabilities, and how they can be used to achieve environments that are aligned with the principle of least authority.

In general, I think this is largely because we’ve failed to effectively disseminate the research output in this area to the software engineering community at large — for various reasons, people complete their distributed systems degrees and go to work in decentralized finance, as unfortunately, Coinbase pays better. An unfortunate reality is that the security properties guaranteed by Web3 platforms are built around object capabilities, by necessity – the output of a transaction, which then gets consumed for another transaction, is a form of object capability. And while Web3 is largely a planet-incinerating Ponzi scheme run by grifters, object capabilities are a useful concept for building practical security into real-world systems.

Most literature on this topic tries to describe these concepts in the framing of, say, driving a car: by default, nobody has permission to drive a given car, so it is compliant with the principle of least authority; meanwhile, the car’s key can interface with the ignition and allow the car to be driven. In this example, the car’s key is an object capability: it is an opaque object that can be used to acquire the right to drive the car. Afterwards, they usually go on to describe the various aspects of their system without actually discussing why anybody would want this.

the principle of least authority

The principle of least authority is hopefully obvious: it is the idea that a process should only have the rights that are necessary for the process to complete its work. In other words, the calculator app shouldn’t have the right to turn on your camera or microphone, or snoop around in your photos. In addition, there is an expectation that the user should be able to express consent and make a conscious decision to grant rights to programs as they request them, but this isn’t necessarily required: a user can delegate her consent to an agent to make those decisions on her behalf.

In practice, modern web browsers implement the principle of least authority. It is also implemented on some desktop computers, such as those running recent versions of macOS, and it is also implemented on iOS devices. Android has also made an attempt to implement security primitives that are aligned with the principle of least authority, but it has various design flaws that mean that apps have more authority in practice than they should.

the object-capability model

The object-capability model refers to the use of object capabilities to enforce the principle of least authority: processes are spawned with a minimal set of capabilities, and then are granted additional rights through a mediating service. These rights are given to the requestor as an opaque object, which it references when it chooses to exercise those rights, in a process sometimes referred to as capability invocation.

Because the object is opaque, it can be represented in many different ways: as a digital signature (as in the various blockchain platforms), or simply as a reference to a kernel handle (such as Capsicum’s capability descriptors). Similarly, invocation can happen directly or indirectly. For example, a mediation service which responds to a request to turn on the microphone with a sealed file descriptor over SCM_RIGHTS would still be considered an object-capability model, as it is returning the capability to listen to the microphone by way of providing a file descriptor that can be used to read the PCM audio data from the microphone.
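
To make the SCM_RIGHTS example concrete, here is a minimal sketch of the granting side: a mediation service that has approved a request hands the client an already-open file descriptor over a UNIX domain socket. The function name is hypothetical; the cmsg plumbing is the standard pattern documented in cmsg(3).

/* send_capability.c: a sketch of granting a capability by passing
   an open file descriptor over a UNIX domain socket. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd to the peer connected on sock; returns 0 on success. */
int send_capability(int sock, int fd)
{
	char dummy = 'x';			/* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

	union {					/* properly aligned buffer for one fd */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	memset(&u, 0, sizeof u);

	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof u.buf,
	};

	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;		/* this ancillary data carries file descriptors */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}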

Some examples of real-world object capability models include: the Xen hypervisor’s xenbus and event-channels, Sandbox.framework in macOS (derived from FreeBSD’s Capsicum), Laurent Bercot’s s6-sudod, the xdg-portal specification from freedesktop.org and privilege separation as originally implemented in OpenBSD.

capability forfeiture, e.g. OpenBSD pledge(2)/unveil(2)

OpenBSD 5.9 introduced the pledge(2) system call, which allows a process to voluntarily forfeit a set of capabilities, by restricting itself to pre-defined groups of syscalls. In addition, OpenBSD 6.4 introduced the unveil(2) syscall, which allows a process to voluntarily forfeit the ability to perform filesystem I/O except to a pre-defined set of paths. In the object capability vocabulary, this is referred to as forfeiture.
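
A minimal sketch of what this looks like in practice on OpenBSD, assuming a program that only needs stdio and read-only access to /etc/ssl; everything else is forfeited up front, before any untrusted input is handled.

/* forfeit.c: pledge(2)/unveil(2) sketch, OpenBSD only. */
#include <err.h>
#include <unistd.h>

int main(void)
{
	if (unveil("/etc/ssl", "r") == -1)	/* only this path stays visible, read-only */
		err(1, "unveil");
	if (unveil(NULL, NULL) == -1)		/* lock the unveil list */
		err(1, "unveil");
	if (pledge("stdio rpath", NULL) == -1)	/* forfeit everything but stdio + reads */
		err(1, "pledge");

	/* ... the real work happens here, with sharply reduced authority ... */
	return 0;
}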

The forfeiture pattern is seen in UNIX daemons as well: because you historically need root permission (or in Linux, the CAP_NET_BIND_SERVICE process capability bit) to bind to a port lower than 1024, a daemon will start as root, and then voluntarily drop itself down to an EUID which does not possess root privileges, effectively forfeiting them.
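
A rough sketch of that forfeiture dance, with a hypothetical unprivileged uid and gid and abbreviated error handling. The ordering matters: supplementary groups are dropped first, then the gid, then the uid. Note that the already-bound socket remains usable after the drop, which is itself a small object-capability pattern hiding inside a very old idiom.

/* drop.c: bind a privileged port as root, then forfeit root. */
#include <arpa/inet.h>
#include <grp.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int listen_then_drop(uid_t unpriv_uid, gid_t unpriv_gid)
{
	struct sockaddr_in sa;
	int s = socket(AF_INET, SOCK_STREAM, 0);

	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(80);		/* below 1024: needs root or CAP_NET_BIND_SERVICE */
	sa.sin_addr.s_addr = htonl(INADDR_ANY);

	if (s < 0 || bind(s, (struct sockaddr *)&sa, sizeof sa) < 0 || listen(s, 128) < 0)
		return -1;

	/* Forfeit root: supplementary groups first, then gid, then uid. */
	if (setgroups(0, NULL) < 0 || setgid(unpriv_gid) < 0 || setuid(unpriv_uid) < 0)
		return -1;

	return s;	/* the bound socket survives the privilege drop */
}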

Because processes start with high levels of privilege and then give up those rights, this approach is not aligned with the principle of least authority, but it does have tangible security benefits in practice.

Linux process capabilities

Since Linux 2.2, there has been a feature called process capabilities; the CAP_NET_BIND_SERVICE bit mentioned above is one of the capabilities that can be set on a binary. These are basically used as an alternative to setting the SUID bit on binaries, and are totally useless outside of that use case. They have nothing to do with object capabilities, although an object capability system could be designed to facilitate many of the same things SUID binaries presently do today.

Further reading

Fuchsia uses a combination of filesystem namespacing and object capability mediation to restrict access to specific devices on a per-app basis. This could be achieved on Linux as well with namespaces and a mediation daemon.

glibc is still not Y2038 compliant by default

Most of my readers are probably aware of the Y2038 issue by now. If not, it refers to 3:14:07 UTC on January 19, 2038, when 32-bit time_t will overflow. The Linux kernel internally switched to 64-bit timekeeping several years ago, and Alpine made the jump to 64-bit time_t with the release of Alpine 3.13.

In the GNU/Linux world, the GNU libc started to support 64-bit time_t in version 2.34. Unfortunately for the rest of us, the approach they have used to support 64-bit time_t is technically deficient, following in the footsteps of other never-fully-completed transitions.

the right way to transition to new ABIs

In the musl C library, which Alpine uses, and in many other UNIX C library implementations, time_t is always 64-bit in new code, and compatibility stubs are provided for code requiring the old 32-bit functions. As code is rebuilt over time, it automatically becomes Y2038-compliant without any effort.

Microsoft went a step further in msvcrt: you get 64-bit time_t by default, but if you compile with the _USE_32BIT_TIME_T macro defined, you can still access the old 32-bit functions.

It should be noted that both approaches described above introduce zero friction to get the right thing.

how GNU handles transitions in GNU libc

The approach the GNU project has taken for transitioning to new ABIs is the exact opposite: you must explicitly ask for the new functionality, or you will never get it. This approach leaves open the possibility of changing the defaults in the future, but so far they have never actually done so.

This can be observed with the large file support extension, which is needed to handle files larger than 2 GiB: you must always build your code with -D_FILE_OFFSET_BITS=64. Similarly, if you’re on a 32-bit system and you do not build your app with -D_TIME_BITS=64, it will not be built using an ABI that is Y2038-compliant.

This is the worst possible way to do this. Consider libraries: what if a dependency is built with -D_TIME_BITS=64, and another dependency is built without, and they need to exchange struct timespec or similar with each other? Well, in this case, your program will likely crash or have weird behavior, as you’re not consistently using the same struct timespec in the program you’ve compiled.
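
A minimal sketch of the failure mode on a 32-bit glibc system, using a hypothetical library function; the two translation units disagree about the layout of struct timespec purely because they were compiled with different flags.

/* libfoo.c, built with: cc -m32 -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 -c libfoo.c */
#include <time.h>

void foo_deadline(struct timespec *ts)
{
	clock_gettime(CLOCK_REALTIME, ts);	/* fills a struct with a 64-bit tv_sec */
	ts->tv_sec += 60;
}

/* main.c, built with: cc -m32 -c main.c   (no _TIME_BITS, so tv_sec is 32-bit) */
#include <stdio.h>
#include <time.h>

void foo_deadline(struct timespec *ts);

int main(void)
{
	struct timespec ts;	/* 8 bytes here, 16 bytes as far as libfoo.c is concerned */
	foo_deadline(&ts);	/* writes past the caller's object: undefined behavior */
	printf("%lld\n", (long long)ts.tv_sec);
	return 0;
}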

Fortunately, if you are targeting 32-bit systems, and you would like to manipulate files larger than 2GiB or have confidence that your code will continue to work in the year 2038, there are Linux distributions that are built on musl, like Alpine, which will allow you to not have to deal with the idiosyncrasies of the GNU libc.

stop defining feature-test macros in your code

If there is any change in the C world I would like to see in 2022, it would be the abolition of #define _GNU_SOURCE. In many cases, defining this macro in C code can have harmful side effects ranging from subtle breakage to miscompilation, because of how feature-test macros work.

When writing or studying code, you’ve likely encountered something like this:

#define _GNU_SOURCE
#include <string.h>

Or worse:

#include <stdlib.h>
#include <unistd.h>
#define _XOPEN_SOURCE
#include <string.h>

The #define _XOPEN_SOURCE and #define _GNU_SOURCE in those examples are defining something known as a feature-test macro, which is used to selectively expose function declarations in the headers. These macros are necessary because some standards define the same function in conflicting ways; the conflicting implementations are given distinct symbol names so they can coexist in the library, but only one set of declarations can be visible at a time, so the feature-test macros allow the user to select which definitions they want.

The correct way to use these macros is by defining them at compile time with compiler flags, e.g. -D_XOPEN_SOURCE or -std=gnu11. This ensures that the declared feature-test macros are consistently defined while compiling the project.
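
As a concrete illustration, strcasestr(3) is one of the functions hidden behind _GNU_SOURCE on glibc. Here is a minimal sketch of using it with the macro supplied by the build system rather than defined in the source:

/* demo.c: note that there is no #define _GNU_SOURCE here.
   Build with the macro on the command line instead:
       cc -D_GNU_SOURCE -o demo demo.c
   or add -D_GNU_SOURCE to CPPFLAGS in your build system. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* strcasestr() is a non-POSIX extension; it is only declared
	   when the appropriate feature-test macro is in effect. */
	if (strcasestr("Hello, World", "world"))
		puts("found it");
	return 0;
}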

Why is #define _GNU_SOURCE such a common sight in source code, then? It’s because we have documentation which does not correctly explain the role of feature-test macros. Instead, in a given manual page, you might see language like “this function is only enabled if the _GNU_SOURCE macro is defined.”

To find out the actual way to use those macros, you would have to read feature_test_macros(7), which is usually not referenced from individual manual pages. And while that manual page does show examples like the ones above as bad practice, it understates how bad the practice actually is, and such an example is one of the first code examples you see on that page.

In conclusion, never use #define _GNU_SOURCE, always use compiler flags for this.