Graphics and System Readings

2021-01-16 - (20 min read)

Blogs and news read during Oct. 2020 to Jan. 2021 that I found interesting. Summaries updated in Feb. 2021.

Disclaimer: Opinions on my own and please judge the credibility by yourself.

[[toc]]

Android Kernel Notes From LPC 2020

https://lwn.net/Articles/830979/ - September 10, 2020

In its early days, the Android project experienced a high-profile disconnect with the kernel community. That situation has since improved considerably, but there are still differences between Android kernels and the mainline. As a result, it is not possible to run Android on a vanilla kernel.

... Vendors pick up this kernel and apply their own changes — often significant, core-kernel changes — to create a vendor kernel ... The end result of all this patching is that every device has its own kernel, meaning that there are thousands of different "Android" kernels in use.

Fragmentation makes it harder to ensure that all devices are running current kernels — or even that they get security updates ... Fixes applied by vendors and OEMs often do not make it back into the mainline, making things worse for everybody.

... The Android developers would like to fix this fragmentation problem ... goal involves providing a single generic kernel in binary form ( Android Generic Kernel Image, a.k.a GKI)

Any vendor-specific or device-specific code that is not in the mainline kernel will need to be shipped in the form of kernel modules to be loaded into the GKI ... Android 11 release requires all devices to ship with kernels based on the Android Common Kernel; Android 12 will require shipping with the GKI instead.

what it takes to boot Android on a mainline kernel ... in the generic case, there is only one patch needed at this point: anonymous VMA naming.

Android is moving to Android Generic Kernel Image to solve the kernel fragmentation problem.

Accurate Timestamps for the ftrace Ring Buffer

https://lwn.net/Articles/831207/ - September 22, 2020

ftrace uses a ring buffer to quickly communicate events to user space ... Until recently, the design of the ring buffer has led to the creation of inaccurate timestamps when events are generated from interrupt handlers.

The ftrace ring buffer was added in 2008 and, a little less than a year later, it became completely lockless.

Writes to a specific per-CPU buffer can only happen on the CPU for that buffer. That ensures that any contention between writers will always be in stack order ... The design of the ring buffer depends on the fact that writers that interrupt other writers will completely finish before the interrupted writer may continue ...

The article talks about a fixing in ftrace's timestamp in handling nested interrupts. The bug being fixed is rare to be triggered. But the article is a good explanation about the design of ftrace, how it achieves both time and memory efficient and what kind of nasty bug can happen in lock free programming.

Resource Management in KDE

https://lwn.net/Articles/834329/ - October 19, 2020

Applications that run on the Linux desktop have changed significantly under the hood in recent years; for example, they use more processes than before.

Some time ago, when a user was running a web browser like Firefox ... the management of running processes was easy. The user could run a ps command and would see just one line of output for each of those applications ... Now, the situation is "very different". When a user opens a Firefox instance they can get a dozen processes.

Fairness is also an increasingly important issue.

... an example of Krita, an advanced graphics application ... all contained within a single process. On the other hand, Discord has those 13 processes, many of which will be making heavy use of the CPU "because it is written in Electron". The system's CPU scheduler will see those two applications as 14 opaque processes, not knowing what they correspond to.

The solution exists in the form of control groups (or cgroups), ... Application processes are started, then they are tagged as belonging to a cgroup ... assigning weights ... to control the CPU time they use.

KDE (and cooperated with GNOME) are making changes to improve the resource management (CPU time, memory, etc.) for desktop applications. Right now, resources are managed to individual processes. However, applications may spawn different number of processes, leading to unfairness for high priority but fewer processed applications.
This is getting to be resolved by using cgroup to manage the resources.

The Arm64 Memory Tagging Extension in Linux

https://lwn.net/Articles/834289/ - October 15, 2020

the Arm64 architecture uses 64-bit pointers to address memory. There is no need (yet!) for an address space that large, though, so normally only 48 of those bits are actually used by the hardware — or 52 bits if a special large-address-space option is enabled ... there are 12-16 bits that can be used for other purposes ... The memory tagging extension (MTE) is one of those uses.

MTE allows the storage of a four-bit "key" in bits 59-56 of a virtual address ... Four bits only allow for 16 distinct key values ... If a function like malloc() ensures that allocations that are adjacent in memory have different key values ... Use-after-free bugs can be detected by changing the key value immediately when a range of memory is freed.

MTE thus has two levels of applicability ... If enabled during the normal software-development process, it should help to identify a range of bugs ... it can also be enabled on production systems to add one more obstacle that an attacker must overcome to exploit a known vulnerability.

MTE is disabled by default on Linux systems, even on hardware that supports it. A user-space process can enable MTE for a specific region of memory ... None of this is helpful to anybody now, though, since hardware with MTE support is not actually shipping yet. The good news is that, once that hardware is available, the software side should be ready for it immediately.

On Arm64, not every 64 bits of a pointer is being used by OS and hardware. The memory tagging extension uses 4 bits of the unused bits to add a key to a pointer. It can be used to

detect incorrect memory free in development and
detect illegal memory dereference issued by attackers. Though devices supporting MTE haven't been manufactured, the Linux driver for MTE is ready.

Kernel Support for Processor Undervolting

https://lwn.net/Articles/835594/ - November 2, 2020

Current processors can run with any of a number of combinations of frequency and voltage, which can change dynamically in a process called dynamic frequency scaling ... It is possible to place a CPU into a configuration outside of its specified operational envelope ... the processor may malfunction in a number of ways, from occasional false results from some instructions to a complete crash.

In the case of Intel chips, the voltage settings are controlled by Model Specific Registers (MSRs) ... On Linux, access to the MSRs from user space is possible using /dev/cpu/CPUID/msr special files ... maintainer of Intel power-related drivers responded ... a proper sysfs interface ... would have to perform checks of the passed values to prevent users from crashing their systems.

Undervolting CPU is useful for saving power and reducing heat. There is no standard way to do undervolting on Linux. Doing undervolting has the risk of malfunctioning the CPU and it can also be exploited by malicious attackers. Discussions are going on about possible future work on undervolting.

KVM for Android

https://lwn.net/Articles/836693/ - November 11, 2020

The hypervisor situation on Android is chaotic ... At least all of the Android devices are running some version of Linux, but in terms of hypervisors, "it's the wild west of fragmentation" ... hypervisors are used in Android today is ... running code outside of Android itself.

Security is hampered because there is an increased trusted computing base (TCB) and it is more difficult to update the devices because of the fragmentation at that level. And functionality is lacking because there is no access to the hardware virtualization features from within Android.

the Android project would like to have a way to de-privilege this third-party code. There is a need for a portable environment that can host these services in a way that is isolated from the Android system.

One way to do that is to move the trusted code into a VM at the same level as the Android system. The idea is to use the GKI effort to introduce KVM as that hypervisor in order to move that third-party code out of the over-privileged trusted region.

The protected KVM project is closely related to Android GKI. Android suffers from Kernel fragmentation issues, but third-party code for digital rights management (DRM), various opaque binary blobs, cryptographic code, etc are even worse fragmented. The protected KVM project is trying to create a virtual environment in AOSP for those third-party code.

epoll_pwait2(), close_range(), and encoded I/O

https://lwn.net/Articles/837816/ - November 20, 2020

The kernel's "epoll" subsystem provides a high-performance mechanism for a process to wait on events from a large number of open file descriptors. ... waiting on events with epoll_wait() or epoll_pwait(). When waiting, the caller can specify a timeout as an integer number of milliseconds.

Nearly 20 years ago, when this work was being done, a millisecond timeout seemed like enough resolution ... In 2020, though, one millisecond can be an eternity;

the patch set instead added a new flag (EPOLL_NSTIMEO) to epoll_create() ... If an epoll file descriptor was created with that flag set, then the timeout value for epoll_wait() would be interpreted as being in nanoseconds rather than milliseconds ... Having one system call set a flag to change how arguments to a different system call would be interpreted was "not very nice" ... After a bit of back and forth, that is what happened ... the patch set adds epoll_pwait2():

New syscall epoll_pwait2() to support nanoseconds event waits.

The close_range() system call was added in the 5.9 release as a way to efficiently close a whole list of file descriptors:

New syscall close_range() to close a range of fds.

Some filesystems have the ability to compress and/or encrypt data written to files ... somebody wanted the ability to work with this "encoded" data directly, bypassing the processing steps within the filesystem code ... With this patch set applied, it becomes possible to read the compressed and/or encrypted data directly and write it directly, with no intervening processing.

New syscalls preadv() and pwritev() to support directly read and write encoded data.

Why printk() is so Complicated (and How to Fix It)

https://lwn.net/Articles/800946/ - October 3, 2019

The difficulties with printk() over the years ... come down to the tension between non-interference and reliability. Non-interference can be addressed by making printk() fully preemptible, making the ring buffer safe in all contexts, and moving console handling to dedicated kernel threads. Reliability, instead, can be achieved by providing a synchronous channel for important messages, an "atomic consoles" concept, and the notion of "emergency messages".

the printk() work starts with the creation of a new ring buffer meant to address the problems ... It is fully lockless, supporting multiple readers and writers in all contexts

add an "atomic console" concept ... would have a write_atomic() method ... This method is defined to operate synchronously ...

Associated with atomic consoles is the idea of "emergency messages" that must go out right away...

Good readings to know about the complexity (surprisingly) behind printk() and some recent progress.

5.10 Merge Window, Part 1

https://lwn.net/Articles/834157/ - October 16, 2020

The Arm v8.5 memory tagging extension is now supported...

The seqcount latch specialized lock type has been added.

Static calls are a mechanism for performing indirect function calls with better performance, especially on systems where retpolines would otherwise have to be used to protect against Spectre vulnerabilities. This mechanism has been under development since 2018; it was finally merged for 5.10.

The printk() subsystem has gained a new lockless ring buffer meant to be a first step in resolving a number of problems in this area.

The minimum version of Clang needed to build the kernel is now 10.0.1.

Several aforementioned patches landed in 5.10. Also exciting to see retpolines landed in the kernel. Compensating some performance loss from "Spectre".

Bootstrappable Builds

https://lwn.net/Articles/841797/ - January 6, 2021

The idea of Reproducible Builds—being able to recreate bit-for-bit identical binaries using the same source code—has gained momentum over the last few years. Reproducible builds provide some safeguards against bad actors in the software supply chain. Minimizing the reliance on opaque binaries for building our software ecosystem is the goal of the Bootstrappable Builds project.

having a way to bootstrap a C compiler, such as GCC, is among the projects that Bootstrappable Builds is pursuing ... One such effort is maintaining a subset of GCC version 4.7, which is the last version that can be built with only a C compiler

A related effort revolves around GNU Mes, which is the combination of a Scheme interpreter written in C and a C compiler written in Scheme. The two parts are mutually self-hosting, so one can be built from the other...

Bootstrap builds means building from source without using any binary tool or libraries. It aims to be a defending against supply chain attacks. I guess it is based on the assumption that a backdoor in a source file is easy to find while a backdoor in a binary file is hard to be detected.

Considering the scale of the recently unveiled supply chain attach to SolarWind, I believe bootstrap builds may get more and more attention.

Existing efforts for bootstrap builds includes maintaining a version of GCC that can compile itself. Beside, there is a brilliant idea to use a Schema written in C and C compiler written in Schema to start with, so they can build from source with the help of an external C or Schema compiler.

Actually I feel it is still possible for the external compiler to inject malicious code to the compiled C or Schema compiler. But with more rounds of C to Schema and Schema to C compiling, it might be practically impossible for the external compiler to influence the final, kind of "converged", compilers. Not sure whether this reasoning is correct.

SWVKC Is A Vulkan-Powered Wayland Compositor Focused On Performance + Correctness

https://www.phoronix.com/scan.php?page=news_item&px=SWVKC-Wayland-Vulkan-Comp - July 26, 2020

One of the leading (among few) examples of a Vulkan-powered window manager / compositor is ChamferWM, which does continue to be developed.

SWVKC meanwhile is one that has been seeing development this year as an alpha-stage Wayland Vulkan compositor.

A quick look on SWVKC's github page. It is still use gbm to buffer management, that makes it not so interesting. IMO it is absolutely possible for the compositor to talk to DRM to get rid of gbm. The ChamferWM linked in the post seems to be a purer Vulkan compositor.

Developing Wayland Color Management and High Dynamic Range

https://www.collabora.com/developing-wayland-color-management - November 19, 2020

Wayland (the protocol and architecture) is still lacking proper consideration for color management... even X11 has not gained support for HDR... This is a story about starting the efforts to fix the situation on Wayland.

The foundation for the color management protocol are ICC profile files for describing both output and color space conversions... HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed...

There are several big and small open questions we haven't had the time to tackle yet... This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.

HDR display support has been missing from Linux for a very long time, even X11 hasn't got it yet. There is some progress on working out a HDR/color-management extension in Wayland. The protocol is going to be designed that the compositor will take the control of how to do tone mapping or gamut editing, depending on how 1) monitor config and property and 2) client surface config. So that both SDR surface and HDR surface will be delivered with good quality at the same time.

From Panfrost to production, a tale of Open Source graphics

https://www.collabora.com/from-panfrost-to-production-a-tale - November 03, 2020

the open source stack for Arm's Mali Midgard and Bifrost GPUs ...

Mesa 20.3 -- scheduled for release at the end-of-the-month -- will feature some Bifrost support out-of-the-box.

Reverse-engineered Arm Bifrost stack is coming to the latest Mesa release.

Reverse-Engineering The Apple M1 GPU

https://www.phoronix.com/scan.php?page=news_item&px=Apple-M1-GPU-RE - January 7, 2021

Alyssa Rosenzweig who is known for her work on reverse-engineering Arm GPUs and in particular the multi-year effort so far working on the Panfrost open-source driver stack has taken up an interest in Apple's M1 graphics processor...

RV64X: A Free, Open Source GPU for RISC-V

https://www.eetimes.com/rv64x-a-free-open-source-gpu-for-risc-v/ - January 27, 2021

Somehow reminds me of Intel's failed x86 GPU project.

Attacking the Qualcomm Adreno GPU

https://googleprojectzero.blogspot.com/attacking-qualcomm-adreno-gpu - September 8, 2020

This blog post focuses ... describe an unusual vulnerability in Qualcomm's Adreno GPU, and how it could be used to achieve kernel code execution from within the Android application sandbox ... For sandbox escapes, the GPU offers up a particularly interesting attack surface from the chipset tier. Since GPU acceleration is widely used in applications, the Android sandbox allows full access to the underlying GPU device.

To set up a new shared mapping, the application will ask the KGSL kernel driver for an allocation by calling the IOCTL_KGSL_GPUMEM_ALLOC ioctl. The kernel driver will prepare a region of physical memory, and then map this memory into the GPU's address space ...

Each userland process has its own GPU context, meaning that while a certain application is running operations on the GPU, the GPU will only be able to access mappings that it shares with that process. This is needed so that one application can't ask the GPU to read the shared mappings from another application. In practice this separation is achieved by changing which set of page tables is loaded into the IOMMU whenever a GPU context switch occurs. A GPU context switch occurs whenever the GPU is scheduled to run a command from a different process ... However certain mappings are used by all GPU contexts, and so can be present in every set of page tables. They are called global shared mappings, and are used for a variety of system and debugging functions between the GPU and the KGSL kernel driver.

dump the global mappings (and their GPU virtual addresses) ... scratch buffer has appeared!

two primary usages of the scratch buffer: 1. The GPU address of a preemption restore buffer is dumped to the scratch memory, which appears to be used if a higher priority GPU command interrupts a lower priority command. 2. The read pointer (RPTR) of the ringbuffer (RB) is read from scratch memory and used when calculating the amount of free space in the ringbuffer.

So what happens if the scratch RPTR value is controlled by an attacker? ... For example, we can make the condition ... succeed when it normally wouldn't by artificially increasing the value of the scratch RPTR, which ... results in returning a portion of the ringbuffer that overlaps the correct RPTR location ... That means that an attacker could overwrite ringbuffer commands that haven't yet been processed by the GPU with incoming GPU commands!

... an attacker cannot modify the scratch buffer directly from their malicious/compromised userland process ... What if we could make the GPU hardware write a malicious RPTR value into the scratch buffer on our behalf? ... It turns out that not every global shared mapping can be written to by user-supplied GPU commands, but the scratch buffer can be ... an attacker can also construct these command sequences manually by setting up some GPU shared memory and calling IOCTL_KGSL_GPU_COMMAND ... CP_MEM_WRITE operation writes a constant value to a GPU address

Now that we know we can reliably control the scratch RPTR value ... what does overwriting it buy us?...

Recall that the ringbuffer is used to send commands from the CPU to the GPU. In practice however, user-supplied GPU commands are never placed directly onto the ringbuffer. This is for two reasons: 1) space in the ringbuffer is limited, and user-supplied GPU commands can be very large, and 2) the ringbuffer is readable by all GPU contexts, and so we want to ensure that one process can't read commands from a different process

a layer of indirection occurs, and user-supplied GPU commands are run after an "indirect branch" from the ringbuffer occurs ... Conceptually system level commands are executed straight from the ringbuffer, and user level commands are run after an indirect branch into GPU shared memory. Once the user commands finish, control flow will return to the next ringbuffer operation. The indirect branch is performed with a CP_INDIRECT_BUFFER_PFE operation, which is inserted into the ringbuffer by adreno_ringbuffer_submitcmd. This operation takes two parameters, the GPU address of the branch target (e.g. a GPU shared memory mapping with user-supplied commands in it) and a size value.

Looking further, it looks like the GPU has the IOMMU's "TTBR0" register mapped to a protected mode GPU register as well. By reading the ARM address translation and IOMMU documentation, we can see that TTBR0 is the base address of the page tables used for translating GPU addresses to physical memory addresses. That means if we can point TTBR0 to a set of malicious page tables, then we can translate any GPU address to any physical address of our choosing ... we could make the GPU switch to an attacker controlled page table ... Not only would our GPU commands be able to read and write shared mappings from other processes, we would be able to read and write to any physical address in memory, including kernel memory!

A well explained exploitation on Android that uses Adreno GPU to achieve kernel code execution. The article gives an excellent introduction on how CPU dispatches commands to GPU using RingBuffer and ScratchBuffer. And how a few tampering on ScratchBuffer and IOMMU register fools the GPU to read commands from a malicious controlled page.

Let’s Build a High-performance Fuzzer with GPUs!

https://blog.trailofbits.com/lets-build-a-high-performance-fuzzer-with-gpus - October 22, 2020

Using TLA+ in the Real World to Understand a Glibc Bug

https://probablydance.com/using-tla-in-the-real-world - October 31, 2020