Neko

Buzzing with eBPF

This post is a republish of the following blog that was written as part of an eBPF workshop that my friend Navneet Nayak and I held on 5th September 2024 at our university.


eBPF is a technology that’s being used for tracing, networking, security, and observability in innovative ways. Let’s learn about how you can give your kernel superpowers using eBPF!

Link to the repository containing all of the workshop and post-workshop resources: https://github.com/homebrew-ec-foss/eBPF-workshop


Introduction

Before we begin, we’d like to mention that this article is only an introduction to the world of eBPF. We strongly encourage you to explore eBPF and its associated topics on your own. If you’re curious about anything mentioned here, look it up to learn more. The research papers, websites and books linked in this article are a great starting point. Happy learning!

eBPF stands for the extended Berkeley Packet Filter. Since the name implies that it is an extension of the Berkeley Packet Filter (BPF), it only makes sense that we understand what BPF is first. To do that, let’s,

Rewind back to 1992

The 1990s, a time of tremendous technological innovation and progress that began reshaping the world. With the advent of the internet, new challenges emerged, including the need for effective packet filtering. This need arose from the necessity to protect systems from malicious actors and manage network traffic more efficiently. Let’s take a brief look at the state of network packet filtering back then.

Most packet filtering tools ran in the userspace of a machine, meaning that any packet received would have to be copied from the kernel-space to the userspace where applications would filter them.

The userspace packet filtering applications needed to make system calls to the kernel to perform almost any operation such as reading packets. Needless to say, this method was extremely inefficient and slow.

Userspace packet filtering
  1. Filtering was very slow due to packets being copied to user space. Filtering applications had to issue system calls to perform actions on the packet.
  2. Limited flexibility with respect to setting filters, providing very little customisation.
  3. Many CPU instructions were required to filter even a single packet.

Note: Packet filters for UNIX systems that used the STREAMS and NIT frameworks provided by the UNIX kernel functioned like this.


The userspace and kernel

A computer usually comprises application software (such as games and browsers) and the hardware on which it runs. The hardware and application software do not interact with each other directly. Instead, software running in userspace uses system resources by interacting with the kernel, an intermediate layer between applications and hardware, which acts on its behalf. Applications interact with the kernel via the system call interface to perform operations such as reading memory, writing to disk or sending data over a network. The kernel has direct access to the hardware, so code running in the kernel avoids the cost of system calls and of copying data to user space, and can be much faster than equivalent userspace code. This also makes the kernel a vantage point to monitor or modify application functionality, since most actions an application performs must go through the kernel.


In comes BPF!

BPF was a novel approach to filter packets completely in the kernel!

Steven McCanne and Van Jacobson at the Lawrence Berkeley Laboratory came up with the BSD Packet Filter in 1993. It was introduced to Linux in 1997, in kernel version 2.1.75, as the Berkeley Packet Filter (BPF). BPF was the first to implement a register-based filter evaluator (VM) in the kernel. As described in the original paper, the BPF virtual machine emulates a simple register-based CPU consisting of an accumulator, an index register, a scratch memory store and an implicit program counter.

The BPF virtual machine has a small, assembly-like instruction set operating on 32-bit values, which allows complex filters to be defined easily and with immense flexibility. For instance, these are the BPF instructions to filter out any packets not using the Internet Protocol (IP):

ldh     [12]
jeq     #ETHERTYPE_IP, L1, L2
L1:     ret     #TRUE
L2:     ret     #0

These instructions run in the BPF virtual machine and can be efficiently converted to native machine code, sandboxed from the kernel. BPF showed huge improvements in speed over existing solutions and provided greater flexibility to write custom programs for filtering.
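
To make the four instructions above concrete, here is a minimal userspace sketch in C of the same decision. It is not how the kernel evaluates BPF programs; the function name accept_ipv4_frame and the constant ETHERTYPE_IP_SKETCH are made up for illustration, and rejecting frames that are too short is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* EtherType for IPv4; stands in for ETHERTYPE_IP in the filter above. */
#define ETHERTYPE_IP_SKETCH 0x0800u

/* Hypothetical helper: returns nonzero to accept the frame, 0 to reject it. */
unsigned int accept_ipv4_frame(const uint8_t *frame, size_t len) {
    /* A frame too short to hold bytes 12 and 13 is rejected (assumption). */
    if (len < 14)
        return 0;

    /* ldh [12]: load the 16-bit big-endian value at byte offset 12. */
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);

    /* jeq #ETHERTYPE_IP, L1, L2 */
    if (ethertype == ETHERTYPE_IP_SKETCH)
        return 1; /* L1: ret #TRUE */
    return 0;     /* L2: ret #0 */
}
```

Byte offset 12 is where the EtherType field sits in an Ethernet header, right after the two 6-byte MAC addresses.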

This led to the widespread adoption of BPF in famous tools like tcpdump. The original BPF paper (https://www.tcpdump.org/papers/bpf-usenix93.pdf) is a great read to learn more about the thought process behind designing BPF. It discusses existing solutions such as the STREAMS NIT packet filters and another stack-based VM packet filter called CSPF, and explains how BPF was a far more efficient approach to filtering.


eBPF

eBPF stands for extended Berkeley Packet Filter. While the name might imply that it is just a faster and more efficient extension of BPF, that is slightly misleading, since eBPF can do so much more! Imagine taking the BPF virtual machine and its instruction set, and running BPF programs anywhere in the kernel! The possibilities are truly endless. eBPF does exactly that. It takes the core features of BPF and extends them beyond packet filtering. eBPF programs can be attached to various kernel hooks, probes in kernel functions, and tracepoints in the kernel.

eBPF programs can be executed almost anywhere in the kernel!

You can think of eBPF as a mini computer running anywhere in the kernel, executing user code at lightning fast speeds.

eBPF programs in the kernel

Hooks

As the name suggests, kernel hooks are used to hook programs to some part of the kernel. eBPF is event-based and a program is executed whenever an event occurs.

There are various types of eBPF hooks, ranging from networking-specific hooks like socket_filter and xdp, to more general hooks like kprobes, which can be attached to almost any kernel function, and tracepoints, which are marked locations in the kernel code. Some of these hooks, like kprobes and tracepoints, existed well before eBPF. You can learn more about them here: https://docs.kernel.org/trace/

You can find the full list of eBPF program types by running the command

sudo bpftool feature list_builtins prog_types

eBPF can be used for a variety of applications such as:

  1. Performance tracing of pretty much any aspect of a system
  2. High-performance networking, with built-in visibility
  3. Detecting and (optionally) preventing malicious activity

eBPF was first introduced in kernel version 3.18 in 2014, with significant improvements over classical BPF:

  1. The BPF instruction set was completely overhauled to be more efficient on 64-bit machines, and the interpreter was entirely rewritten.
  2. eBPF maps were introduced, which are data structures that can be accessed by BPF programs running in the kernel as well as user space applications, thus allowing information to be shared among them.
  3. And the important part: The eBPF verifier was added to ensure that the programs are safe to run in the kernel.

Image of the loading and verification process

As seen above, all eBPF programs go through verification and JIT compilation, after which they can be attached to specific hooks. Various libraries provide helper functions, handle kernel interfaces and simplify the process of loading and attaching programs. Some of these include BCC for Python, libbpf for C and cilium/ebpf for Go. bpftrace is a high-level tracing language that utilises BCC and libbpf. We will be making use of the libbpf library in this workshop.


Programming the kernel

In the previous sections, you might have been left wondering why the eBPF virtual machine and verifier are a big deal. Why not add our own code to the kernel to run any program directly with zero overhead? Or develop Loadable Kernel Modules(LKM)? Turns out, these alternatives have their cons.

Modifying the Kernel Source

Modifying the kernel source code is a highly challenging task. The Linux kernel has more than 30 million lines of code written over several years by thousands of contributors. Even if someone successfully edited the source to add their own functionality, there are other hurdles. Linux being a community project, all changes have to be approved by the community and, ultimately, Linus Torvalds. On average, only about a third of all kernel patches are accepted, after which they are released in the next kernel version (which may take a while).

Even after jumping through all these hoops, there is no guarantee that users will be able to utilise the merged feature. Several distributions ship an older version of the kernel and wait for newer versions to be thoroughly tested before releasing them.

Comic on modifying the kernel

Loadable Kernel Modules

The kernel also provides for a modular approach to extend functionality by writing LKMs that can be loaded and unloaded when required. While modules do not require modifying the kernel source directly, they have their disadvantages:

  1. Writing LKMs still requires significant experience with kernel programming.
  2. Since any code running in the kernel is privileged, a module crash would bring the entire kernel to a halt.
  3. External LKMs might have security vulnerabilities, with serious consequences due to their privileged nature.

eBPF programs

eBPF provides a better alternative to program the kernel by solving the problems faced by the above approaches:

Security

Dynamic loading

Ease of use and Community

The recent CrowdStrike blunder that took down servers around the world might have been avoided if a sandboxing mechanism for running kernel code had been used (https://thenewstack.io/crowdstrike-a-wake-up-call-for-ebpf-based-endpoint-security/). There is an ongoing project to bring eBPF to Windows: https://github.com/microsoft/ebpf-for-windows.

Comic on simplicity of eBPF programs

Let's dive in!

eBPF can be used for a variety of purposes, with the eBPF website listing case studies of production usage. We will cover two specific uses of eBPF in this workshop: Observability and Networking.

Observability

eBPF’s Hello World!

Let’s get our hands dirty and write a simple eBPF Hello World program of sorts. We’ll write a program that will print out Hello World every time any application makes a system call.

//hello_kern.bpf.c

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("raw_tracepoint/sys_enter")
int helloworld(void *ctx) {
	bpf_printk("Hello World!\n");
	return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

Note: It is completely optional to add .bpf to the file extension. It is usually included to easily identify bpf files and can be omitted, with just the .c extension.

Let’s understand this Hello World Program!

Now onto the next step - compiling the C program into BPF byte code. To do this, run the following command:

clang -target bpf -I/usr/include/$(uname -m)-linux-gnu -g -O2 -c hello_kern.bpf.c -o hello_kern.bpf.o

Here, we assume that the eBPF program file has been named hello_kern.bpf.c and the output object file to be generated will be called hello_kern.bpf.o.

The next step is to load the bpf object into the kernel and attach it to the sys_enter raw_tracepoint. This can be done by making use of bpftool, which will save us the hassle of manually writing C code to load and attach the eBPF code to the kernel.

sudo bpftool prog load hello_kern.bpf.o /sys/fs/bpf/prog autoattach

Believe it or not, your eBPF program is now running in the kernel, and it runs every time any application on your system makes a system call! However, you might be wondering why there is no Hello World being printed. Since the eBPF code is running in the kernel, its output isn’t printed directly to the terminal. Instead, the messages are logged to the kernel's trace pipe, which can be viewed with the following command:

sudo cat /sys/kernel/debug/tracing/trace_pipe

We can also use bpftool to check the trace log:

sudo bpftool prog tracelog

That's a lot of messages being printed to the console, so let's press Ctrl + C to stop and read the individual logs. Awesome! The sheer number of messages goes to show how many syscalls are being made to the kernel at any given moment, which makes tracing them all the more valuable!


Let’s investigate a little deeper

Now that we’ve written our first eBPF program, let’s take a step back and get our hands a little dirtier. We can check out the actual eBPF byte code and register instructions using llvm-objdump.

llvm-objdump -S hello_kern.bpf.o

This produces the following output:

hello_kern.bpf.o:	file format elf64-bpf

Disassembly of section raw_tracepoint/sys_enter:

0000000000000000 <helloworld>:
;   bpf_printk("Hello World!\n");
       0:	18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00	r1 = 0x0 ll
       2:	b7 02 00 00 0e 00 00 00	r2 = 0xe
       3:	85 00 00 00 06 00 00 00	call 0x6
;   return 0;
       4:	b7 00 00 00 00 00 00 00	r0 = 0x0
       5:	95 00 00 00 00 00 00 00	exit
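
To read the hex bytes in this listing, it helps to know how one 8-byte eBPF instruction is laid out: one opcode byte, one byte holding the destination register (low 4 bits) and source register (high 4 bits), a 16-bit offset, and a 32-bit immediate. The following is a minimal userspace sketch of a decoder, assuming the little-endian encoding shown above; the struct and function names are made up for illustration and are not part of any library.

```c
#include <stdint.h>

/* Fields of one 8-byte eBPF instruction (names are this sketch's own). */
struct decoded_insn {
    uint8_t opcode;  /* operation, e.g. 0xb7 = 64-bit move of an immediate */
    uint8_t dst_reg; /* destination register number */
    uint8_t src_reg; /* source register number */
    int16_t off;     /* signed offset for jumps and memory access */
    int32_t imm;     /* signed immediate value */
};

/* Decode the little-endian byte sequence printed by llvm-objdump. */
struct decoded_insn decode_insn(const uint8_t b[8]) {
    struct decoded_insn d;
    d.opcode = b[0];
    d.dst_reg = b[1] & 0x0f;        /* low nibble */
    d.src_reg = (b[1] >> 4) & 0x0f; /* high nibble */
    d.off = (int16_t)(uint16_t)(b[2] | (b[3] << 8));
    d.imm = (int32_t)((uint32_t)b[4] | ((uint32_t)b[5] << 8) |
                      ((uint32_t)b[6] << 16) | ((uint32_t)b[7] << 24));
    return d;
}
```

Decoding b7 02 00 00 0e 00 00 00 gives opcode 0xb7, destination register 2 and immediate 14, which matches r2 = 0xe: the length of "Hello World!\n" including its terminating NUL byte. Likewise, 85 00 00 00 06 00 00 00 is a call with immediate 6, the helper number of bpf_trace_printk, which is what bpf_printk expands to.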

We can also check out the native instructions converted to assembly by the JIT compiler:

sudo bpftool prog dump jited name helloworld
int helloworld(void * ctx):
bpf_prog_ad7f62a5e7675635_hello:
; bpf_printk("Hello World!\n");
   0:	nopl	(%rax,%rax)
   5:	nop
   7:	pushq	%rbp
   8:	movq	%rsp, %rbp
   b:	movabsq	$-114700180084464, %rdi
  15:	movl	$14, %esi
  1a:	callq	0xffffffffca5d7fb0
; return 0;
  1f:	xorl	%eax, %eax
  21:	leave
  22:	retq
  23:	int3

Pretty cool!


Improving our hello world program

Right now, we're only printing “Hello World” each time a system call is made. Let’s say we want to keep track of the number of system calls an application makes on your system.

In order to do this, we’re going to make use of eBPF maps, a core new feature introduced by eBPF. These data structures allow for data storage and communication among eBPF programs as well as between eBPF programs and userspace applications.

eBPF Maps

We can define an eBPF map to keep track of the number of system calls made by each process using its PID (Process ID)

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u32);
    __type(value, __u64);
    __uint(max_entries, 1024);
} syscall_count_map SEC(".maps");

This defines a BPF map of type ‘HASH’ that allows us to store key-value pairs, where the keys are 32-bit integers (which will store the respective PIDs) and the values are 64-bit integers holding the number of syscalls made by that process. Once again, we’re using the SEC macro, this time to place the map in the .maps section of the object file.

Now for the main counting program:

SEC("raw_tracepoint/sys_enter")
int count_syscalls(void *ctx) {
    /* bpf_get_current_pid_tgid() packs the thread group ID (the PID seen in
       userspace) into the upper 32 bits and the thread ID into the lower 32
       bits, hence the 32-bit shift to the right. The key is a __u32 to match
       the key type declared for the map. */
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 *count;

    count = bpf_map_lookup_elem(&syscall_count_map, &pid);
    if (count != NULL) {
        // If the pid already exists in the map, increment its count.
        // count points into the map, so this updates the entry in place.
        *count += 1;
    } else {
        // Otherwise initialise the count to one for this first syscall
        __u64 initial_count = 1;
        bpf_map_update_elem(&syscall_count_map, &pid, &initial_count, BPF_ANY);
    }

    return 0;
}

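Here, we’re using a couple of helper functions to look up and update values in the map. bpf_map_lookup_elem() takes pointers to the map and the key, and returns a pointer to the stored value (or NULL if the key is absent). bpf_map_update_elem() takes pointers to the map, the key and the new value, followed by a flags argument; BPF_ANY creates the entry or overwrites an existing one.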
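
The shift applied to the result of bpf_get_current_pid_tgid() is easier to follow with the two halves pulled apart. The following is a small userspace sketch of that split; the function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Upper 32 bits: thread group ID, i.e. the PID that userspace tools show. */
uint32_t tgid_from_pid_tgid(uint64_t pid_tgid) {
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: the kernel's per-thread ID. */
uint32_t tid_from_pid_tgid(uint64_t pid_tgid) {
    return (uint32_t)pid_tgid;
}
```

Using the upper half as the map key means that all threads of one process are counted together under a single entry.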

The complete program can be found below:

//syscall_counter_kern.bpf.c

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, __u32);
	__type(value, __u64);
	__uint(max_entries, 1024);
} syscall_count_map SEC(".maps");

SEC("raw_tracepoint/sys_enter")
int count_syscalls(void *ctx) {
	/* Upper 32 bits hold the userspace PID; __u32 matches the map's key type */
	__u32 pid = bpf_get_current_pid_tgid() >> 32;
	__u64 *count;

	count = bpf_map_lookup_elem(&syscall_count_map, &pid);
	if (count != NULL) {
		/* count points into the map, so this updates the entry in place */
		*count += 1;
	} else {
		__u64 initial_count = 1;
		bpf_map_update_elem(&syscall_count_map, &pid, &initial_count, BPF_ANY);
	}

	return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

Next, we need to compile and attach the program. To do this, we can use the same commands we used for the 'Hello World' program (the clang and bpftool load commands). Just make sure to remove the previously attached eBPF program first:

sudo rm /sys/fs/bpf/prog

Alternatively, you can use the Makefile provided in the GitHub repository, which handles the compilation automatically when you run the command sudo make.

We can check whether the program has been successfully loaded and attached by using the command:

sudo bpftool prog list

You should see something like this:

72: raw_tracepoint  name count_syscalls  tag 32d619111ac61bf9  gpl
	loaded_at 2024-09-01T22:18:42+0530  uid 0
	xlated 368B  jited 210B  memlock 4096B  map_ids 22,24
	btf_id 123

Here, the leading ‘72’ is the program ID. The output lists two maps associated with this program. In this example, the IDs of the maps are ‘22’ and ‘24’ respectively, which might differ on your system. Use the map IDs to check out the maps with the command below:

sudo bpftool map dump id 22

Great! Now we can see how the BPF program was able to trace the number of syscalls a process made, with each pid being listed with the corresponding number of syscalls it made.

eBPF in Production: Observability

eBPF is being used to trace all sorts of activity on Linux systems and servers across the world right now, from keeping track of application health and tracing outliers in system calls made by processes, to observing memory, CPU and power usage.


Networking

Previously, we looked into how eBPF could be used for tracing, particularly to monitor system calls. Let's now move onto another use-case: Networking. eBPF is extensively used in networking applications for a variety of uses, ranging from load balancing to network security. There are several network-related hooks at various levels in the kernel such as the eXpress Data Path(XDP), Traffic Control (TC) and socket hooks. In this workshop, we will be attaching our programs to the XDP hook.

The XDP hook is used to process packets as soon as they arrive on a network interface, before they enter the kernel networking stack. This allows for efficient and speedy packet processing, as the packet is not copied or passed through the rest of the kernel. eBPF programs can even be offloaded to supported network interface cards for faster processing. Thus, a packet can be handled before it even enters the kernel networking stack!

XDP programs have the following general structure:

  1. Intercept the packet received at the network interface
  2. Read or modify the packet contents
  3. Decide on a verdict as to what must be done with the packet, such as:
    • XDP_PASS: Pass the packet to the kernel networking stack as it would have in the absence of the XDP program
    • XDP_DROP: Discard the packet immediately
    • XDP_TX: Send the packet out through the same interface
    • Other verdicts include XDP_REDIRECT and XDP_ABORTED
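
The verdict in step 3 is just an integer return value. The following is a minimal userspace sketch of such a decision, not an actual XDP program: the enum mirrors the values of the kernel's enum xdp_action under prefixed names, and the policy (discard ICMP, pass everything else) is a made-up example.

```c
#include <stdint.h>

/* Mirrors the kernel's enum xdp_action; prefixed to avoid clashing with it. */
enum sketch_xdp_action {
    SKETCH_XDP_ABORTED = 0,
    SKETCH_XDP_DROP = 1,
    SKETCH_XDP_PASS = 2,
    SKETCH_XDP_TX = 3,
    SKETCH_XDP_REDIRECT = 4,
};

/* Hypothetical policy: discard ICMP (protocol 1), pass everything else. */
enum sketch_xdp_action choose_verdict(uint8_t ip_protocol) {
    if (ip_protocol == 1)
        return SKETCH_XDP_DROP;
    return SKETCH_XDP_PASS;
}
```

In a real XDP program, the same decision would be made inside the function placed in the xdp section, after the packet headers have been parsed.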

We'll start off by writing a simple program to trace all Internet Protocol(IPv4) packets received at the network interface.

Note: All packets from 192.168.5.2 are filtered out from the trace logs. This is the loopback address of the host set by the Lima virtual machine (prerequisite for Mac users). The packets from the host to the VM clutter the trace log, hence we do not print them.

#include "vmlinux.h"
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

// 0x0800 indicates that the packet is an IPv4 packet
#define ETH_P_IP 0x0800

// Converting the IP address 192.168.5.x to the unsigned int value it has when
// read from the packet on a little-endian host, to filter out Lima VM packets
#define IP_ADDRESS(x) \
  ((unsigned int)(192 + (168 << 8) + (5 << 16)) + ((unsigned int)(x) << 24))

SEC("xdp")
int trace_net(struct xdp_md *ctx) {
  // The casting to long before void* is done to ensure
  // compatibility between 32 and 64-bit systems
  void *data = (void *)(long)ctx->data;
  void *data_end = (void *)(long)ctx->data_end;

  struct ethhdr *eth_hdr = data;
  // This check to verify that the ethernet header is contained within the
  // network packet is necessary to pass the eBPF verifier
  if (data + sizeof(struct ethhdr) > data_end)
    return XDP_PASS;

  // bpf_ntohs converts the network packet's byte order type to
  // the host's byte order type
  if (bpf_ntohs(eth_hdr->h_proto) == ETH_P_IP) {

    // This check to verify that the IP header is contained within the
    // network packet is necessary to pass the eBPF verifier
    struct iphdr *ip_hdr = data + sizeof(struct ethhdr);
    if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) > data_end)
      return XDP_PASS;

    // Packet protocol values
    // 1 = ICMP
    // 6 = TCP
    // 17 = UDP

    // 192.168.5.2 is the loopback address of the host set by Lima vm
    // The packets from the host to the VM clutter the trace log, hence we do
    // not print them
    if (ip_hdr->saddr != IP_ADDRESS(2)) {

      // printk format specifier for IP addresses:
      // https://www.kernel.org/doc/html/v4.20/core-api/printk-formats.html
      bpf_printk("Src: %pI4, Dst: %pI4, Proto: %d", &ip_hdr->saddr,
                 &ip_hdr->daddr, ip_hdr->protocol);
    }
  }

  return XDP_PASS;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

XDP programs start with the xdp section. Similar to the previous syscall tracer, the program has access to an xdp_md context struct that holds information about each individual packet received at the network interface.

struct xdp_md {
  __u32 data;
  __u32 data_end;
  __u32 data_meta;
  /* Below access go through struct xdp_rxq_info */
  __u32 ingress_ifindex; /* rxq->dev->ifindex */
  __u32 rx_queue_index;  /* rxq->queue_index */
  __u32 egress_ifindex;  /* txq->dev->ifindex */
};

Here, the data and data_end fields store pointers to the start and end of the packet respectively. All of the packet data is contained within this region. We parse the ethernet header from the packet, which has the following structure (Here, be refers to the Big-endian byte order):

struct ethhdr {
  unsigned char h_dest[ETH_ALEN];   /* destination eth addr	*/
  unsigned char h_source[ETH_ALEN]; /* source ether addr	*/
  __be16 h_proto;                   /* packet type ID field	*/
} __attribute__((packed));

Before the header data is accessed, we must include checks to ensure that we do not access memory outside of the packet. In the absence of this check, the program will not be accepted by the eBPF verifier.
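
The same discipline can be shown outside the kernel. The following is a minimal userspace sketch of bounds-checked header parsing over a data/data_end pair; it is an illustration rather than the workshop program, and the function name and SKETCH_ constants are made up. It compares lengths instead of pointers, which plays the same role as the data + sizeof(...) > data_end checks in the XDP program.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_HLEN 14    /* Ethernet header: 6 + 6 + 2 bytes */
#define SKETCH_IP_MIN_HLEN 20 /* IPv4 header without options */

/* Returns the IPv4 protocol number, or -1 if the packet is too short or is
   not IPv4. Every read is preceded by a check against the packet length. */
int ipv4_protocol(const uint8_t *data, const uint8_t *data_end) {
    size_t len = (size_t)(data_end - data);

    /* Same role as: if (data + sizeof(struct ethhdr) > data_end) */
    if (len < SKETCH_ETH_HLEN)
        return -1;

    /* EtherType is stored big-endian at bytes 12 and 13. */
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != 0x0800)
        return -1;

    /* Same role as the second check, which covers the IP header. */
    if (len < SKETCH_ETH_HLEN + SKETCH_IP_MIN_HLEN)
        return -1;

    /* The protocol field is byte 9 of the IPv4 header. */
    return data[SKETCH_ETH_HLEN + 9];
}
```

In userspace a missing check like this is a latent bug; in eBPF the verifier refuses to load the program at all.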

The bpf_ntohs() function converts a 16-bit value from network byte order (Big-Endian) to the host's byte order (usually Little-Endian). If the packet is an Internet Protocol packet, we then parse the IP header and perform a similar memory-access check to prevent reading data outside of the packet.

The IP header struct defined by the kernel has the following fields:

struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
  __u8 ihl : 4, version : 4;
#elif defined(__BIG_ENDIAN_BITFIELD)
  __u8 version : 4, ihl : 4;
#else
#error "Please fix <asm/byteorder.h>"
#endif

  __u8 tos;
  __be16 tot_len;
  __be16 id;
  __be16 frag_off;
  __u8 ttl;
  __u8 protocol;
  __sum16 check;

  __struct_group(/* no tag */, addrs, /* no attrs */,
                 __be32 saddr;
                 __be32 daddr;);
  /* The options start here. */
};

We make use of the saddr (source address), daddr (destination address) and protocol (such as TCP or UDP) fields in this program and print them. Notice that we make use of %pI4, which is a format specifier used to print IPv4 addresses. Finally, we return XDP_PASS, indicating that the packet must be passed to the kernel networking stack as usual.
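
The saddr and daddr fields hold the address bytes in the order they appear on the wire, which is why the IP_ADDRESS macro in the program places 192 in the lowest byte: it assumes a little-endian host. The following is a small userspace sketch that builds the same comparison value without depending on the host's byte order; the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Build the 32-bit value whose in-memory bytes are a, b, c, d, which is how
   an IPv4 address a.b.c.d is laid out in a packet (network byte order). */
uint32_t ipv4_as_stored(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    const uint8_t bytes[4] = {a, b, c, d};
    uint32_t value;
    memcpy(&value, bytes, sizeof value);
    return value;
}
```

On a little-endian host, ipv4_as_stored(192, 168, 5, 2) is 0x0205A8C0, the same value the macro computes for IP_ADDRESS(2).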

To load and attach this program, we shall use the commands from the following Makefile. Save the following file with the name Makefile in the same directory as the program. Run sudo make to compile, load and attach this program to the network interface.

TARGET = net
INTERFACE  = eth0
ARCH = $(shell uname -m | sed 's/x86_64/x86/' | sed 's/aarch64/arm64/')

BPF_OBJ = ${TARGET:=.bpf.o}

all: $(TARGET) $(BPF_OBJ)
.PHONY: all
.PHONY: $(TARGET)

$(TARGET): $(BPF_OBJ)
	bpftool net detach xdp dev $(INTERFACE)
	rm -f /sys/fs/bpf/$(TARGET)
	bpftool prog load $(BPF_OBJ) /sys/fs/bpf/$(TARGET)
	bpftool net attach xdp pinned /sys/fs/bpf/$(TARGET) dev $(INTERFACE)

$(BPF_OBJ): %.o: %.c vmlinux.h
	clang \
	    -target bpf \
	    -D __BPF_TRACING__ \
		-I/usr/include/$(shell uname -m)-linux-gnu \
	    -Wall \
	    -O2 -o $@ -c $<

vmlinux.h:
	bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

clean:
	- bpftool net detach xdp dev $(INTERFACE)
	- rm -f /sys/fs/bpf/$(TARGET)
	- rm $(BPF_OBJ)

The program logs all messages to the kernel trace pipe, which can be viewed using:

sudo cat /sys/kernel/debug/tracing/trace_pipe

Hurray! We see the packets and their metadata being displayed in the trace logs. To detach and unload the program from the interface, run sudo make clean.

eBPF in Production: Networking

eBPF is widely used by organisations to build efficient and versatile networking systems.


Conclusion

We hope you had a fun time learning about eBPF! However, the learning does not stop here. We have several post-workshop activities listed in the GitHub repository, as well as additional resources such as books, papers and labs. We highly recommend that you explore them and dive in deeper. Feel free to reach out with any queries or thoughts you might want to share.

Happy hacking!