Piotr Jaroszyński: Usermode debugging under Linux

How usermode under Linux is done

Intro

Porting code to a substantially different environment is hardest when no other ports exist yet. Fortunately, gPXE already supports two ARCHs (i386 and x86_64) and two PLATFORMs (pcbios on i386 and efi on both). Because efi and pcbios differ, extra layers that abstract those differences have already been introduced. That makes the Linux usermode port a much easier task, despite it being conceptually quite different (usermode versus hardware).

Before focusing on the specific layers (called subsystems from here on, for lack of a better name), let's first look at how the necessary kernel interface is provided; it's not as trivial as one might think.

Kernel API

Regardless of the specific usage (discussed later in subsystems) some way of accessing the kernel is necessary.

Background

By its nature, gPXE was designed and implemented to be completely self-contained. It doesn't link to stdlib (glibc) or to any other library. That's a nice feature to have considering the tight size constraints it has to meet. For example, it allows gPXE to be compiled with the -mregparm=3 and -mrtd flags, which reduce code size but also make it incompatible with code compiled without them.

On the other hand, stdlib APIs were needed to make the programming environment feel natural, and hence many of them were reimplemented internally.

linux_ prefix to the rescue

To avoid confusion (and in many cases collisions) between gPXE internals and the kernel interface, all of the kernel API functions are prefixed with linux_. For example:

include/linux_api.h:

extern int linux_open(const char *pathname, int flags);
extern int linux_close(int fd);

include/gpxe/posix_io.h:

extern int open(const char *uri_string);
extern int close(int fd);

Linking to stdlib (glibc)

UPDATE: That approach has been moved to a separate linuxlibc PLATFORM and is available on the linuxlibc branch.

Linking to stdlib is non-trivial, forces some compile flags to be disabled (namely -mrtd and -mregparm, mentioned earlier) and has some other problems, but it was still the quickest approach for prototyping. It will also come in handy when debugging problems with the other, superior approach.

To work around the symbol collisions with stdlib, all the necessary libs are copied with the offending symbols prefixed with linux_. objcopy with --redefine-syms=remap_file is used to achieve that.

An example line from remap_file simply says:

read linux_read

All the build/linker details can be seen in the arch/x86/Makefile.linux:

MEDIA = linux

STDLIBS_BEGIN = $(BIN)/remapped_crt1.o $(BIN)/remapped_crti.o $(BIN)/remapped_crtbeginT.o
STDLIBS_LIBS = $(BIN)/remapped_libc.a $(BIN)/remapped_libgcc.a $(BIN)/remapped_libgcc_eh.a
STDLIBS_LIBS_L = $(foreach lib, $(STDLIBS_LIBS), -l:$(lib))
STDLIBS_END = $(BIN)/remapped_crtend.o $(BIN)/remapped_crtn.o

SYMBOLS_REMAP = arch/x86/linux/symbols_remap

$(BIN)/remapped_% : $(SYMBOLS_REMAP)
        $(QM)$(ECHO) "  [REMAP] $*"
        $(Q)objcopy --redefine-syms=$(SYMBOLS_REMAP) $(shell gcc $(CFLAGS) --print-file-name $*) $@

.PRECIOUS : $(BIN)/remapped_%

TGT_EXTRA_DEPS += $(STDLIBS_BEGIN) $(STDLIBS_LIBS) $(STDLIBS_END)
TGT_LD_FLAGS_PRE += -static $(STDLIBS_BEGIN)
TGT_LD_FLAGS_POST += --start-group $(STDLIBS_LIBS_L) --end-group $(STDLIBS_END)

$(BIN)/%.linux : $(BIN)/%.linux.tmp
        $(QM)$(ECHO) "  [FINISH] $@"
        $(Q)cp -p $< $@

Linker script

Amazingly, the default ld scripts work with just the addition of tables (see include/gpxe/tables.h) to the .data section:

  .data           :
  {
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
    *(SORT(.tbl.*))
  }

Prefix

stdlib's _start takes care of everything, so the prefix code is empty.

Being self-contained

To overcome the problems with linking to stdlib we need to implement some of its elementary features ourselves.

Linker script

A good read for starters is Using ld, the GNU Linker. With that background the currently used linker scripts (arch/*/scripts/*.lds) should make more sense.

As we are not going to link against stdlib, the linker script can be really simple. In fact, it turned out that the linker script already used for efi (arch/x86/scripts/efi.lds) is simple enough to be used more or less out of the box. The only necessary modification is setting the start of the text segment properly, because not every value works (you can try 0x0 and see). To see what the convention is, we can look at how the default linker script does it by passing --verbose to ld while compiling a simple program in 32-bit and 64-bit mode:

$ gcc -m32 foo.c -o foo -Wl,--verbose
$ gcc -m64 foo.c -o foo -Wl,--verbose

From that we can gather that i386 uses 0x08048000 and x86_64 uses 0x400000 as the start address. I haven't been able to find a good explanation of why these values in particular are used; many other values also seem to work. Another way of finding the specific values is reading the i386 ABI (page 48) and the AMD64 ABI (page 26).

Prefix (_start)

_start, being the default ENTRY point, is the very first thing that's executed when a new process receives control. What we want to do in _start is the minimal work necessary to actually call our main() function.

To accomplish that we need to know 3 things:

  • What's the state of things when _start is executed
  • How to actually call main()
  • What to do when main() returns

The state of the stack and registers at the time _start is executed is described in the i386 ABI (page 54) and the AMD64 ABI (page 28).

The function calling convention is also described in the ABI docs: i386 ABI (pages 36-38) and AMD64 ABI (pages 15-23). A nice overview is calling conventions.

What we need to do after main() returns is to call the exit syscall. Details on that are in the next section.

To actually make use of all that information we need to learn the GNU Assembler first, though. I haven't been able to find any particularly good docs on it, and certainly nothing resembling a tutorial. Look at quick syntax, manual and manual2.

The following simplified _start routines should make sense now:

arch/i386/prefix/linuxprefix.S:

_start:
        xorl    %ebp, %ebp // ABI wants us to zero the base frame
 
        popl    %esi       // save argc
        movl    %esp, %edi // save argv
 
        pushl   %edi // argv -> C arg2
        pushl   %esi // argc -> C arg1
 
        call    main
 
        movl    %eax, %ebx // rc -> syscall arg1
        movl    $__NR_exit, %eax
        int     $0x80

arch/x86_64/prefix/linuxprefix.S:

_start:
        xorq    %rbp, %rbp // ABI wants us to zero the base frame
 
        popq    %rdi       // argc -> C arg1
        movq    %rsp, %rsi // argv -> C arg2
 
        call    main
 
        movq    %rax, %rdi // rc -> syscall arg1
        movq    $__NR_exit, %rax
        syscall
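
To make the stack handling concrete, here is a small C model of what both _start routines do with the initial stack. It is only a sketch: struct start_args and sketch_parse_initial_stack() are hypothetical names, and the stack is simulated with an ordinary array of machine words rather than the real process stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for what _start hands over to main(). */
struct start_args {
	long argc;
	char **argv;
	char **envp;
};

/*
 * Model of the initial process stack as laid out by the kernel:
 *   sp[0]            argc
 *   sp[1..argc]      argv pointers
 *   sp[argc + 1]     NULL terminator of argv
 *   sp[argc + 2..]   envp pointers, NULL-terminated
 * Every slot is one machine word wide.
 */
struct start_args sketch_parse_initial_stack(void **sp)
{
	struct start_args args;

	/* "pop" argc: read the top word and advance the stack pointer */
	args.argc = (long)(intptr_t)sp[0];
	sp++;

	/* the stack pointer now points at argv[0] */
	args.argv = (char **)sp;

	/* envp starts right after argv's NULL terminator */
	args.envp = (char **)(sp + args.argc + 1);

	return args;
}
```

The pop followed by copying the stack pointer is exactly the pair of instructions at the top of each _start above; the envp computation is not needed by the routines shown and is included only to complete the picture.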

Syscalls

To provide the necessary kernel API (functions declared in include/linux_api.h) we need a way to perform syscalls.

A simple way of doing that is to implement our own version of int syscall(int number, ...); as long linux_syscall(int number, ...); and use that as the building block.

The syscall calling convention differs a bit from the normal function calling convention on both i386 and x86_64. The AMD64 ABI (pages 123-124) has an informative section covering that for x86_64. For i386 we can look at i386 syscalls.

With that information we can implement our own syscall().

arch/i386/core/linux/linux_syscall.S:

linux_syscall:
        /* Save registers */
        pushl   %ebx
        pushl   %esi
        pushl   %edi
        pushl   %ebp
 
        movl    20(%esp), %eax  // C arg1 -> syscall number
        movl    24(%esp), %ebx  // C arg2 -> syscall arg1
        movl    28(%esp), %ecx  // C arg3 -> syscall arg2
        movl    32(%esp), %edx  // C arg4 -> syscall arg3
        movl    36(%esp), %esi  // C arg5 -> syscall arg4
        movl    40(%esp), %edi  // C arg6 -> syscall arg5
        movl    44(%esp), %ebp  // C arg7 -> syscall arg6
 
        int     $0x80
 
        /* Restore registers */
        popl    %ebp
        popl    %edi
        popl    %esi
        popl    %ebx
 
        cmpl    $-4095, %eax
        jae     1f
        ret
 
1:
        negl    %eax
        movl    %eax, linux_errno
        movl    $-1, %eax
        ret

arch/x86_64/core/linux/linux_syscall.S:

linux_syscall:
        movq    %rdi, %rax    // C arg1 -> syscall number
        movq    %rsi, %rdi    // C arg2 -> syscall arg1
        movq    %rdx, %rsi    // C arg3 -> syscall arg2
        movq    %rcx, %rdx    // C arg4 -> syscall arg3
        movq    %r8, %r10     // C arg5 -> syscall arg4
        movq    %r9, %r8      // C arg6 -> syscall arg5
        movq    8(%rsp), %r9  // C arg7 -> syscall arg6
 
        syscall
 
        cmpq    $-4095, %rax
        jae     1f
        ret
 
1:
        negq    %rax
        movl    %eax, linux_errno
        movq    $-1, %rax
        ret
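
The error handling at the end of both routines can be expressed in C as follows. This is a sketch of the same check rather than code from gPXE: sketch_syscall_result() and sketch_linux_errno are hypothetical names, with the latter standing in for linux_errno (assumed here to be a plain int).

```c
/* Stand-in for gPXE's linux_errno (assumption: a plain int). */
int sketch_linux_errno;

/*
 * The kernel signals an error by returning a value in [-4095, -1];
 * anything else is a valid result. Both assembly routines test this
 * with an unsigned comparison against -4095 (cmp followed by jae).
 */
long sketch_syscall_result(long raw)
{
	if ((unsigned long)raw >= (unsigned long)-4095L) {
		sketch_linux_errno = (int)-raw; /* negate to get the errno value */
		return -1;
	}
	return raw;
}
```

The unsigned comparison is what lets large valid results, such as pointers returned by mmap, pass through untouched while still catching the small negative error range.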

With that in place we can implement most of the functions as simple wrappers:

void * linux_mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
{
        return (void*)linux_syscall(__SYSCALL_mmap, addr, length, prot, flags, fd, offset);
}
 
void * linux_mremap(void * old_address, size_t old_size, size_t new_size, int flags)
{
        return (void*)linux_syscall(__NR_mremap, old_address, old_size, new_size, flags);
}

Now you can see why our syscall() returns a long instead of an int. Otherwise we wouldn't be able to return a pointer on x86_64.
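
The same wrapper pattern can be tried out in an ordinary hosted program. The sketch below uses glibc's syscall() as a stand-in for linux_syscall(); the sketch_ names are hypothetical and are not part of include/linux_api.h.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc declares syscall() only with this (or similar) */
#endif
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Each wrapper only forwards its arguments, prefixed by the syscall number. */
long sketch_write(int fd, const void *buf, size_t count)
{
	return syscall(SYS_write, fd, buf, count);
}

long sketch_read(int fd, void *buf, size_t count)
{
	return syscall(SYS_read, fd, buf, count);
}

long sketch_getpid(void)
{
	return syscall(SYS_getpid);
}
```

Like linux_syscall(), glibc's syscall() returns a long and reports failure as -1, so the wrappers need no extra error handling of their own.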

Subsystems

Having a kernel API in place, the next step is providing all the necessary subsystems on top of it.

Background

Subsystems provided by a PLATFORM can be seen in config/defaults/$PLATFORM.h. Let's look at one of them.

config/defaults/efi.h:

#define UACCESS_EFI
#define IOAPI_EFI
#define PCIAPI_EFI
#define CONSOLE_EFI
#define TIMER_EFI
#define NAP_EFIX86
#define UMALLOC_EFI
#define SMBIOS_EFI

For each subsystem there is, in general, a corresponding include/gpxe/$subsystem.h header, which includes the headers for the specific implementations. Their location depends on whether they are ARCH-specific.

Most of the subsystems are single-implementation APIs, that is, only one implementation of each can be used. See include/gpxe/api.h for details. CONSOLE is a bit different: every ARCH/PLATFORM can have many consoles, so it has to use another concept widely adopted within gPXE, namely linker tables. Details are in include/gpxe/tables.h. That header also explains why #ifdefs are bad and why so many objects are compiled despite not being used in the final target.
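
For readers unfamiliar with linker tables, the general idea can be sketched in a few lines of C. Note that this is not how include/gpxe/tables.h implements it: the sketch relies on the __start_ and __stop_ symbols that GNU ld generates for sections whose names are valid C identifiers, and every name in it is hypothetical.

```c
#include <stddef.h>

/* Hypothetical, cut-down console driver: just an output hook. */
struct sketch_console_driver {
	void (*put_char)(int c);
};

/* Place an entry in the "sketch_consoles" section. Any object file can
 * contribute entries without the others knowing about it. */
#define __sketch_console \
	__attribute__((section("sketch_consoles"), used))

/* Provided by GNU ld: they bracket all entries gathered at link time. */
extern struct sketch_console_driver __start_sketch_consoles[];
extern struct sketch_console_driver __stop_sketch_consoles[];

static char serial_log[16], screen_log[16];
static size_t serial_len, screen_len;

static void serial_put_char(int c)
{
	if (serial_len < sizeof(serial_log))
		serial_log[serial_len++] = (char)c;
}

static void screen_put_char(int c)
{
	if (screen_len < sizeof(screen_log))
		screen_log[screen_len++] = (char)c;
}

/* Two independent "drivers" registering themselves in the table. */
static struct sketch_console_driver serial_console __sketch_console = {
	.put_char = serial_put_char,
};
static struct sketch_console_driver screen_console __sketch_console = {
	.put_char = screen_put_char,
};

/* Number of drivers that ended up in the final binary. */
size_t sketch_console_count(void)
{
	return (size_t)(__stop_sketch_consoles - __start_sketch_consoles);
}

/* Output goes to every registered console, with no #ifdef in sight. */
void sketch_putchar(int c)
{
	struct sketch_console_driver *drv;

	for (drv = __start_sketch_consoles; drv < __stop_sketch_consoles; drv++)
		drv->put_char(c);
}
```

The point of the design is that the set of consoles is decided purely by which objects are linked in; the code that iterates over the table never names a specific driver.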

CONSOLE

CONSOLE is used for all the input and output that gPXE does. As I/O is trivial in userspace, LINUX_CONSOLE couldn't have been any different. Look at include/console.h for details on the API.

A slightly simplified interface/linux/linux_console.c:

static void linux_putchar(int c) {
	char ch = c; /* pass a single byte, independent of endianness */
	linux_write(1, &ch, 1);
}
static int linux_getchar(void) {
	char c;
	linux_read(0, &c, 1);
	return c;
}
struct console_driver linux_console __console_driver = {
	.putchar = linux_putchar,
	.getchar = linux_getchar,
};

TIMER

TIMER is about two things:

delaying execution:

void udelay(unsigned long usecs);

and a monotonically increasing counter (used for measuring time intervals mostly):

unsigned long currticks(void);
unsigned long ticks_per_sec(void);

udelay() trivially maps to (linux_)usleep().

currticks() is a bit trickier, as there is no sensible way of getting the value of jiffies (the Linux kernel tick counter) in userspace. Instead (linux_)gettimeofday() is used to emulate 1000 ticks per second, starting on the first call to currticks().
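
A minimal sketch of that emulation, using the libc gettimeofday() in place of linux_gettimeofday(), could look like this. The sketch_ names and the split into two functions are assumptions made for illustration, not the actual gPXE code.

```c
#include <stddef.h>
#include <sys/time.h>

#define SKETCH_TICKS_PER_SEC 1000

/* Whole ticks elapsed between two timestamps. */
unsigned long sketch_ticks_between(const struct timeval *start,
				   const struct timeval *now)
{
	/* Work in microseconds first so that a negative tv_usec
	 * difference is absorbed before the division. */
	long long usecs = (long long)(now->tv_sec - start->tv_sec) * 1000000LL
			+ (now->tv_usec - start->tv_usec);

	return (unsigned long)(usecs / (1000000LL / SKETCH_TICKS_PER_SEC));
}

/* Tick counter that starts at zero on the first call. */
unsigned long sketch_currticks(void)
{
	static struct timeval start;
	static int started;
	struct timeval now;

	gettimeofday(&now, NULL);
	if (!started) {
		start = now;
		started = 1;
	}
	return sketch_ticks_between(&start, &now);
}
```

One caveat of this approach is that gettimeofday() follows the wall clock, so the counter is only as monotonic as the system time is.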

UACCESS

UACCESS handles access to different kinds of memory. Currently this is a non-issue on Linux usermode as it accesses only the process memory, which has flat addressing.

UMALLOC

UMALLOC provides, as the name suggests, the well-known malloc gang:

userptr_t urealloc(userptr_t userptr, size_t new_size);
 
static inline userptr_t umalloc(size_t size) {
	return urealloc( UNULL, size);
}
static inline void ufree(userptr_t userptr) {
	urealloc( userptr, 0);
}

As can be seen, only urealloc() needs to be implemented, and it trivially maps to (linux_)realloc().
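
To illustrate the mapping in a hosted program, the following sketch builds the three functions on top of the libc realloc() and free(). The sketch_ names are hypothetical, and modelling userptr_t as uintptr_t with a zero UNULL is an assumption; the real definitions live in the UACCESS headers.

```c
#include <stdlib.h>
#include <stdint.h>

/* Assumption: a user pointer is an integer wide enough for a pointer. */
typedef uintptr_t sketch_userptr_t;
#define SKETCH_UNULL ((sketch_userptr_t)0)

/* The single primitive: allocate, resize or free depending on the arguments. */
sketch_userptr_t sketch_urealloc(sketch_userptr_t ptr, size_t new_size)
{
	if (new_size == 0) {
		free((void *)ptr); /* free(NULL) is a no-op */
		return SKETCH_UNULL;
	}
	return (sketch_userptr_t)realloc((void *)ptr, new_size);
}

/* Allocation is a resize from nothing. */
sketch_userptr_t sketch_umalloc(size_t size)
{
	return sketch_urealloc(SKETCH_UNULL, size);
}

/* Freeing is a resize to nothing. */
void sketch_ufree(sketch_userptr_t ptr)
{
	sketch_urealloc(ptr, 0);
}
```

Handling a size of zero explicitly avoids depending on what realloc(ptr, 0) does, which varies between C library implementations.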

NAP

NAP is about giving the CPU a break:

void cpu_nap(void);

In the context of Linux usermode that means the process giving up the processor, which can be achieved with a simple (linux_)usleep(0).

SMBIOS

SMBIOS doesn't seem to be used by anything currently. The Linux implementation just returns an error.

IOAPI

Not used in Linux usermode currently.

PCIAPI

Not used in Linux usermode currently.

Networking

With the essentials in place, we can look at how networking is provided in Linux usermode.

Devices background

gPXE handles devices in a hierarchical manner. The building blocks are in include/gpxe/device.h.

struct device {
  ...
};
struct root_device {
	struct device dev;
	struct root_driver *driver;
};
struct root_driver {
	int (*probe)(struct root_device * rootdev);
	void (*remove)(struct root_device * rootdev);
};

The basic idea is that you have one root_device and a corresponding root_driver per BUS (or something else that makes sense, like Linux usermode).

The exact implementation is of course BUS specific, but a common way of doing things is having $BUS_devices and $BUS_drivers similarly to root_device and root_driver.

During initialization the root_driver's probe() scans the BUS for hardware. Upon finding a device it iterates over all $BUS_drivers looking for the one that can handle it (e.g. in the PCI case based upon the pci-id of the device).

A matching driver is supposed to initialize the device. Even more importantly, it is supposed to register a new net_device, which represents a piece of networking hardware (or software in Linux usermode). The net_device is responsible for transmitting the actual data.

Linux usermode devices

Linux usermode devices follow the scheme described above. The only difference is that instead of physically scanning the BUS, the Linux root_driver just iterates over a list of requested devices based on the command line options.

The details can be seen in include/gpxe/linux.h and drivers/linux/linux.c.
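
The control flow of such a probe can be sketched as below. None of these names come from include/gpxe/linux.h; the request and driver structures are hypothetical and reduced to the bare minimum needed to show the matching loop.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: one device requested on the command line. */
struct sketch_request {
	const char *driver; /* e.g. "tap" */
};

/* Hypothetical driver entry, analogous to a $BUS_driver. */
struct sketch_linux_driver {
	const char *name;
	int (*probe)(const struct sketch_request *req);
};

static int sketch_tap_probed; /* counts successful tap probes */

static int sketch_tap_probe(const struct sketch_request *req)
{
	(void)req; /* a real driver would open the device here */
	sketch_tap_probed++;
	return 0;
}

static const struct sketch_linux_driver sketch_drivers[] = {
	{ "tap", sketch_tap_probe },
};

/* Walk the requests instead of scanning a bus; returns how many
 * requests were claimed by a driver. */
int sketch_root_probe(const struct sketch_request *reqs, size_t count)
{
	size_t n_drivers = sizeof(sketch_drivers) / sizeof(sketch_drivers[0]);
	size_t i, j;
	int claimed = 0;

	for (i = 0; i < count; i++) {
		for (j = 0; j < n_drivers; j++) {
			if (strcmp(reqs[i].driver, sketch_drivers[j].name) != 0)
				continue;
			if (sketch_drivers[j].probe(&reqs[i]) == 0)
				claimed++;
			break; /* first matching driver wins */
		}
	}
	return claimed;
}
```

Requests naming an unknown driver are silently skipped in this sketch; a real implementation would more likely report them as an error.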

Tap linux driver

Why tap?

Tap was chosen over raw sockets because it has many advantages, and its only disadvantage is a slightly harder setup:

  • possibility to connect to the localhost
  • easier to tcpdump
  • faster
  • doesn't have to be run with root powers

Implementation

The tap driver is as easy as it possibly gets.

drivers/linux/tap.c:

static int tap_transmit(struct net_device * netdev, struct io_buffer * iobuf)
{
	struct tap_nic * nic = netdev->priv;
	int rc;
 
	iob_pad(iobuf, ETH_ZLEN);
 
	rc = linux_write(nic->fd, iobuf->data, iobuf->tail - iobuf->data);
	DBGC(nic, "tap %p wrote %d bytes\n", nic, rc);
	netdev_tx_complete(netdev, iobuf);
 
	return 0;
}

In transmit() it can just send out the packet immediately with a simple (linux_)write().

static void tap_poll(struct net_device * netdev)
{
	struct tap_nic * nic = netdev->priv;
	int r;
	char buf[RX_BUF_SIZE];
	struct io_buffer * iobuf;
 
	while ((r = linux_read(nic->fd, buf, RX_BUF_SIZE)) > 0) {
		iobuf = alloc_iob(RX_BUF_SIZE);
		if (!iobuf)
			break; /* out of memory: drop the packet */
		memcpy(iobuf->data, buf, r);
		iob_put(iobuf, r);
		netdev_rx(netdev, iobuf);
	}
}

In poll() it can just loop on a non-blocking (linux_)read() to get all the available packets.
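
The non-blocking read loop can be tried in isolation with any file descriptor. In the sketch below all names are hypothetical, and the libc read() and fcntl() take the place of the linux_ wrappers. A tap device returns one packet per read(); a pipe, which is convenient for experimenting, is a byte stream and does not preserve such boundaries.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Switch fd to non-blocking mode, so that read() fails with EAGAIN
 * instead of sleeping when nothing is available. */
int sketch_set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

static size_t sketch_rx_bytes; /* total bytes seen by the example consumer */

/* Example consumer, standing in for netdev_rx(). */
static void sketch_count_rx(const char *data, size_t len)
{
	(void)data;
	sketch_rx_bytes += len;
}

/* Hand everything that is currently readable to rx(); returns the
 * number of successful reads. */
int sketch_poll(int fd, void (*rx)(const char *data, size_t len))
{
	char buf[2048];
	ssize_t r;
	int chunks = 0;

	while ((r = read(fd, buf, sizeof(buf))) > 0) {
		rx(buf, (size_t)r);
		chunks++;
	}
	return chunks;
}
```

Because the loop stops on any non-positive return value, an empty descriptor (EAGAIN), end of file and a genuine error all end the poll in the same way.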

Mapping the driver API to kernel

Work in progress.

Command line options

Command line options were introduced to control some aspects of the gPXE usermode. Currently the only option is for setting up a network device:

--net <driver>[,option=value[,option=value[,...]]]

Currently the only driver is tap, and it requires the if option, so in practice the syntax is:

--net tap,if=<ifname>[,option=value[,option=value[,...]]]

The if option doesn't have to come first, though.

Multiple --net options can be passed.

Implementation

The implementation of parsing the command line options is pretty straightforward. It can be seen in hci/linux_args.c.
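
As an illustration of the option syntax, here is one way the argument of a single --net option could be split into a driver name and its option=value pairs. This sketch is not taken from hci/linux_args.c; all names in it, as well as the limit of eight options, are assumptions.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_OPTS 8

struct sketch_net_opt {
	char *name;
	char *value;
};

struct sketch_net_request {
	char *driver;
	struct sketch_net_opt opts[SKETCH_MAX_OPTS];
	size_t opt_count;
};

/* Parse "driver[,option=value[,...]]" in place: arg is modified and
 * the pointers stored in req point into it. Returns 0 on success and
 * -1 on malformed input. */
int sketch_parse_net(char *arg, struct sketch_net_request *req)
{
	char *tok, *eq, *next;

	req->opt_count = 0;
	req->driver = arg;

	/* Everything up to the first comma is the driver name */
	next = strchr(arg, ',');
	if (next)
		*next++ = '\0';
	if (*req->driver == '\0')
		return -1;

	/* The rest is a comma-separated list of option=value pairs */
	while (next && *next) {
		tok = next;
		next = strchr(tok, ',');
		if (next)
			*next++ = '\0';

		eq = strchr(tok, '=');
		if (!eq || eq == tok || req->opt_count == SKETCH_MAX_OPTS)
			return -1;
		*eq = '\0';

		req->opts[req->opt_count].name = tok;
		req->opts[req->opt_count].value = eq + 1;
		req->opt_count++;
	}
	return 0;
}

/* Look up an option by name; returns NULL if it was not given. */
const char *sketch_net_opt(const struct sketch_net_request *req,
			   const char *name)
{
	size_t i;

	for (i = 0; i < req->opt_count; i++) {
		if (strcmp(req->opts[i].name, name) == 0)
			return req->opts[i].value;
	}
	return NULL;
}
```

Looking options up by name rather than by position is what allows if to appear anywhere in the list; a driver such as tap would then reject the request when sketch_net_opt(req, "if") returns NULL.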

The other problem with stdlib

When linking with stdlib the only way of grabbing command line arguments is by modifying core/main.c, which isn't particularly nice:

#ifdef PLATFORM_linux
__asmcall int main ( int argc, char * argv[] ) {
#else
__asmcall int main ( void ) {
#endif
	...
#ifdef PLATFORM_linux
	if (parse_args(argc, argv) != 0) {
		return -1;
	}
#endif

This can be avoided by implementing our own _start routine, which could save argc and argv somewhere accessible from a simple __init_fn (a function that is run as part of the initialization), making the core/main.c modification unnecessary. That's part of the being self-contained work.