====== Piotr Jaroszyński: Usermode debugging under Linux ======

====== How usermode under Linux is done ======

===== Intro =====

Porting code to a substantially different environment is hardest when no other ports exist yet. Fortunately gPXE already supports two ''ARCH''s (''i386'' and ''x86_64'') 
and two ''PLATFORM''s (''pcbios'' on ''i386'' and ''efi'' on both). Because ''efi'' and ''pcbios'' differ, extra layers that abstract over those differences have already been introduced.
That makes the Linux usermode port, despite being quite different conceptually (usermode versus hardware), a much easier task.

Before focusing on the specific layers (called subsystems later, for lack of a better name), let's first look at how the necessary kernel interface is provided (it's not as trivial as one might think).

===== Kernel API =====

Regardless of the specific usage (discussed later in subsystems) some way of accessing the kernel is necessary.

==== Background ====

Because of gPXE's nature it was designed and implemented to be completely self-contained. It doesn't link to stdlib (glibc) or to any other library.
That's a nice feature to have considering the crazy size constraints it has to meet.
For example it allows gPXE to be compiled with the ''-mregparm=3'' and ''-mrtd'' flags, which reduce code size but also make it incompatible with code compiled without them.

On the other hand, the stdlib APIs had to be available to make the programming environment feel natural, and hence many of them were reimplemented internally.

==== linux_ prefix to the rescue ====

To avoid confusion (and in many cases collisions) between gPXE internals and kernel interface it was decided that all of the kernel API functions will be prefixed with ''linux_''. For example:

''include/linux_api.h'':
<code c>
extern int linux_open(const char *pathname, int flags);
extern int linux_close(int fd);
</code>
''include/gpxe/posix_io.h'':
<code c>
extern int open(const char *uri_string);
extern int close(int fd);
</code>

==== Linking to stdlib (glibc) ====

**UPDATE**: That approach has been moved to a separate ''linuxlibc'' ''PLATFORM'' and is available on the [[http://git.etherboot.org/?p=people/peper/gpxe.git;a=shortlog;h=refs/heads/linuxlibc|linuxlibc branch]].

Linking to stdlib is non-trivial, forces some compile flags to be disabled (namely ''-mrtd'' and ''-mregparm'', mentioned earlier) and has [[#the_other_problem_with_stdlib|some other problems]], but it was still the quickest approach for prototyping.
It will also come in handy when debugging problems with the [[#being_self-contained|other, superior approach]].

To work around the symbol collisions with stdlib, all the necessary libs are copied with the offending symbols prefixed with ''linux_''. ''objcopy'' with ''--redefine-syms=remap_file'' is used to achieve that.

An example line from ''remap_file'' simply says:
  read linux_read

All the build/linker details can be seen in the ''arch/x86/Makefile.linux'':
<code>
MEDIA = linux

STDLIBS_BEGIN = $(BIN)/remapped_crt1.o $(BIN)/remapped_crti.o $(BIN)/remapped_crtbeginT.o
STDLIBS_LIBS = $(BIN)/remapped_libc.a $(BIN)/remapped_libgcc.a $(BIN)/remapped_libgcc_eh.a
STDLIBS_LIBS_L = $(foreach lib, $(STDLIBS_LIBS), -l:$(lib))
STDLIBS_END = $(BIN)/remapped_crtend.o $(BIN)/remapped_crtn.o

SYMBOLS_REMAP = arch/x86/linux/symbols_remap

$(BIN)/remapped_% : $(SYMBOLS_REMAP)
        $(QM)$(ECHO) "  [REMAP] $*"
        $(Q)objcopy --redefine-syms=$(SYMBOLS_REMAP) $(shell gcc $(CFLAGS) --print-file-name $*) $@

.PRECIOUS : $(BIN)/remapped_%

TGT_EXTRA_DEPS += $(STDLIBS_BEGIN) $(STDLIBS_LIBS) $(STDLIBS_END)
TGT_LD_FLAGS_PRE += -static $(STDLIBS_BEGIN)
TGT_LD_FLAGS_POST += --start-group $(STDLIBS_LIBS_L) --end-group $(STDLIBS_END)

$(BIN)/%.linux : $(BIN)/%.linux.tmp
        $(QM)$(ECHO) "  [FINISH] $@"
        $(Q)cp -p $< $@
</code>

=== Linker script ===

Amazingly, the default ''ld'' scripts work with just the addition of tables (see ''include/gpxe/tables.h'') in the ''.data'' section:
<code>
  .data           :
  {
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
    *(SORT(.tbl.*))
  }
</code>

=== Prefix ===

stdlib's ''_start'' takes care of everything so the prefix code is empty.


==== Being self-contained ====

To overcome the problems with linking to stdlib we need to implement some of its elementary features ourselves.

=== Linker script ===

A good read for starters is [[http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/gnu-linker/index.html|Using ld, the GNU Linker]].
With that background the currently used linker scripts (''arch/*/scripts/*.lds'') should make more sense.

As we are not going to be linking against stdlib, the linker script should be really simple.
In fact it turned out that there is already a simple enough linker script used for efi (''arch/x86/scripts/efi.lds'') that can be used more or less out of the box.
The only necessary modification is setting the start of the Text segment properly, because not every value works (you can try ''0x0'' and see :)
We can see what the convention is by looking at how the default linker script does it:
pass ''--verbose'' to ''ld'' while compiling a simple program in 32-bit and 64-bit mode.

<code>
$ gcc -m32 foo.c -o foo -Wl,--verbose
$ gcc -m64 foo.c -o foo -Wl,--verbose
</code>

From that we can gather that ''i386'' uses ''0x08048000'' and ''x86_64'' uses ''0x400000'' as the start address.
I haven't been able to find a good explanation of why these particular values are used. Moreover, many other values also seem to work.
Another way of figuring out the specific values is reading the [[http://www.sco.com/developers/devspecs/abi386-4.pdf|i386 ABI]] (page 48)
and [[http://www.x86-64.org/documentation/abi.pdf|AMD64 ABI]] (page 26).

=== Prefix (_start) ===

''_start'', being the default ''ENTRY'' point, is the very first thing that's executed when a new process receives control.
What we want to do in ''_start'' is the minimal work necessary to actually call our ''main()'' function.

To accomplish that we need to know 3 things:
  * What's the state of things when ''_start'' is executed
  * How to actually call ''main()''
  * What to do when ''main()'' returns

The state of the stack and registers at the time of ''_start'' execution is described in the
[[http://www.sco.com/developers/devspecs/abi386-4.pdf|i386 ABI]] (page 54)
and the [[http://www.x86-64.org/documentation/abi.pdf|AMD64 ABI]] (page 28).

The function calling convention is also described in the ABI docs: [[http://www.sco.com/developers/devspecs/abi386-4.pdf|i386 ABI]] (pages 36-38)
and [[http://www.x86-64.org/documentation/abi.pdf|AMD64 ABI]] (pages 15-23). A nice overview is [[http://www.agner.org/optimize/calling_conventions.pdf|calling conventions]].

What we need to do after ''main()'' returns is to call the ''exit'' syscall. Details on that are in the next section.

To actually make use of all that information we need to learn GNU Assembler first though.
I haven't been able to find any really good docs on it, and certainly nothing resembling a tutorial.
Look at [[http://sig9.com/articles/att-syntax|quick syntax]], [[ftp://ftp.estec.esa.nl/pub/ws/wsd/erc32/doc/as.pdf|manual]] and [[http://tigcc.ticalc.org/doc/gnuasm.html|manual2]].

The following simplified ''_start''s should make sense now:

''arch/i386/prefix/linuxprefix.S'':
<code asm>
_start:
        xorl    %ebp, %ebp // ABI wants us to zero the base frame

        popl    %esi       // save argc
        movl    %esp, %edi // save argv

        pushl   %edi // argv -> C arg2
        pushl   %esi // argc -> C arg1

        call    main

        movl    %eax, %ebx // rc -> syscall arg1
        movl    $__NR_exit, %eax
        int     $0x80
</code>
''arch/x86_64/prefix/linuxprefix.S'':
<code asm>
_start:
        xorq    %rbp, %rbp // ABI wants us to zero the base frame

        popq    %rdi       // argc -> C arg1
        movq    %rsp, %rsi // argv -> C arg2

        call    main

        movq    %rax, %rdi // rc -> syscall arg1
        movq    $__NR_exit, %rax
        syscall
</code>
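
Both prefixes rely on the same entry-time stack layout from the ABI docs: the word at the stack pointer is ''argc'', followed by the ''argv'' pointers, a NULL, the environment pointers and another NULL. Below is a minimal C sketch of that layout, walking a simulated stack instead of the real one. The ''demo_'' names are made up for illustration and are not part of gPXE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of what _start finds on the stack (not a gPXE type) */
struct demo_entry_state {
	long argc;
	char **argv;	/* argc pointers followed by a NULL terminator */
	char **envp;	/* environment pointers, NULL-terminated */
};

/*
 * sp models the stack pointer at process entry: one machine word holding
 * argc, then the argv pointers, a NULL, then the envp pointers and a NULL.
 */
static struct demo_entry_state demo_parse_initial_stack(char **sp)
{
	struct demo_entry_state st;

	st.argc = (long)(intptr_t)sp[0];	/* what "pop" fetches in _start */
	st.argv = sp + 1;			/* what is left in the stack pointer */
	st.envp = st.argv + st.argc + 1;	/* skip argv and its NULL terminator */
	return st;
}
```

The assembly versions do the first two steps with a single pop and a register move; ''envp'' is not needed by ''main()'' here and is shown only to complete the picture.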

=== Syscalls ===

To provide the necessary kernel API (functions declared in ''include/linux_api.h'') we need a way to perform syscalls.

A simple way of doing that is implementing our own version of stdlib's ''int syscall(int number, ...);''
as ''long linux_syscall(int number, ...);'' and using that as the building block.

The syscall calling convention is a bit different from the normal function calling convention on both ''i386'' and ''x86_64''.
The [[http://www.x86-64.org/documentation/abi.pdf|AMD64 ABI]] has an informative section (pages 123-124) covering that for ''x86_64''.
For ''i386'' we can look at [[http://www.cin.ufpe.br/~if817/arquivos/asmtut/index.html#syscalls|i386 syscalls]].

With that information we can implement our own ''syscall()''.

''arch/i386/core/linux/linux_syscall.S'':
<code asm>
linux_syscall:
        /* Save registers */
        pushl   %ebx
        pushl   %esi
        pushl   %edi
        pushl   %ebp

        movl    20(%esp), %eax  // C arg1 -> syscall number
        movl    24(%esp), %ebx  // C arg2 -> syscall arg1
        movl    28(%esp), %ecx  // C arg3 -> syscall arg2
        movl    32(%esp), %edx  // C arg4 -> syscall arg3
        movl    36(%esp), %esi  // C arg5 -> syscall arg4
        movl    40(%esp), %edi  // C arg6 -> syscall arg5
        movl    44(%esp), %ebp  // C arg7 -> syscall arg6

        int     $0x80

        /* Restore registers */
        popl    %ebp
        popl    %edi
        popl    %esi
        popl    %ebx

        cmpl    $-4095, %eax
        jae     1f
        ret

1:
        negl    %eax
        movl    %eax, linux_errno
        movl    $-1, %eax
        ret
</code>

''arch/x86_64/core/linux/linux_syscall.S'':
<code asm>
linux_syscall:
        movq    %rdi, %rax    // C arg1 -> syscall number
        movq    %rsi, %rdi    // C arg2 -> syscall arg1
        movq    %rdx, %rsi    // C arg3 -> syscall arg2
        movq    %rcx, %rdx    // C arg4 -> syscall arg3
        movq    %r8, %r10     // C arg5 -> syscall arg4
        movq    %r9, %r8      // C arg6 -> syscall arg5
        movq    8(%rsp), %r9  // C arg7 -> syscall arg6

        syscall

        cmpq    $-4095, %rax
        jae     1f
        ret

1:
        negq    %rax
        movl    %eax, linux_errno
        movq    $-1, %rax
        ret
</code>
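
The tail of both routines implements the usual Linux convention: a raw return value in the range -4095..-1 is a negated error code, anything else is a valid result. Here is a rough C rendering of just that part, as a sketch. ''demo_errno'' stands in for ''linux_errno'' and ''demo_syscall_ret()'' is a hypothetical helper, not a gPXE function.

```c
#include <errno.h>

/* Stand-in for linux_errno, declared here only to keep the sketch self-contained */
static int demo_errno;

/*
 * Mirror of the tail of linux_syscall: raw kernel return values in
 * [-4095, -1] are errors, everything else is passed through untouched.
 */
static long demo_syscall_ret(long raw)
{
	/* Unsigned comparison, same idea as the "cmp $-4095" / "jae" pair */
	if ((unsigned long)raw >= (unsigned long)-4095L) {
		demo_errno = (int)-raw;	/* the "neg" + store to errno */
		return -1;
	}
	return raw;
}
```

The unsigned comparison is what lets large valid results, such as pointers returned by ''mmap'', pass through even though they may look negative as signed numbers.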

With that in place we can implement most of the functions as simple wrappers: 
<code c>
void * linux_mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
{
        return (void*)linux_syscall(__SYSCALL_mmap, addr, length, prot, flags, fd, offset);
}

void * linux_mremap(void * old_address, size_t old_size, size_t new_size, int flags)
{
        return (void*)linux_syscall(__NR_mremap, old_address, old_size, new_size, flags);
}
</code>
Now you can see why our ''syscall()'' returns a ''long'' instead of an ''int''. Otherwise we wouldn't be able to return a pointer on ''x86_64''.

===== Subsystems =====

Having a kernel API in place, the next step is providing all the necessary subsystems on top of it.

==== Background ====

Subsystems provided by a ''PLATFORM'' can be seen in ''config/defaults/$PLATFORM.h''.
Let's look at one of them.

''config/defaults/efi.h'':
<code>
#define UACCESS_EFI
#define IOAPI_EFI
#define PCIAPI_EFI
#define CONSOLE_EFI
#define TIMER_EFI
#define NAP_EFIX86
#define UMALLOC_EFI
#define SMBIOS_EFI
</code>

For each subsystem there is, in general, a corresponding ''include/gpxe/$subsystem.h'' header which includes the headers for the specific implementations. Their location depends on whether they are ''ARCH''-specific.

Most of the subsystems are single-implementation APIs, that is, only one implementation of each can be used. See ''include/gpxe/api.h'' for details.
''CONSOLE'' is a bit different, as every ''ARCH''/''PLATFORM'' can have many of them, and hence it has to use another concept widely adopted within gPXE: linker tables.
Details are in ''include/gpxe/tables.h''. That header also explains why ''#ifdef''s are bad and why so many objects are compiled despite not being used in the final target.
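
To give a feel for the linker table idea, here is a small standalone sketch. It is not how ''tables.h'' does it (gPXE collects sorted ''.tbl.*'' sections via its linker script); instead it assumes GCC or clang with GNU ld on ELF, where the linker automatically provides ''__start_<section>'' and ''__stop_<section>'' symbols for sections whose names are valid C identifiers. All ''demo_'' names are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical table entry type, loosely modelled on a console driver */
struct demo_console {
	const char *name;
};

/* Place an object into the "demo_consoles" section and keep it even if unreferenced */
#define __demo_console __attribute__((used, section("demo_consoles")))

/* Two independent "drivers"; neither is referenced by name anywhere else */
static struct demo_console demo_serial __demo_console = { "serial" };
static struct demo_console demo_vga __demo_console = { "vga" };

/* Section boundaries, provided by the linker (assumption: GNU ld behaviour) */
extern struct demo_console __start_demo_consoles[];
extern struct demo_console __stop_demo_consoles[];

/* Number of entries that ended up in the table */
static size_t demo_console_count(void)
{
	return (size_t)(__stop_demo_consoles - __start_demo_consoles);
}
```

The point is that adding another console only requires defining another tagged object in some compiled file; the code that walks the table does not change and no ''#ifdef'' is needed.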

==== CONSOLE ====

''CONSOLE'' is used for all the input and output that gPXE does. As I/O is trivial in userspace, the ''LINUX_CONSOLE'' implementation is trivial too.
Look at ''include/console.h'' for details on the API.

A slightly simplified ''interface/linux/linux_console.c'':
<code c>
static void linux_putchar(int c) {
	linux_write(1, &c, 1);
}
static int linux_getchar(void) {
	char c;
	linux_read(0, &c, 1);
	return c;
}
struct console_driver linux_console __console_driver = {
	.putchar = linux_putchar,
	.getchar = linux_getchar,
};
</code>

==== TIMER ====

''TIMER'' is about two things:

delaying execution:
<code c>
void udelay(unsigned long usecs);
</code>

and a monotonically increasing counter (used for measuring time intervals mostly):
<code c>
unsigned long currticks(void);
unsigned long ticks_per_sec(void);
</code>

''udelay()'' trivially maps to ''(linux_)usleep()''.

''currticks()'' is a bit trickier as there is no sensible way of getting the value of ''jiffies'' (the Linux kernel tick counter) in userspace.
Instead ''(linux_)gettimeofday()'' is used to emulate ''1000'' ticks per second starting on the first call to ''currticks()''.
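
A minimal sketch of that emulation, using the libc ''gettimeofday()'' in place of the ''linux_'' wrapper so that it compiles on its own. The ''demo_'' names are hypothetical and the real implementation may differ in details.

```c
#include <stddef.h>
#include <sys/time.h>

#define DEMO_TICKS_PER_SEC 1000

/* Convert the time elapsed between start and now into 1 ms ticks */
static unsigned long demo_ticks_between(const struct timeval *start,
					const struct timeval *now)
{
	long long usecs;

	/* tv_usec difference may be negative; the total is still correct */
	usecs = (long long)(now->tv_sec - start->tv_sec) * 1000000LL
		+ (now->tv_usec - start->tv_usec);
	return (unsigned long)(usecs / (1000000 / DEMO_TICKS_PER_SEC));
}

/* Tick counter that starts at zero on the first call */
static unsigned long demo_currticks(void)
{
	static struct timeval start;
	static int started;
	struct timeval now;

	gettimeofday(&now, NULL);
	if (!started) {
		start = now;
		started = 1;
	}
	return demo_ticks_between(&start, &now);
}
```

Note that ''gettimeofday()'' follows the wall clock, so unlike ''jiffies'' the emulated counter can jump if the system time is changed.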

==== UACCESS ====

''UACCESS'' handles access to different kinds of memory. Currently this is a non-issue on Linux usermode as it accesses only the process memory, which has flat addressing.

==== UMALLOC ====

''UMALLOC'' provides, as the name suggests, the well-known malloc gang:
<code c>
userptr_t urealloc(userptr_t userptr, size_t new_size);

static inline userptr_t umalloc(size_t size) {
	return urealloc( UNULL, size);
}
static inline void ufree(userptr_t userptr) {
	urealloc( userptr, 0);
}
</code>

As can be seen, only ''urealloc()'' needs to be implemented, and it trivially maps to ''(linux_)realloc()''.
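
A sketch of what such a mapping can look like, written against the libc ''realloc()'' so it compiles standalone. ''demo_userptr_t'' and ''DEMO_UNULL'' are stand-ins for ''userptr_t'' and ''UNULL''; their exact definitions here are an assumption.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t demo_userptr_t;		/* stand-in for userptr_t */
#define DEMO_UNULL ((demo_userptr_t)0)		/* stand-in for UNULL */

/* Allocate, resize or free a block depending on userptr and new_size */
static demo_userptr_t demo_urealloc(demo_userptr_t userptr, size_t new_size)
{
	void *ptr = (void *)userptr;

	/* realloc(ptr, 0) is implementation-defined, so free explicitly */
	if (new_size == 0) {
		free(ptr);
		return DEMO_UNULL;
	}
	/* With ptr == NULL this behaves like malloc(new_size) */
	return (demo_userptr_t)realloc(ptr, new_size);
}
```

With this single function in place, ''umalloc()'' and ''ufree()'' fall out of the inline wrappers shown above.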

==== NAP ====

''NAP'' is about giving the CPU a break:
<code c>
void cpu_nap(void);
</code>

In the context of Linux usermode that means the process giving up the processor, which can be achieved with a simple ''(linux_)usleep(0)''.

==== SMBIOS ====

''SMBIOS'' doesn't seem to be used by anything currently. The Linux implementation just returns an error.

==== IOAPI ====

Not used in Linux usermode currently.

==== PCIAPI ====

Not used in Linux usermode currently.

===== Networking =====

With the essentials in place, we can look at how networking is provided in Linux usermode.

==== Devices background ====

gPXE handles devices in a hierarchical manner. The building blocks are in ''include/gpxe/device.h''.
<code c>
struct device {
  ...
};
struct root_device {
	struct device dev;
	struct root_driver *driver;
};
struct root_driver {
	int (*probe)(struct root_device * rootdev);
	void (*remove)(struct root_device * rootdev);
};
</code>
The basic idea is that you have one ''root_device'' and a corresponding ''root_driver'' per BUS (or something else that makes sense, like Linux usermode).

The exact implementation is of course BUS specific, but a common way of doing things is having ''$BUS_device''s and ''$BUS_driver''s similarly to ''root_device'' and ''root_driver''.

During initialization the ''root_driver'''s ''probe()'' scans the BUS for hardware.
Upon finding a device it iterates over all ''$BUS_driver''s looking for the one that can handle it (e.g. in the PCI case based upon the pci-id of the device).

A matching driver is supposed to initialize the device. Even more importantly, it is supposed to register a new ''net_device'', which represents a piece of networking hardware (or software in Linux usermode).
The ''net_device'' is responsible for transmitting the actual data.

==== Linux usermode devices ====

Linux usermode devices follow the scheme described above.
The only difference is that instead of physically scanning the BUS, the Linux ''root_driver'' just iterates over a list of requested devices built from the [[#command_line_options|command line options]].

The details can be seen in ''include/gpxe/linux.h'' and ''drivers/linux/linux.c''.
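
The matching step can be pictured with the following sketch. It is not the code from ''drivers/linux/linux.c''; the ''demo_'' structures and functions are simplified, hypothetical stand-ins that only show the "for each requested device, find a driver by name and probe it" loop.

```c
#include <stddef.h>
#include <string.h>

/* One entry per requested device (hypothetical, much simpler than the real thing) */
struct demo_request {
	const char *driver_name;
};

/* A driver that can be asked to set up a requested device */
struct demo_driver {
	const char *name;
	int (*probe)(const struct demo_request *req);
};

/* Example probe routine that always succeeds */
static int demo_probe_ok(const struct demo_request *req)
{
	(void)req;
	return 0;
}

/* Look a driver up by name; returns NULL when nothing matches */
static const struct demo_driver *demo_find_driver(const struct demo_driver *drivers,
						  size_t n_drivers, const char *name)
{
	size_t i;

	for (i = 0; i < n_drivers; i++) {
		if (strcmp(drivers[i].name, name) == 0)
			return &drivers[i];
	}
	return NULL;
}

/* Root-level probe: walk the requests instead of scanning a bus */
static int demo_root_probe(const struct demo_driver *drivers, size_t n_drivers,
			   const struct demo_request *requests, size_t n_requests)
{
	int probed = 0;
	size_t i;

	for (i = 0; i < n_requests; i++) {
		const struct demo_driver *drv =
			demo_find_driver(drivers, n_drivers, requests[i].driver_name);

		if (!drv)
			continue;	/* no driver for this request, skip it */
		if (drv->probe(&requests[i]) == 0)
			probed++;
	}
	return probed;	/* number of devices successfully probed */
}
```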

==== Tap linux driver ====

=== Why tap? ===

Tap was chosen over raw sockets because it has many advantages and its only disadvantage is a slightly harder setup:
  * it is possible to connect to the localhost
  * easier to tcpdump
  * faster
  * doesn't have to be run with root powers

=== Implementation ===

The tap driver is as easy as it possibly gets.

''drivers/linux/tap.c'':
<code c>
static int tap_transmit(struct net_device * netdev, struct io_buffer * iobuf)
{
	struct tap_nic * nic = netdev->priv;
	int rc;

	iob_pad(iobuf, ETH_ZLEN);

	rc = linux_write(nic->fd, iobuf->data, iobuf->tail - iobuf->data);
	DBGC(nic, "tap %p wrote %d bytes\n", nic, rc);
	netdev_tx_complete(netdev, iobuf);

	return 0;
}
</code>
In ''transmit()'' it can just send out the packet immediately with a simple ''(linux_)write()''.

<code c>
static void tap_poll(struct net_device * netdev)
{
	struct tap_nic * nic = netdev->priv;
	int r;
	char buf[RX_BUF_SIZE];
	struct io_buffer * iobuf;

	while ((r = linux_read(nic->fd, buf, RX_BUF_SIZE)) > 0) {
		iobuf = alloc_iob(RX_BUF_SIZE);
		if (!iobuf)
			break; /* out of memory: drop this packet and stop for now */
		memcpy(iobuf->data, buf, r);
		iob_put(iobuf, r);
		netdev_rx(netdev, iobuf);
	}
}
</code>

In ''poll()'' it can just loop on a non-blocking ''(linux_)read()'' to get all the available packets.
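
The "loop until the non-blocking read has nothing left" pattern can be tried without a tap device. The sketch below uses plain libc calls and a datagram socket pair as a stand-in for the tap file descriptor (an assumption made purely so the example runs unprivileged); the ''demo_'' names are hypothetical.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_RX_BUF_SIZE 1536

/* Create a packet-preserving pair of descriptors to play with */
static int demo_make_packet_pair(int fds[2])
{
	return socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
}

/* Put fd into non-blocking mode; returns 0 on success, -1 on error */
static int demo_set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) ? -1 : 0;
}

/* Read packets until nothing more is available; returns how many were read */
static int demo_drain(int fd)
{
	char buf[DEMO_RX_BUF_SIZE];
	int packets = 0;
	ssize_t r;

	while ((r = read(fd, buf, sizeof(buf))) > 0)
		packets++;
	/* r == -1 with errno EAGAIN just means "no more data for now" */
	return packets;
}
```

Because the descriptor is non-blocking, the loop ends as soon as the queue is empty instead of stalling the caller, which is exactly what a ''poll()'' routine needs.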

==== Mapping the driver API to kernel ====

Work in progress.

===== Command line options =====

Command line options were introduced to control some aspects of the gPXE usermode.
Currently the only option is for setting up a network device:
<code>
--net <driver>[,option=value[,option=value[,...]]]
</code>
The only driver currently is ''tap'' and it requires the ''if'' option so it's more like:
<code>
--net tap,if=<ifname>[,option=value[,option=value[,...]]]
</code>
The ''if'' option doesn't have to come first, though.

Multiple ''--net'' options can be passed.
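
To illustrate the syntax, here is a small sketch that splits the argument of one ''--net'' option into a driver name and its ''option=value'' pairs. It is not the code from ''hci/linux_args.c''; the ''demo_'' names, the in-place splitting and the fixed option limit are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_OPTS 8

/* Parsed form of "driver,option=value,..." (hypothetical structure) */
struct demo_net_arg {
	char *driver;
	size_t n_opts;
	struct {
		char *name;
		char *value;
	} opts[DEMO_MAX_OPTS];
};

/* Split arg in place. Returns 0 on success, -1 on malformed input. */
static int demo_parse_net(char *arg, struct demo_net_arg *out)
{
	char *next;

	out->driver = arg;
	out->n_opts = 0;

	/* Everything up to the first comma is the driver name */
	next = strchr(arg, ',');
	if (next)
		*next++ = '\0';
	if (*out->driver == '\0')
		return -1;

	/* The rest is a comma-separated list of option=value pairs */
	while (next && *next) {
		char *cur = next;
		char *eq;

		next = strchr(cur, ',');
		if (next)
			*next++ = '\0';

		eq = strchr(cur, '=');
		if (!eq || eq == cur || out->n_opts == DEMO_MAX_OPTS)
			return -1;
		*eq = '\0';
		out->opts[out->n_opts].name = cur;
		out->opts[out->n_opts].value = eq + 1;
		out->n_opts++;
	}
	return 0;
}

/* Look up an option by name; returns NULL if it was not given */
static const char *demo_net_opt(const struct demo_net_arg *arg, const char *name)
{
	size_t i;

	for (i = 0; i < arg->n_opts; i++) {
		if (strcmp(arg->opts[i].name, name) == 0)
			return arg->opts[i].value;
	}
	return NULL;
}
```

Looking options up by name is what makes their order irrelevant: a driver such as ''tap'' can simply ask for ''if'' and fail if the lookup returns NULL.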

==== Implementation ====

The implementation of parsing the command line options is pretty straightforward. It can be seen in ''hci/linux_args.c''.

==== The other problem with stdlib ====

When linking with stdlib the only way of grabbing command line arguments is by modifying ''core/main.c'', which isn't particularly nice:
<code c>
#ifdef PLATFORM_linux
__asmcall int main ( int argc, char * argv[] ) {
#else
__asmcall int main ( void ) {
#endif
	...
#ifdef PLATFORM_linux
	if (parse_args(argc, argv) != 0) {
		return -1;
	}
#endif
</code>

It can be avoided by implementing our own ''_start'' routine, which could save ''argc'' and ''argv'' somewhere accessible from a simple ''__init_fn'' (functions that are run as part of the initialization), making the ''core/main.c'' modification unnecessary.
That's part of the [[#being_self-contained|being self-contained work]].
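
The idea can be sketched in C as follows. The globals and function names are hypothetical, and in practice the saving would happen in the assembly prefix rather than in a C function.

```c
#include <stddef.h>

/* Hypothetical globals filled in by _start before main() runs */
static int demo_saved_argc;
static char **demo_saved_argv;

/* What _start would do (in assembly) before calling main() */
static void demo_save_args(int argc, char **argv)
{
	demo_saved_argc = argc;
	demo_saved_argv = argv;
}

/* An initialisation function can then read the arguments on its own */
static const char *demo_first_arg(void)
{
	return (demo_saved_argc > 1) ? demo_saved_argv[1] : NULL;
}
```

With something like this, ''main()'' keeps its ''void'' signature on every platform and the argument parsing lives entirely in Linux-specific code.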