Porting any code to a substantially different environment is hardest when no other ports have been done yet. Fortunately, gPXE already supports two ARCHs (i386 and x86_64) and two PLATFORMs (pcbios on i386 and efi on both). Because of the differences between efi and pcbios, extra layers that make up for them have already been introduced.
That makes the Linux usermode port, despite being conceptually quite different (usermode versus hardware), a much easier task.
Before focusing on the specific layers (called subsystems later, for lack of a better name), let's first look at how the necessary kernel interface is provided (it's not as trivial as one might think).
Regardless of the specific usage (discussed later in subsystems) some way of accessing the kernel is necessary.
Because of its nature, gPXE was designed and implemented to be completely self-contained. It doesn't link to stdlib (glibc) or to any other library.
That's a nice feature to have considering the crazy size constraints it has to meet.
For example, it allows gPXE to be compiled with the -mregparm=3 and -mrtd flags, which reduce code size but also make it incompatible with code compiled without them.
On the other hand, the stdlib APIs were needed to make the programming environment feel natural, and hence many of them were reimplemented internally.
To avoid confusion (and in many cases collisions) between gPXE internals and the kernel interface, it was decided that all of the kernel API functions would be prefixed with linux_. For example:
include/linux_api.h:
extern int linux_open(const char *pathname, int flags);
extern int linux_close(int fd);
include/gpxe/posix_io.h:
extern int open(const char *uri_string);
extern int close(int fd);
UPDATE: That approach has been moved to a separate linuxlibc PLATFORM and is available on the linuxlibc branch.
Despite being non-trivial, forcing some compile flags to be disabled (namely -mrtd and -mregparm, mentioned earlier) and having some other problems, linking to stdlib was still the quickest approach for prototyping.
It will also come in handy when debugging problems with the other, superior approach.
To work around the symbol collisions with stdlib, all the necessary libs are copied with the offending symbols prefixed with linux_. objcopy with --redefine-syms=remap_file is used to achieve that.
An example line from remap_file simply says:
read linux_read
All the build/linker details can be seen in arch/x86/Makefile.linux:
MEDIA = linux
STDLIBS_BEGIN = $(BIN)/remapped_crt1.o $(BIN)/remapped_crti.o $(BIN)/remapped_crtbeginT.o
STDLIBS_LIBS = $(BIN)/remapped_libc.a $(BIN)/remapped_libgcc.a $(BIN)/remapped_libgcc_eh.a
STDLIBS_LIBS_L = $(foreach lib, $(STDLIBS_LIBS), -l:$(lib))
STDLIBS_END = $(BIN)/remapped_crtend.o $(BIN)/remapped_crtn.o
SYMBOLS_REMAP = arch/x86/linux/symbols_remap
$(BIN)/remapped_% : $(SYMBOLS_REMAP)
	$(QM)$(ECHO) " [REMAP] $*"
	$(Q)objcopy --redefine-syms=$(SYMBOLS_REMAP) $(shell gcc $(CFLAGS) --print-file-name $*) $@
.PRECIOUS : $(BIN)/remapped_%
TGT_EXTRA_DEPS += $(STDLIBS_BEGIN) $(STDLIBS_LIBS) $(STDLIBS_END)
TGT_LD_FLAGS_PRE += -static $(STDLIBS_BEGIN)
TGT_LD_FLAGS_POST += --start-group $(STDLIBS_LIBS_L) --end-group $(STDLIBS_END)
$(BIN)/%.linux : $(BIN)/%.linux.tmp
	$(QM)$(ECHO) " [FINISH] $@"
	$(Q)cp -p $< $@
Amazingly, the default ld scripts work with just the addition of tables (see include/gpxe/tables.h) in the .data section:
.data : {
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
    *(SORT(.tbl.*))
}
stdlib's _start takes care of everything, so the prefix code is empty.
To overcome the problems with linking to stdlib we need to implement some of its elementary features ourselves.
A good read for starters is Using ld, the GNU Linker.
With that background the currently used linker scripts (arch/*/scripts/*.lds) should make more sense.
As we are not going to be linking against stdlib, the linker script should be really simple.
In fact it turned out that there is already a simple enough linker script used for efi (arch/x86/scripts/efi.lds) that can be used more or less out of the box.
The only necessary modification is setting the start of the text segment properly, because not every value works (you can try 0x0 and see :)
We can see what the convention is by looking at how the default linker script does it, by passing --verbose to ld while compiling a simple program in 32-bit and 64-bit mode.
$ gcc -m32 foo.c -o foo -Wl,--verbose
$ gcc -m64 foo.c -o foo -Wl,--verbose
From that we can gather that i386 uses 0x08048000 and x86_64 uses 0x400000 as the start address.
I haven't been able to find a good explanation of why these values in particular are used. Moreover, many other values also seem to work.
Another way of figuring out the specific values is reading the i386 ABI (page 48) and the AMD64 ABI (page 26).
_start, being the default ENTRY point, is the very first thing that's executed when a new process receives control.
What we want to do in _start is the minimal work necessary to actually call our main() function.
To accomplish that we need to know 3 things:
1. what the state of the stack and registers is when _start is executed
2. how to call main()
3. what to do after main() returns
The state of the stack and registers at the time of _start execution is described in the i386 ABI (page 54) and the AMD64 ABI (page 28).
The function calling convention is also described in the ABI docs: i386 ABI (pages 36-38) and AMD64 ABI (pages 15-23). A nice overview is calling conventions.
What we need to do after main() returns is to call the exit syscall. Details on that are in the next section.
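As a rough illustration of the entry state that the ABI documents describe: when _start gets control, the stack pointer points at argc, followed by the argv pointers, a NULL terminator, and then the environment pointers. The C sketch below decodes that layout from an array of pointer-sized words. The names struct sketch_entry_args and sketch_read_entry_stack are hypothetical and not part of gPXE, and the real work is of course done in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for what _start can recover from the stack. */
struct sketch_entry_args {
    int argc;
    char **argv;
    char **envp;
};

/*
 * Decode the initial process stack, modelled as an array of
 * pointer-sized words: sp[0] holds argc, sp[1..argc] are the argv
 * pointers, sp[argc + 1] is a NULL terminator and the environment
 * pointers follow. 'sp' stands for the stack pointer value on entry.
 */
static struct sketch_entry_args sketch_read_entry_stack(char **sp)
{
    struct sketch_entry_args args;

    args.argc = (int)(intptr_t)sp[0];
    args.argv = sp + 1;
    args.envp = sp + 1 + args.argc + 1;
    return args;
}
```

This mirrors what the assembly prefix does with popl/popq (fetch argc) and the following move of the stack pointer (argv starts right after it).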
To actually make use of all that information we need to learn the GNU Assembler first, though. I haven't been able to find any particularly good docs on it, and certainly nothing resembling a tutorial. Look at quick syntax, manual and manual2.
The following simplified _starts should make sense now:
arch/i386/prefix/linuxprefix.S:
_start:
    xorl %ebp, %ebp         // ABI wants us to zero the base frame
    popl %esi               // save argc
    movl %esp, %edi         // save argv
    pushl %edi              // argv -> C arg2
    pushl %esi              // argc -> C arg1
    call main
    movl %eax, %ebx         // rc -> syscall arg1
    movl $__NR_exit, %eax
    int $0x80
arch/x86_64/prefix/linuxprefix.S:
_start:
    xorq %rbp, %rbp         // ABI wants us to zero the base frame
    popq %rdi               // argc -> C arg1
    movq %rsp, %rsi         // argv -> C arg2
    call main
    movq %rax, %rdi         // rc -> syscall arg1
    movq $__NR_exit, %rax
    syscall
To provide the necessary kernel API (functions declared in include/linux_api.h) we need a way to perform syscalls.
A simple way of doing that is implementing our own int syscall(int number, ...); as long linux_syscall(int number, ...); and using that as the building block.
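To get a feel for the "one generic syscall function plus thin wrappers" idea before writing any assembly, it can be tried on a normal Linux host. In the sketch below libc's syscall() stands in for gPXE's own linux_syscall(); the sketch_ names are hypothetical and only serve the illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_* syscall numbers */
#include <sys/types.h>
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical wrappers built on a single generic entry point.
 * Each one only passes a syscall number and its arguments through.
 */
static long sketch_getpid(void)
{
    return syscall(SYS_getpid);
}

static long sketch_write(int fd, const void *buf, size_t count)
{
    return syscall(SYS_write, fd, buf, count);
}
```

Once a self-contained linux_syscall() exists, wrappers of this shape no longer need libc at all.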
The syscall calling convention is a bit different from the normal function calling convention on both i386 and x86_64.
The AMD64 ABI (pages 123-124) has an informative section covering that for x86_64.
For i386 we can look at i386 syscalls.
With that information we can implement our own syscall().
arch/i386/core/linux/linux_syscall.S:
linux_syscall:
    /* Save registers */
    pushl %ebx
    pushl %esi
    pushl %edi
    pushl %ebp
    movl 20(%esp), %eax     // C arg1 -> syscall number
    movl 24(%esp), %ebx     // C arg2 -> syscall arg1
    movl 28(%esp), %ecx     // C arg3 -> syscall arg2
    movl 32(%esp), %edx     // C arg4 -> syscall arg3
    movl 36(%esp), %esi     // C arg5 -> syscall arg4
    movl 40(%esp), %edi     // C arg6 -> syscall arg5
    movl 44(%esp), %ebp     // C arg7 -> syscall arg6
    int $0x80
    /* Restore registers */
    popl %ebp
    popl %edi
    popl %esi
    popl %ebx
    cmpl $-4095, %eax
    jae 1f
    ret
1:
    negl %eax
    movl %eax, linux_errno
    movl $-1, %eax
    ret
arch/x86_64/core/linux/linux_syscall.S:
linux_syscall:
    movq %rdi, %rax         // C arg1 -> syscall number
    movq %rsi, %rdi         // C arg2 -> syscall arg1
    movq %rdx, %rsi         // C arg3 -> syscall arg2
    movq %rcx, %rdx         // C arg4 -> syscall arg3
    movq %r8, %r10          // C arg5 -> syscall arg4
    movq %r9, %r8           // C arg6 -> syscall arg5
    movq 8(%rsp), %r9       // C arg7 -> syscall arg6
    syscall
    cmpq $-4095, %rax
    jae 1f
    ret
1:
    negq %rax
    movl %eax, linux_errno
    movq $-1, %rax
    ret
With that in place we can implement most of the functions as simple wrappers:
void * linux_mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
{
    return (void*)linux_syscall(__SYSCALL_mmap, addr, length, prot, flags, fd, offset);
}
void * linux_mremap(void * old_address, size_t old_size, size_t new_size, int flags)
{
    return (void*)linux_syscall(__NR_mremap, old_address, old_size, new_size, flags);
}
Now you can see why our syscall() returns a long instead of an int. Otherwise we wouldn't be able to return a pointer on x86_64.
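The error handling at the end of both assembly routines (the comparison against -4095) may be easier to follow in C. The kernel reports failure by returning a negated error code in the range [-4095, -1]; anything else is a valid result, including large addresses. A minimal sketch of that conversion, with sketch_errno as a hypothetical stand-in for linux_errno:

```c
/* Hypothetical stand-in for the linux_errno variable set by the assembly. */
static int sketch_errno;

/*
 * Convert a raw kernel return value into the libc-style convention.
 * Seen as unsigned, the error range [-4095, -1] is the top 4095
 * values, which is why a single unsigned comparison is enough.
 */
static long sketch_syscall_ret(long raw)
{
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        sketch_errno = (int)-raw;   /* store the positive error code */
        return -1;
    }
    return raw;                     /* success: pass the value through */
}
```

The unsigned comparison corresponds to the jae (jump if above or equal) instruction used in the assembly.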
Having a kernel API in place, the next step is providing all the necessary subsystems on top of it.
Subsystems provided by a PLATFORM can be seen in config/defaults/$PLATFORM.h.
Let's look at one of them.
config/defaults/efi.h:
#define UACCESS_EFI
#define IOAPI_EFI
#define PCIAPI_EFI
#define CONSOLE_EFI
#define TIMER_EFI
#define NAP_EFIX86
#define UMALLOC_EFI
#define SMBIOS_EFI
For each subsystem there is, in general, a corresponding include/gpxe/$subsystem.h header which includes headers for specific implementations. Their location depends on whether they are ARCH-specific.
Most of the subsystems are single-implementation APIs, that is, only one implementation of each can be used. See include/gpxe/api.h for details.
CONSOLE is a bit different, as every ARCH/PLATFORM can have many of them, and hence it has to use another concept widely adopted within gPXE: linker tables.
Details are in include/gpxe/tables.h. That header also explains why #ifdefs are bad and why so many objects are compiled despite not being used in the final target.
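The general idea behind linker tables can be sketched with plain GNU toolchain features. Note that this is not gPXE's mechanism (which relies on sorted .tbl.* sections and the macros in tables.h); it is a simplified variant that assumes GCC or Clang with GNU ld on an ELF target, where the linker provides __start_ and __stop_ symbols for any section whose name is a valid C identifier. All sketch_ names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical, cut-down console driver; not gPXE's real structure. */
struct sketch_console {
    const char *name;
    void (*putchar)(int c);
};

/* Register a driver by placing a pointer to it in a dedicated section. */
#define SKETCH_CONSOLE(drv) \
    static const struct sketch_console *const drv##_entry \
    __attribute__((section("sketch_consoles"), used)) = &(drv)

static void sketch_null_putchar(int c) { (void)c; }

static const struct sketch_console sketch_console_a = { "a", sketch_null_putchar };
static const struct sketch_console sketch_console_b = { "b", sketch_null_putchar };
SKETCH_CONSOLE(sketch_console_a);
SKETCH_CONSOLE(sketch_console_b);

/* Section bounds, defined by the linker rather than by any C file. */
extern const struct sketch_console *const __start_sketch_consoles[];
extern const struct sketch_console *const __stop_sketch_consoles[];

static size_t sketch_console_count(void)
{
    return (size_t)(__stop_sketch_consoles - __start_sketch_consoles);
}

/* Send a character to every registered console, whoever defined it. */
static void sketch_putchar_all(int c)
{
    const struct sketch_console *const *p;

    for (p = __start_sketch_consoles; p < __stop_sketch_consoles; p++)
        (*p)->putchar(c);
}
```

The point is that the code iterating over the table never names the individual drivers; linking an object in or leaving it out changes the table without any #ifdef.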
CONSOLE is used for all the input and output that gPXE does. As I/O is trivial in userspace, LINUX_CONSOLE couldn't have been any different.
Look at include/console.h for details on the API.
A slightly simplified interface/linux/linux_console.c:
static void linux_putchar(int c)
{
    /* Write one byte to stdout (fd 1) */
    linux_write(1, &c, 1);
}
static int linux_getchar(void)
{
    char c;
    /* Read one byte from stdin (fd 0) */
    linux_read(0, &c, 1);
    return c;
}
struct console_driver linux_console __console_driver = {
    .putchar = linux_putchar,
    .getchar = linux_getchar,
};
TIMER is about two things:
delaying execution:
void udelay(unsigned long usecs);
and a monotonically increasing counter (used mostly for measuring time intervals):
unsigned long currticks(void);
unsigned long ticks_per_sec(void);
udelay() trivially maps to (linux_)usleep().
currticks() is a bit trickier, as there is no sensible way of getting the value of jiffies (the Linux kernel tick counter) in userspace.
Instead, (linux_)gettimeofday() is used to emulate 1000 ticks per second, starting on the first call to currticks().
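A minimal sketch of such an emulation, using libc's gettimeofday() in place of the linux_ wrapper. The sketch_ names are hypothetical, and returning 0 on the very first call is an assumption about what "starting on the first call" means.

```c
#include <stddef.h>
#include <sys/time.h>   /* gettimeofday() */

#define SKETCH_TICKS_PER_SEC 1000UL

static unsigned long sketch_ticks_per_sec(void)
{
    return SKETCH_TICKS_PER_SEC;
}

/*
 * Ticks (milliseconds) elapsed since the first call. This is derived
 * from wall-clock time, so it is only as monotonic as the system clock.
 */
static unsigned long sketch_currticks(void)
{
    static struct timeval start;
    static int started;
    struct timeval now;
    long long usecs;

    gettimeofday(&now, NULL);
    if (!started) {
        start = now;    /* remember the reference point */
        started = 1;
    }
    usecs = (long long)(now.tv_sec - start.tv_sec) * 1000000LL
          + (long long)(now.tv_usec - start.tv_usec);
    return (unsigned long)(usecs / 1000);
}
```

Computing the difference in microseconds first avoids rounding trouble when the tv_usec difference is negative.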
UACCESS handles access to different kinds of memory. Currently this is a non-issue in Linux usermode, as it accesses only the process memory, which has flat addressing.
UMALLOC provides, as the name suggests, the well-known malloc gang:
userptr_t urealloc(userptr_t userptr, size_t new_size);
static inline userptr_t umalloc(size_t size)
{
    return urealloc(UNULL, size);
}
static inline void ufree(userptr_t userptr)
{
    urealloc(userptr, 0);
}
As can be seen, only urealloc() needs to be implemented, and it trivially maps to (linux_)realloc().
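The mapping can be sketched on a normal host with libc's realloc(). Here a "user pointer" is simply a void pointer, and all sketch_ names are hypothetical. The zero-size case is handled explicitly because the behaviour of realloc(ptr, 0) is not portable.

```c
#include <stdlib.h>

/* In this sketch a "user pointer" is just an ordinary pointer. */
typedef void *sketch_userptr_t;
#define SKETCH_UNULL ((sketch_userptr_t)NULL)

static sketch_userptr_t sketch_urealloc(sketch_userptr_t ptr, size_t new_size)
{
    /* Size 0 means "free"; do not rely on realloc(ptr, 0) for that. */
    if (new_size == 0) {
        free(ptr);
        return SKETCH_UNULL;
    }
    /* A NULL ptr makes realloc() behave like malloc(). */
    return realloc(ptr, new_size);
}

static inline sketch_userptr_t sketch_umalloc(size_t size)
{
    return sketch_urealloc(SKETCH_UNULL, size);
}

static inline void sketch_ufree(sketch_userptr_t ptr)
{
    sketch_urealloc(ptr, 0);
}
```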
NAP is about giving the CPU a break:
void cpu_nap(void);
In the context of Linux usermode that means the process giving up the processor, which can be achieved with a simple (linux_)usleep(0).
SMBIOS doesn't seem to be used by anything currently. The Linux implementation just returns an error.
IOAPI and PCIAPI are not used in Linux usermode currently.
With the essentials in place, we can look at how networking is provided in Linux usermode.
gPXE handles devices in a hierarchical manner. The building blocks are in include/gpxe/device.h.
struct device { ... };
struct root_device {
    struct device dev;
    struct root_driver *driver;
};
struct root_driver {
    int (*probe)(struct root_device * rootdev);
    void (*remove)(struct root_device * rootdev);
};
The basic idea is that you have one root_device and a corresponding root_driver per BUS (or something else that makes sense, like Linux usermode).
The exact implementation is of course BUS-specific, but a common way of doing things is having $BUS_devices and $BUS_drivers, similarly to root_device and root_driver.
During initialization the root_driver's probe() scans the BUS for hardware.
Upon finding a device it iterates over all $BUS_drivers looking for the one that can handle it (e.g. in the PCI case based upon the PCI ID of the device).
A matching driver is supposed to initialize the device. Even more importantly, it is supposed to register a new net_device, which represents a piece of networking hardware (or software in Linux usermode).
The net_device is responsible for transmitting the actual data.
Linux usermode devices follow the scheme described above.
The only difference is that instead of physically scanning the BUS, the Linux root_driver just iterates over a list of requested devices based on the command line options.
The details can be seen in include/gpxe/linux.h and drivers/linux/linux.c.
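The probe loop can be pictured roughly as follows. This is a sketch under assumptions, not the code from drivers/linux/linux.c: the structures and all sketch_ names are hypothetical, and drivers are matched by name where a PCI bus would match by device ID.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical request created from one --net command line option. */
struct sketch_request {
    const char *driver;   /* requested driver name, e.g. "tap" */
    const char *ifname;   /* value of the "if" option, or NULL */
};

/* Hypothetical driver description, matched by name. */
struct sketch_driver {
    const char *name;
    int (*probe)(const struct sketch_request *req);
};

static int sketch_tap_probe(const struct sketch_request *req)
{
    /* A real driver would open the tap device here. */
    return req->ifname != NULL ? 0 : -1;
}

static const struct sketch_driver sketch_drivers[] = {
    { "tap", sketch_tap_probe },
};

static const struct sketch_driver *sketch_find_driver(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_drivers) / sizeof(sketch_drivers[0]); i++)
        if (strcmp(sketch_drivers[i].name, name) == 0)
            return &sketch_drivers[i];
    return NULL;
}

/* Walk the requested devices; return how many were probed successfully. */
static int sketch_root_probe(const struct sketch_request *reqs, size_t count)
{
    int found = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        const struct sketch_driver *drv = sketch_find_driver(reqs[i].driver);

        if (drv != NULL && drv->probe(&reqs[i]) == 0)
            found++;
    }
    return found;
}
```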
Tap was chosen over raw sockets because it has many advantages and the only disadvantage is a bit harder setup:
The tap driver is as easy as it possibly gets.
drivers/linux/tap.c:
static int tap_transmit(struct net_device * netdev, struct io_buffer * iobuf)
{
    struct tap_nic * nic = netdev->priv;
    int rc;

    /* Pad to the minimum Ethernet frame length before sending */
    iob_pad(iobuf, ETH_ZLEN);
    rc = linux_write(nic->fd, iobuf->data, iobuf->tail - iobuf->data);
    DBGC(nic, "tap %p wrote %d bytes\n", nic, rc);
    netdev_tx_complete(netdev, iobuf);
    return 0;
}
In transmit() it can just send out the packet immediately with a simple (linux_)write().
static void tap_poll(struct net_device * netdev)
{
    struct tap_nic * nic = netdev->priv;
    int r;
    char buf[RX_BUF_SIZE];
    struct io_buffer * iobuf;

    /* Read until the non-blocking fd has no more packets */
    while ((r = linux_read(nic->fd, buf, RX_BUF_SIZE)) > 0) {
        iobuf = alloc_iob(RX_BUF_SIZE);
        if (!iobuf)
            break;  /* out of memory: this packet is dropped */
        memcpy(iobuf->data, buf, r);
        iob_put(iobuf, r);
        netdev_rx(netdev, iobuf);
    }
}
In poll() it can just loop on a non-blocking (linux_)read() to get all the available packets.
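The non-blocking read loop can be tried in isolation on any file descriptor. The sketch below uses plain libc calls instead of the linux_ wrappers, and the sketch_ names are hypothetical. Unlike a tap device, a pipe or socket does not deliver one packet per read(), so the function only counts bytes.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success. */
static int sketch_set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Read until nothing more is available right now; return total bytes. */
static size_t sketch_drain(int fd)
{
    char buf[2048];
    size_t total = 0;
    ssize_t r;

    for (;;) {
        r = read(fd, buf, sizeof(buf));
        if (r > 0) {
            total += (size_t)r;
            continue;
        }
        if (r < 0 && errno == EINTR)
            continue;   /* interrupted by a signal: retry */
        break;          /* EOF, EAGAIN (no data yet) or a real error */
    }
    return total;
}
```

Because the descriptor is non-blocking, the loop returns as soon as the queue is empty instead of stalling the rest of gPXE.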
Work in progress.
Command line options were introduced to control some aspects of the gPXE usermode. Currently the only option is for setting up a network device:
--net <driver>[,option=value[,option=value[,...]]]
The only driver currently is tap, and it requires the if option, so it's more like:
--net tap,if=<ifname>[,option=value[,option=value[,...]]]
although if doesn't have to be the first option.
Multiple --net options can be passed.
The implementation of parsing the command line options is pretty straightforward. It can be seen in hci/linux_args.c.
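Parsing the value of a single --net option could look roughly like this. It is a sketch of the syntax shown above, not the code from hci/linux_args.c; the structures, the option limit and all sketch_ names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_OPTS 8   /* arbitrary limit for this sketch */

struct sketch_net_opt {
    char *name;
    char *value;
};

struct sketch_net_spec {
    char *driver;
    struct sketch_net_opt opts[SKETCH_MAX_OPTS];
    size_t n_opts;
};

/*
 * Parse "driver[,option=value[,...]]" in place (the string is
 * modified). Returns 0 on success and -1 on malformed input.
 */
static int sketch_parse_net(char *arg, struct sketch_net_spec *spec)
{
    char *tok, *next, *eq;

    spec->n_opts = 0;
    next = strchr(arg, ',');
    if (next != NULL)
        *next++ = '\0';
    if (*arg == '\0')
        return -1;              /* missing driver name */
    spec->driver = arg;

    for (tok = next; tok != NULL; tok = next) {
        next = strchr(tok, ',');
        if (next != NULL)
            *next++ = '\0';
        eq = strchr(tok, '=');
        if (eq == NULL || eq == tok || spec->n_opts == SKETCH_MAX_OPTS)
            return -1;          /* not of the form option=value */
        *eq = '\0';
        spec->opts[spec->n_opts].name = tok;
        spec->opts[spec->n_opts].value = eq + 1;
        spec->n_opts++;
    }
    return 0;
}

/* Look up an option by name; returns NULL if it was not given. */
static const char *sketch_net_opt(const struct sketch_net_spec *spec,
                                  const char *name)
{
    size_t i;

    for (i = 0; i < spec->n_opts; i++)
        if (strcmp(spec->opts[i].name, name) == 0)
            return spec->opts[i].value;
    return NULL;
}
```

Looking options up by name rather than by position is what allows if to appear anywhere in the list.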
When linking with stdlib the only way of grabbing command line arguments is by modifying core/main.c, which isn't particularly nice:
#ifdef PLATFORM_linux
__asmcall int main ( int argc, char * argv[] ) {
#else
__asmcall int main ( void ) {
#endif
    ...
#ifdef PLATFORM_linux
    if (parse_args(argc, argv) != 0) {
        return -1;
    }
#endif
It can be avoided by implementing our own _start routine, which could save argc and argv somewhere accessible from a simple __init_fn (functions that are run as part of the initialization), making the core/main.c modification unnecessary.
That's part of the work on being self-contained.