Stefan Hajnoczi: GDB Remote Debugging

Week 4

Milestones:

Get latest GDB stub work into mainline.
Modern bzImage prefix for gPXE.

Mon Jun 16

The gdbstub2 branch is now ready for mainline review. Diffs against gPXE master are here. Once it is merged I will update the documentation and encourage others to use GDB.

gPXE needs modern bzImage support so that GRUB, lilo, and SYSLINUX can load it. This is my next piece of work after the GDB stub. There is already code in etherboot to make a bzImage. The old code doesn't work by default on today's popular bootloaders since the Linux bzImage header it supplies is outdated. I am investigating what needs to be done for GRUB, lilo, SYSLINUX, etherboot, and gPXE to load a gPXE bzImage.

Tue Jun 17

Git commit:

[bzImage] Make gpxe.lkrn a zImage 2.07

I am trying out bootloaders on gpxe.lkrn images. We were afraid that the outdated Linux zImage prefix no longer works with modern bootloaders. Here are results for unmodified gPXE (I have not yet attempted to implement bzImage):

GRUB boots gpxe.lkrn successfully. Here is a script to create a GRUB/gPXE boot floppy:

#!/bin/sh
set -e
dd if=/dev/zero of=grub.img bs=1024 count=1440
losetup /dev/loop0 grub.img
mkfs /dev/loop0
mount /dev/loop0 /mnt
mkdir -p /mnt/boot/grub
cp /boot/grub/stage1 /boot/grub/stage2 /mnt/boot/grub/
cat >/mnt/boot/grub/menu.lst <<EOF
title=gPXE
root (fd0)
kernel /boot/gpxe.lkrn
EOF
cp bin/gpxe.lkrn /mnt/boot/
umount /mnt
grub --device-map=/dev/null <<EOF
device (fd0) /dev/loop0
root (fd0)
setup (fd0)
quit
EOF
losetup -d /dev/loop0

SYSLINUX boots gpxe.lkrn successfully. Here is a script to create a boot floppy:

#!/bin/sh
set -e
dd if=/dev/zero of=syslinux.img bs=1024 count=1440
mkfs.msdos syslinux.img
mount -o loop syslinux.img /mnt
cp bin/gpxe.lkrn /mnt/gpxe.zi
cat >/mnt/SYSLINUX.CFG <<EOF
default gpxe.zi
EOF
umount /mnt
syslinux syslinux.img

lilo fails to boot gpxe.lkrn. QEMU stops with a triple fault. I still need to look into this. Here is a script to create a boot floppy:

#!/bin/sh
set -e
dd if=/dev/zero of=lilo.img bs=1024 count=1440
losetup /dev/loop0 lilo.img
mkfs /dev/loop0
mount /dev/loop0 /mnt
mkdir /mnt/etc /mnt/boot
cp bin/gpxe.lkrn /mnt/gpxe.zi
cat >/mnt/etc/lilo.conf <<EOF
boot    =/dev/loop0       
disk    =/dev/loop0      
bios    =0x00           # 1.44MB disk geometry
sectors =18
heads   =2
cylinders =80
install =/mnt/boot/boot.b        
map     =/mnt/boot/map   
backup  =/dev/null       
image   =/mnt/gpxe.zi
EOF
/tmp/lilo/sbin/lilo -C /mnt/etc/lilo.conf
umount /mnt
losetup -d /dev/loop0

gPXE fails to boot gpxe.lkrn, since it supports only the newer bzImage format and not the old zImage format. Testing was easy:

qemu -bootp gpxe.lkrn -tftp bin bin/gpxe.usb

Etherboot 5.4.3 boots gpxe.lkrn successfully. I used wraplinux to make an NBI file from gpxe.lkrn.

Updated lkrnprefix.S to zImage 2.07. The image is still only a zImage since the non-real-mode code loads at 0x10000. A bzImage loads its non-real-mode code at 0x100000, i.e. immediately above the first 1 MB of memory. Perhaps gpxe.lkrn can be a full bzImage, but I think that the A20 line will prevent us from accessing 0x100000.

GRUB boots successfully.
Lilo still fails. I need to investigate this; I am probably not using it properly.
SYSLINUX boots successfully.
gPXE boots successfully with a small patch to bzimage.c. Need to discuss this with mcb30.
Etherboot boots successfully.
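
The zImage/bzImage distinction can also be checked from the image header itself. The following is a minimal sketch, not part of the gPXE tree: it reads the "HdrS" signature (offset 0x202), the protocol version (offset 0x206) and the LOADED_HIGH bit of loadflags (offset 0x211) as laid out in the Linux x86 boot protocol. The function names byte_at and lkrn_info are made up for this example.

```shell
#!/bin/sh
# Sketch: report boot protocol version and zImage/bzImage status of an image.
# Offsets follow the Linux x86 boot protocol; helper names are hypothetical.

# Print the byte at decimal offset $2 of file $1 as a decimal number.
byte_at() {
    od -An -tu1 -j "$2" -N 1 "$1" | tr -d ' \n'
}

lkrn_info() {
    img=$1
    # The "HdrS" signature lives at offset 0x202 (514).
    sig=$(dd if="$img" bs=1 skip=514 count=4 2>/dev/null)
    if [ "$sig" != "HdrS" ]; then
        echo "no HdrS signature (pre-2.00 header)"
        return 1
    fi
    # The version is a little-endian 16-bit field at 0x206 (518).
    minor=$(byte_at "$img" 518)
    major=$(byte_at "$img" 519)
    # loadflags is at 0x211 (529); bit 0 is LOADED_HIGH.
    flags=$(byte_at "$img" 529)
    if [ $(( ${flags:-0} & 1 )) -eq 1 ]; then
        kind="bzImage (protected-mode code at 0x100000)"
    else
        kind="zImage (protected-mode code at 0x10000)"
    fi
    printf 'protocol %d.%02d, %s\n' "${major:-0}" "${minor:-0}" "$kind"
}
```

Running `lkrn_info bin/gpxe.lkrn` on an image built with the updated prefix should then report protocol 2.07 and a zImage, assuming the prefix fills in these fields as the boot protocol describes.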

Wed Jun 18

Lilo still triple-faults when loading gpxe.lkrn. I set up a virtual machine with Damn Small Linux to ensure a clean environment. The DSL kernel boots successfully while gpxe.lkrn fails. Here is the triple fault information from QEMU:

qemu: fatal: triple fault
EAX=60000000 EBX=0000fee8 ECX=00002900 EDX=00001d8a
ESI=0001ffff EDI=0000ff51 EBP=0000f9c4 ESP=0000f96e
EIP=0000074c EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 00000000 ffffffff 00cf9300
CS =0008 0000f600 0000ffff 00009b00
SS =0010 00090000 0000ffff 00009309
DS =0018 00000000 ffffffff 00cf9300
FS =0018 00000000 ffffffff 00cf9300
GS =0018 00000000 ffffffff 00cf9300
LDT=0000 00000000 0000ffff 00008000
TR =0000 00000000 00000000 00000000
GDT=     0009f99c 0000001f
IDT=     00000000 000003ff
CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
CCS=00000000 CCD=0000f97e CCO=ADDB    
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000
Aborted

I don't see an obvious clue in the crash dump, so I'll wait until after speaking with mcb30 about bzImage. If we decide to go in a different direction, time spent debugging this would be wasted.

In the meantime I'll investigate real-mode GDB debugging. I already tried set architecture i8086 for 16-bit disassembly. GDB still treats memory as a flat 32-bit space and will probably require some address translation inside the GDB stub.
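
The translation in question is the usual real-mode rule: linear address = segment * 16 + offset. Below is a minimal sketch for doing that arithmetic by hand when pointing GDB at a real-mode location. The name seg2lin is made up for this example, and a stub-side implementation would need the equivalent logic in C.

```shell
#!/bin/sh
# Convert a real-mode segment:offset pair into a linear address.
# seg2lin is a hypothetical helper name, not part of gPXE or GDB.
seg2lin() {
    # Shift the segment left by 4 bits (multiply by 16), then add the offset.
    printf '0x%05x\n' $(( (($1) << 4) + ($2) ))
}
```

For example, `seg2lin 0x07c0 0x0000` prints `0x07c00`, which can then be used as a flat address in GDB, e.g. `x/8i 0x7c00`.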

Another concern is that loading gpxe.lkrn recursively fails. That potentially means you cannot load another zImage after gPXE has been loaded from gpxe.lkrn. ~~My theory is that gPXE has been loaded to the default zImage load address, i.e. 0x10000. If gPXE then tries to load another image there, it overwrites itself and crashes~~. It looks unlikely that gPXE is overwriting itself because it relocates as high up as possible.

Thu Jun 19

Git commit: [b44] Create skeleton driver for Broadcom 4401 NIC

Brought up ROM-o-matic for Etherboot top-of-git-tree. I have been occasionally assisting mdc with his ROM-o-matic.net online boot ROM generator. He recently enabled ROM-o-matic for gPXE top-of-git-tree. That way users can get ROMs for the latest development version of gPXE without having to set up a development environment and build from source. This is now also possible for Etherboot.

Beginning work to port Linux b44 (Broadcom 4401) driver. My laptop has a BCM4401-B0 NIC and is currently not supported by gPXE. The idea is to port the Linux driver to gPXE. I am looking forward to learning more about network drivers and device driver development in general.

Fri Jun 20

Git commit: [b44] Minimal TX path

The b44 driver is transmitting Ethernet frames. Thanks to Michael Decker's excellent gPXE Driver API Documentation I got the skeleton for the driver working very quickly last night. This morning I started porting the Linux b44 driver code.

After getting the initialization working (mainly by copy-paste) and reading the MAC address from the card, I decided to pursue the TX path. Getting transmit working early is useful since gPXE will attempt to do DHCP automatically and therefore needs to send packets.

Copy-pasting the Linux driver was not a good tactic since the Linux code is much more complex. Eventually I just focused on understanding how the hardware supports transmitting frames (there is no public documentation available!), and then implemented a simple TX path resembling the gPXE natsemi driver.

Sat Jun 21

Git commit: [b44] Working RX path

The b44 driver is receiving Ethernet frames. I just booted PXELINUX and HTTP-booted Linux 2.6.25 on this card for the first time! Getting the RX path working has been painful.

I think some of the Linux driver code is misleading/incorrect. Luckily there are drivers for OpenBSD, FreeBSD, and Solaris. Those drivers might even be based on the Linux driver, but they do some things differently and it helps to compare them to each other. My main issue with the RX path was a comment in the Linux driver claiming that the hardware writes a header structure 30 bytes before the DMA address of the I/O buffer.

This is false. The Linux driver does offset the DMA address by 30 bytes, but it also offsets the IO buffer by 30 bytes. In the end, it makes no difference and all that has happened is that 30 bytes of the IO buffer have been wasted. The header structure gets written to the DMA address, not before it.
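
The resulting buffer layout can be written down as simple arithmetic. This is a sketch only: RX_OFFSET reflects the 30-byte figure above, rx_layout is a made-up name, and the assumption that the frame data starts directly after those 30 bytes is not confirmed by any hardware documentation.

```shell
#!/bin/sh
# Hypothetical helper: show where the RX header area and the frame data sit
# relative to the DMA address programmed into an RX descriptor.
RX_OFFSET=30   # bytes reserved for the header area (from the text above)

rx_layout() {
    dma=$(($1))
    # The header is written AT the DMA address, not before it.
    printf 'header: 0x%x..0x%x\n' "$dma" $((dma + RX_OFFSET - 1))
    # Assumption: frame data follows immediately after the header area.
    printf 'frame:  0x%x\n' $((dma + RX_OFFSET))
}
```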

The next steps for the b44 driver are cleaning it up, making it robust, and testing. Most of the initialization code is straight from the Linux driver. I want to get to grips with it and then simplify it for gPXE.

I have omitted performance optimizations from the Linux driver. The Linux driver has a “copy threshold” which dictates whether to copy a received packet to a fresh IO buf to hand off to the network stack, or whether to remove the current IO buf from the RX ring and pass it straight to the network stack (and allocate a fresh IO buf for the RX ring). I'll talk to Balaji about performance measurement since he's been optimizing his USB driver.
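
The copy-threshold decision boils down to a single comparison on the packet length. Here is a rough sketch of that logic. The threshold value is purely illustrative, the names are made up, and the assumption that short packets are the ones copied follows the usual "copybreak" convention rather than anything verified against the b44 code.

```shell
#!/bin/sh
# Illustrative threshold in bytes; not taken from the Linux b44 driver.
COPY_THRESHOLD=256

# Print which strategy would be used for a received packet of length $1.
rx_strategy() {
    if [ "$1" -le "$COPY_THRESHOLD" ]; then
        # Short packet: copy into a fresh buffer, keep the ring buffer in place.
        echo copy
    else
        # Long packet: hand the ring buffer to the stack and refill the ring.
        echo handoff
    fi
}
```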

Lilo bzImage debugging still underway. I made a little bit of progress tonight by determining that the triple-fault happens in the call to install. I think that EIP goes crazy somewhere inside install and hence the triple fault. I'm sure the issue triggers inside install since I've placed infinite loops before and after the call. The loop after the call is never reached.

My current debugging cycle is to boot Damn Small Linux in QEMU, copy over my latest gpxe.lkrn, run lilo, and reboot into gPXE. This is slow and frustrating. I need to script it but my DSL install seems to be read-only.
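
One possible way to script the cycle is to run it from the host instead of inside DSL, reusing the lilo floppy image and the lilo.conf that the earlier boot-floppy script wrote to it. This is an untested sketch: the image name, mount point, loop device and lilo path are carried over from that script, `qemu -fda` is assumed as the way to boot the floppy image, and the helper names run and lilo_cycle are made up.

```shell
#!/bin/sh
# With DRY_RUN=1 the commands are printed instead of executed
# (the real commands need root for losetup and mount).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Refresh gpxe.lkrn on an existing lilo floppy image, rerun lilo, boot it.
lilo_cycle() {
    img=${1:-lilo.img}
    run losetup /dev/loop0 "$img" &&
    run mount /dev/loop0 /mnt &&
    run cp bin/gpxe.lkrn /mnt/gpxe.zi &&
    run /tmp/lilo/sbin/lilo -C /mnt/etc/lilo.conf &&
    run umount /mnt &&
    run losetup -d /dev/loop0 &&
    run qemu -fda "$img"
}
```

Calling `DRY_RUN=1 lilo_cycle lilo.img` prints the seven commands without touching the system, which is a cheap way to check the sequence before running it as root.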

Next week

On to Week 5.
