gPXE debugging

Sat Jun 10 07:40:52 UTC 2006

I've started to work on getting more drivers working in gPXE 0.5, and have come across an interesting problem.

I am not sure what is causing the problem, so I need to debug it. It occurred to me that this is a perfect opportunity to show how the process works, and get more people involved.

So, let's debug this together. I am doing this on real hardware. I have a client machine (our StUdly LinuxWorld demo PC, no less :), a server machine, and a serial cable from the client to another machine to capture gPXE output.

So, let's debug an actual problem. I have no idea what the problem is, and what I'd like is for you (yes, you!) to help me to figure out what is going on.

Now, note that I actually _enjoy_ debugging :) I enjoy it the way some people enjoy reading mystery novels. Rather than figuring out “who done it?”, I'm figuring out “what is actually going on?”. That to me is the secret and essence of good debugging. Rather than trying to “fix the bug”, you understand what the code is doing, and you will begin to know how to make it do something better.

So, let's get started.

I ran Michael's AoE demo, see http://sourceforge.net/mailarchive/forum.php?thread_id=11136042&forum_id=48022 which boots a Fedora image which lives on a server. It works fine, boots FC5, and uses the rtl8139 driver. When playing with drivers, it's good to have a case that works, because there are a lot of variables in the hardware/software equation.

So, I decided to compile the same driver using gPXE from CVS, and run the test with my compiled driver. I enabled the serial port in src/config.h:

#define CONSOLE_FIRMWARE        /* Default BIOS console */
#define CONSOLE_SERIAL          /* Serial port */

so that I could log Etherboot output to my laptop.

So I have three machines in the mix, in an arrangement that looks something like this:

    +-Serial-Cable--9600-8N1-+
    |                        |
+--------+  +--------+  +--------+
| client |  | server |  | laptop |
+--------+  +--------+  +--------+
     \           |           /
      \          |          /
       \         |         /
        \        |        /
      +--------------------+
      |    10/100 Switch   |
      +--------------------+

The server is running vblade, which serves up an AoE partition on a network for other machines to access. I am also compiling the bin/rtl8139.dsk images from gPXE cvs source on the server.

So, I downloaded the gPXE cvs source, and did:

    make DEBUG=rtl8139 bin/rtl8139.dsk

Then put a formatted floppy in the server drive, and did:

    cat bin/rtl8139.dsk > /dev/fd0

I then put the disk the client, and booted it:

This is the debugging output from gPXE:

EEPROM is an AT93C46
MAC address is 00c04963458d
RX ring at f6dfac4
Initialising AoE device e0.0
virt_to_bus ( pkb->data ) = f6e2000 align = 0
pkb_len=24 pad_len=1c
TX id 0 at f6e2000+40
TX id 0 complete
RX packet at offset 0+428
RX packet at offset 42c+6e
virt_to_bus ( pkb->data ) = f6e2000 align = 0
pkb_len=24 pad_len=1c
TX id 1 at f6e2000+40
TX id 1 complete
RX packet at offset 4a0+428
RX packet at offset 8cc+40
virt_to_bus ( pkb->data ) = f6e2000 align = 0
pkb_len=24 pad_len=1c
TX id 2 at f6e2000+40
TX id 2 complete
RX packet at offset 910+428
RX packet at offset d3c+40
virt_to_bus ( pkb->data ) = f6e2000 align = 0
pkb_len=24 pad_len=1c
TX id 3 at f6e2000+40
TX id 3 complete
RX packet at offset d80+40
RX packet at offset dc4+428
RX packet at offset 11f0+40
RX packet at offset 1234+40
virt_to_bus ( pkb->data ) = f6e2000 align = 0
pkb_len=24 pad_len=1c
TX id 0 at f6e2000+40
TX id 0 complete
RX packet at offset 1278+428
Could not reach AoE device e0.0
Press any key to exit

Now, the first question one might ask is, “What's it supposed to do?”

Well, it's supposed to talk to the AoE server (vblade in this case), download an image, and run it.

So, what does the protocol look like?

Here's a packet trace from tcpdump. Note that we have the MAC (ethernetE hardware) address of the client, so we can tell tcpdump to give us a trace of just packets to and from the server that match the clients MAC address, which is:

    00:c0:49:63:45:8d

Here's what a tcpdump session looks like:

$ sudo /usr/sbin/tcpdump ether host 00:c0:49:63:45:8d
tcpdump: listening on eth0
13:30:58.858667 0:c0:49:63:45:8d Broadcast 88a2 64:
                         1000 0000 0000 ebeb 0001 0000 01ec 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000 0000
                         0000
13:30:58.858887 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0001 0000 0140 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 5353 204e 4548
                         4552 2020 2020 2020 2020 2020 2020 0000
                         0000 0000 3156 0a30 2020 2020 6f43 6172
                         6469
13:30:59.266816 0:c0:49:63:45:8d Broadcast 88a2 64:
                         1000 0000 0000 ebeb 0002 0000 01ec 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000 0000
                         0000
13:30:59.266915 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0002 0000 0140 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 5353 204e 4548
                         4552 2020 2020 2020 2020 2020 2020 0000
                         0000 0000 3156 0a30 2020 2020 6f43 6172
                         6469
13:31:00.003411 0:c0:49:63:45:8d Broadcast 88a2 64:
                         1000 0000 0000 ebeb 0003 0000 01ec 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000 0000
                         0000
13:31:00.003513 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0003 0000 0140 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 5353 204e 4548
                         4552 2020 2020 2020 2020 2020 2020 0000
                         0000 0000 3156 0a30 2020 2020 6f43 6172
                         6469
13:31:01.540733 0:c0:49:63:45:8d Broadcast 88a2 64:
                         1000 0000 0000 ebeb 0004 0000 01ec 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000 0000
                         0000
13:31:01.540832 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0004 0000 0140 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 5353 204e 4548
                         4552 2020 2020 2020 2020 2020 2020 0000
                         0000 0000 3156 0a30 2020 2020 6f43 6172
                         6469
13:31:04.616461 0:c0:49:63:45:8d Broadcast 88a2 64:
                         1000 0000 0000 ebeb 0005 0000 01ec 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000 0000
                         0000
13:31:04.616565 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0005 0000 0140 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 5353 204e 4548
                         4552 2020 2020 2020 2020 2020 2020 0000
                         0000 0000 3156 0a30 2020 2020 6f43 6172
                         6469

Note that the client (00:c0:49:63:45:8d) keeps sending a broadcast to the server (00:11:11:77:aa:f3). The server responds, but the response is apparently not accepted by the client, because it continues broadcasting.

So, where do we start? Well, one thing to ask is, “what does a good packet trace look like?”.

Here's one from the AoE demo test image:

$ sudo /usr/sbin/tcpdump ether host 00:c0:49:63:45:8d
tcpdump: listening on eth0
13:42:16.706926 0:c0:49:63:45:8d Broadcast 88a2 60:
                         1000 0000 0000 ebeb 0001 0000 01ec 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000
13:42:16.707127 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0001 0000 0140 0000
                         00e0 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 5353 204e 4548
                         4552 2020 2020 2020 2020 2020 2020 0000
                         0000 0000 3156 0a30 2020 2020 6f43 6172
                         6469
13:42:16.792007 0:c0:49:63:45:8d 0:11:11:77:aa:f3 88a2 60:
                         1000 0000 0000 ebeb 0002 4000 0124 0000
                         0000 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000
13:42:16.792109 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0002 4000 0040 0000
                         0000 0000 0000 faeb 3112 0000 4c49 4c4f
                         1606 1000 0100 007c 0000 0000 0000 0000
                         0000 5eac 08c0 7409 b40e bb07 00cd 10eb
                         f2b9 1300 b486 cd15 cd18 31c0 8ed0 bc00
                         7cfb
13:42:16.792497 0:c0:49:63:45:8d 0:11:11:77:aa:f3 88a2 60:
                         1000 0000 0000 ebeb 0003 4000 0124 0000
                         0000 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000
13:42:16.792574 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0003 4000 0040 0000
                         0000 0000 0000 faeb 3112 0000 4c49 4c4f
                         1606 1000 0100 007c 0000 0000 0000 0000
                         0000 5eac 08c0 7409 b40e bb07 00cd 10eb
                         f2b9 1300 b486 cd15 cd18 31c0 8ed0 bc00
                         7cfb
13:42:16.792913 0:c0:49:63:45:8d 0:11:11:77:aa:f3 88a2 60:
                         1000 0000 0000 ebeb 0004 4000 0124 3f00
                         0000 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000
13:42:16.792977 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0004 4000 0040 3f00
                         0000 0000 0000 faeb 2101 b501 4c49 4c4f
                         1606 463e 7f44 0000 0000 1e07 7f44 9792
                         8e54 c100 8260 d010 0c00 b8c0 078e d0bc
                         0008 fb52 5306 56fc 8ed8 31ed 60b8 0012
                         b336
13:42:16.794842 0:c0:49:63:45:8d 0:11:11:77:aa:f3 88a2 60:
                         1000 0000 0000 ebeb 0005 4000 0124 d010
                         0c00 0000 0000 0000 0000 0000 0000 0000
                         0000 0000 0000 0000 0000 0000 0000
13:42:16.794912 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060:
                         1800 0000 0000 ebeb 0005 4000 0040 d010
                         0c00 0000 0000 d28e 0b00 ef8e 0b00 f08e
                         0b00 af8f 0b00 b08f 0b00 9b93 0b00 9c93
                         0b00 9d93 0b00 9e93 0b00 c710 0c00 c810
                         0c00 c910 0c00 ca10 0c00 cb10 0c00 cc10
                         0c00

Note that in this case, the client (00:c0:49:63:45:8d) only sends a single broadcast to the server (00:11:11:77:aa:f3). After that, it speaks directly to the server by address.

So, Dear Reader, I ask you to suggest a next debugging step. I know what I would do, but let's do this together. I'm going to get some sleep (It's 2:15 am Saturday morning here), and when I come back, I'll continue debugging according to suggestions from this list.

Now isn't this more interesting that reading along? :) You can look at the code, set up a debugging system like mine, draw from your own experience, or however you want.

I'll look forward to any and all suggestions to understand what's going on here. Why isn't the client responding appropriately? Remember, this same setup works with a different floppy, so we can pretty safely say the hardware is working.

So, what do you think we should do next (and why)? :)