====== gPXE debugging ====== === Sat Jun 10 07:40:52 UTC 2006 === I've started to work on getting more drivers working in gPXE 0.5, and have come across an interesting problem. I am not sure what is causing the problem, so I need to debug it. It occurred to me that this is a perfect opportunity to show how the process works, and get more people involved. So, let's debug this together. I am doing this on real hardware. I have a client machine (our StUdly LinuxWorld demo PC, no less :), a server machine, and a serial cable from the client to another machine to capture gPXE output. So, let's debug an actual problem. I have no idea what the problem is, and what I'd like is for you (yes, you!) to help me to figure out what is going on. Now, note that I actually _enjoy_ debugging :) I enjoy it the way some people enjoy reading mystery novels. Rather than figuring out "who done it?", I'm figuring out "what is actually going on?". That to me is the secret and essence of good debugging. Rather than trying to "fix the bug", you understand what the code is doing, and you will begin to know how to make it do something better. So, let's get started. I ran Michael's AoE demo, see [[http://sourceforge.net/mailarchive/forum.php?thread_id=11136042&forum_id=48022]] which boots a Fedora image which lives on a server. It works fine, boots FC5, and uses the rtl8139 driver. When playing with drivers, it's good to have a case that works, because there are a lot of variables in the hardware/software equation. So, I decided to compile the same driver using gPXE from CVS, and run the test with my compiled driver. I enabled the serial port in src/config.h: #define CONSOLE_FIRMWARE /* Default BIOS console */ #define CONSOLE_SERIAL /* Serial port */ so that I could log Etherboot output to my laptop. So I have three machines in the mix, in an arrangement that looks something like this: +-Serial-Cable--9600-8N1-+ | | +--------+ +--------+ +--------+ | client | | server | | laptop | +--------+ +--------+ +--------+ \ | / \ | / \ | / \ | / +--------------------+ | 10/100 Switch | +--------------------+ The server is running vblade, which serves up an AoE partition on a network for other machines to access. I am also compiling the bin/rtl8139.dsk images from gPXE cvs source on the server. So, I downloaded the gPXE cvs source, and did: make DEBUG=rtl8139 bin/rtl8139.dsk Then put a formatted floppy in the server drive, and did: cat bin/rtl8139.dsk > /dev/fd0 I then put the disk the client, and booted it: This is the debugging output from gPXE: EEPROM is an AT93C46 MAC address is 00c04963458d RX ring at f6dfac4 Initialising AoE device e0.0 virt_to_bus ( pkb->data ) = f6e2000 align = 0 pkb_len=24 pad_len=1c TX id 0 at f6e2000+40 TX id 0 complete RX packet at offset 0+428 RX packet at offset 42c+6e virt_to_bus ( pkb->data ) = f6e2000 align = 0 pkb_len=24 pad_len=1c TX id 1 at f6e2000+40 TX id 1 complete RX packet at offset 4a0+428 RX packet at offset 8cc+40 virt_to_bus ( pkb->data ) = f6e2000 align = 0 pkb_len=24 pad_len=1c TX id 2 at f6e2000+40 TX id 2 complete RX packet at offset 910+428 RX packet at offset d3c+40 virt_to_bus ( pkb->data ) = f6e2000 align = 0 pkb_len=24 pad_len=1c TX id 3 at f6e2000+40 TX id 3 complete RX packet at offset d80+40 RX packet at offset dc4+428 RX packet at offset 11f0+40 RX packet at offset 1234+40 virt_to_bus ( pkb->data ) = f6e2000 align = 0 pkb_len=24 pad_len=1c TX id 0 at f6e2000+40 TX id 0 complete RX packet at offset 1278+428 Could not reach AoE device e0.0 Press any key to exit Now, the first question one might ask is, "What's it supposed to do?" Well, it's supposed to talk to the AoE server (vblade in this case), download an image, and run it. So, what does the protocol look like? Here's a packet trace from tcpdump. Note that we have the MAC (ethernetE hardware) address of the client, so we can tell tcpdump to give us a trace of just packets to and from the server that match the clients MAC address, which is: 00:c0:49:63:45:8d Here's what a tcpdump session looks like: $ sudo /usr/sbin/tcpdump ether host 00:c0:49:63:45:8d tcpdump: listening on eth0 13:30:58.858667 0:c0:49:63:45:8d Broadcast 88a2 64: 1000 0000 0000 ebeb 0001 0000 01ec 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:30:58.858887 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0001 0000 0140 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 5353 204e 4548 4552 2020 2020 2020 2020 2020 2020 0000 0000 0000 3156 0a30 2020 2020 6f43 6172 6469 13:30:59.266816 0:c0:49:63:45:8d Broadcast 88a2 64: 1000 0000 0000 ebeb 0002 0000 01ec 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:30:59.266915 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0002 0000 0140 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 5353 204e 4548 4552 2020 2020 2020 2020 2020 2020 0000 0000 0000 3156 0a30 2020 2020 6f43 6172 6469 13:31:00.003411 0:c0:49:63:45:8d Broadcast 88a2 64: 1000 0000 0000 ebeb 0003 0000 01ec 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:31:00.003513 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0003 0000 0140 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 5353 204e 4548 4552 2020 2020 2020 2020 2020 2020 0000 0000 0000 3156 0a30 2020 2020 6f43 6172 6469 13:31:01.540733 0:c0:49:63:45:8d Broadcast 88a2 64: 1000 0000 0000 ebeb 0004 0000 01ec 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:31:01.540832 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0004 0000 0140 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 5353 204e 4548 4552 2020 2020 2020 2020 2020 2020 0000 0000 0000 3156 0a30 2020 2020 6f43 6172 6469 13:31:04.616461 0:c0:49:63:45:8d Broadcast 88a2 64: 1000 0000 0000 ebeb 0005 0000 01ec 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:31:04.616565 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0005 0000 0140 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 5353 204e 4548 4552 2020 2020 2020 2020 2020 2020 0000 0000 0000 3156 0a30 2020 2020 6f43 6172 6469 Note that the client (00:c0:49:63:45:8d) keeps sending a broadcast to the server (00:11:11:77:aa:f3). The server responds, but the response is apparently not accepted by the client, because it continues broadcasting. So, where do we start? Well, one thing to ask is, "what does a good packet trace look like?". Here's one from the AoE demo test image: $ sudo /usr/sbin/tcpdump ether host 00:c0:49:63:45:8d tcpdump: listening on eth0 13:42:16.706926 0:c0:49:63:45:8d Broadcast 88a2 60: 1000 0000 0000 ebeb 0001 0000 01ec 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:42:16.707127 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0001 0000 0140 0000 00e0 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 5353 204e 4548 4552 2020 2020 2020 2020 2020 2020 0000 0000 0000 3156 0a30 2020 2020 6f43 6172 6469 13:42:16.792007 0:c0:49:63:45:8d 0:11:11:77:aa:f3 88a2 60: 1000 0000 0000 ebeb 0002 4000 0124 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:42:16.792109 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0002 4000 0040 0000 0000 0000 0000 faeb 3112 0000 4c49 4c4f 1606 1000 0100 007c 0000 0000 0000 0000 0000 5eac 08c0 7409 b40e bb07 00cd 10eb f2b9 1300 b486 cd15 cd18 31c0 8ed0 bc00 7cfb 13:42:16.792497 0:c0:49:63:45:8d 0:11:11:77:aa:f3 88a2 60: 1000 0000 0000 ebeb 0003 4000 0124 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:42:16.792574 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0003 4000 0040 0000 0000 0000 0000 faeb 3112 0000 4c49 4c4f 1606 1000 0100 007c 0000 0000 0000 0000 0000 5eac 08c0 7409 b40e bb07 00cd 10eb f2b9 1300 b486 cd15 cd18 31c0 8ed0 bc00 7cfb 13:42:16.792913 0:c0:49:63:45:8d 0:11:11:77:aa:f3 88a2 60: 1000 0000 0000 ebeb 0004 4000 0124 3f00 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:42:16.792977 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0004 4000 0040 3f00 0000 0000 0000 faeb 2101 b501 4c49 4c4f 1606 463e 7f44 0000 0000 1e07 7f44 9792 8e54 c100 8260 d010 0c00 b8c0 078e d0bc 0008 fb52 5306 56fc 8ed8 31ed 60b8 0012 b336 13:42:16.794842 0:c0:49:63:45:8d 0:11:11:77:aa:f3 88a2 60: 1000 0000 0000 ebeb 0005 4000 0124 d010 0c00 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 13:42:16.794912 0:11:11:77:aa:f3 0:c0:49:63:45:8d 88a2 1060: 1800 0000 0000 ebeb 0005 4000 0040 d010 0c00 0000 0000 d28e 0b00 ef8e 0b00 f08e 0b00 af8f 0b00 b08f 0b00 9b93 0b00 9c93 0b00 9d93 0b00 9e93 0b00 c710 0c00 c810 0c00 c910 0c00 ca10 0c00 cb10 0c00 cc10 0c00 Note that in this case, the client (00:c0:49:63:45:8d) only sends a single broadcast to the server (00:11:11:77:aa:f3). After that, it speaks directly to the server by address. So, Dear Reader, I ask you to suggest a next debugging step. I know what I would do, but let's do this together. I'm going to get some sleep (It's 2:15 am Saturday morning here), and when I come back, I'll continue debugging according to suggestions from this list. Now isn't this more interesting that reading along? :) You can look at the code, set up a debugging system like mine, draw from your own experience, or however you want. I'll look forward to any and all suggestions to understand what's going on here. Why isn't the client responding appropriately? Remember, this same setup works with a different floppy, so we can pretty safely say the hardware is working. So, what do you think we should do next (and why)? :)